Using Scenario-Based Tasks to assess along a Learning Progression

Gabrielle Cayton-Hodges, ETS

Learning progressions (LPs; which are similar to learning trajectories) have been defined in many ways over decades of cognitive research and in different subject areas. At ETS, through the Cognitively Based Assessment of, for, and as Learning (CBAL™) initiative, we developed a single definition to be used across content areas: “a description of qualitative change in a student’s level of sophistication for a key concept, process, strategy, practice, or habit of mind. Change in student standing on such a progression may be due to a variety of factors, including maturation and instruction. Each progression is presumed to be modal—i.e., to hold for most, but not all, students. Finally, it is provisional, subject to empirical verification and theoretical challenge” (ETS, 2012).

The development and documentation of LPs can be useful in creating proper diagnostic tools for both formative and summative assessment. By developing research-based LPs intended for use in assessment and then building assessments around them, an assessment system can provide teachers with information regarding where their students are located on the progression and, from that, derive the information needed to move the students forward.

However, this is not always a simple and straightforward task. Since LPs often articulate not just what students can do, but also what students understand and what misconceptions they may have, a single item, even with a solution or explanation, may only give us a small part of the story.

Scenario-Based Tasks (SBTs) lead students through a larger, often real-world context in which they can apply various aspects of subject matter knowledge toward solving the problem. SBTs and Learning Progressions fit hand in hand: students demonstrate knowledge in context, and scaffolding can be applied as needed to determine the level at which a student can perform both with and without assistance. In mathematics, for example, since both SBTs and LPs are designed for one particular content area, they can be tailored to highlight a precise area of mathematics, and even specific LP levels, while filling in the other mathematical content knowledge needed for the particular problem. An SBT can also provide some amount of “branching”: students who do not need the scaffolding move on to show what they can do independently, while students who might otherwise be left floundering with an open-ended problem can be provided the results of the parts where they are stumbling, allowing them to advance to other aspects of the problem that they may be able to solve without difficulty.

While it is true that there are multiple approaches to assessment along a Learning Progression, and SBTs alone may not provide all of the information we need, the incorporation of SBTs into summative or formative assessment can certainly help us to fill some inevitable gaps.

For more information on the design and development of assessments that incorporate SBTs, see Oranje, Keehner, Persky, Cayton‐Hodges, and Feng (2016).



Educational Testing Service. (2012). Outline of provisional learning progressions. Retrieved from the CBAL English language arts (ELA) competency model and provisional learning progressions web site:

Oranje, A., Keehner, M., Persky, H., Cayton‐Hodges, G., & Feng, G. (2016). Educational Survey Assessments. In A. Rupp & Leighton, J. (Eds.), The Wiley Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications (pp. 427-445). Wiley-Blackwell.


Cognitively Diagnostic Feedback in Context

Maryam Wagner, McGill University

In general, feedback is information that is provided to learners following assessment. Arguably, feedback has the most impact and potential for contributing to advancing learning when it is used formatively, because its primary purpose is aimed at modifying learners’ thinking or behaviour (Nicol & MacFarlane-Dick, 2006; Sadler, 1998; Shute, 2008). Cognitively diagnostic feedback (CDF) (Jang & Wagner, 2014; Wagner, 2015) brings together this formative potential alongside cognitively-based theories of diagnostic assessment (Alderson, 2005; Hartz & Roussos, 2008; Huhta, 2010; Jang, 2005; Leighton & Gierl, 2007; Nichols, Chipman & Brennan, 1995). CDF targets gaps in learners’ cognitive processing and strategy use rather than knowledge gaps.

The characteristics of CDF can be discussed across several domains, including purpose, content, and grain size (Jang & Wagner, 2014). The purpose of CDF is ultimately to advance learners’ self-regulated learning through the provision of feedback that addresses conceptual errors, cognitive gaps, and strategy use. In purpose and content, CDF contrasts with feedback that delivers holistic, outcome-based judgements. Another goal of CDF is to provide feedback that is fine-grained, rather than coarse or so excessively detailed that learners’ attention is drawn to micro aspects of their work. For example, CDF on writing would provide sub-skill-specific feedback (e.g., on vocabulary use, content generation, organizational strategies) focusing on learners’ strengths and areas for improvement, rather than identifying typographical errors or misplaced commas (Wagner, 2015). A question that I have been grappling with recently is the extent to which the provision of feedback, and more specifically CDF, would or should be impacted by context.

I am a new scholar. My development as a researcher has focused primarily on assessment in classroom-based educational settings. I have recently shifted my focus to include assessment in workplace-based contexts. Workplace-based contexts are characterized as ‘real-life’ settings in which learners are engaged in on-the-job tasks (Hamdy, 2009). Some examples include training contexts for physicians, nurses, and pilots. There are numerous similarities between these workplace-based contexts and traditional classroom-based learning environments. For example, both contexts provide opportunities for in vivo or in situ assessments, wherein teachers directly observe tasks in the setting in which they are used (Hamdy, 2009; Wigglesworth, 2008). Another commonality is that in both contexts the curriculum, teaching, and assessment need to be aligned to advance learning, and feedback needs to be delivered during and/or after assessment tasks (Norcini & Burch, 2007). Numerous other similarities exist; however, two of the primary differences between these two assessment contexts are: 1) the characteristics of the tasks; and 2) the agents delivering the feedback (Greenberg, 2012). Table 1 summarizes the similarities and differences across these domains.

Table 1.

Task and Agent Characteristics in Workplace- and Classroom-Based Assessment Contexts

Workplace-Based

Task Characteristics:
- Primarily performance-based
- Setting and content authentic to real-life situations (defines relationship between task and performance) (Bachman & Palmer, 1996; Wigglesworth, 2008)
- Tasks serve as tools for eliciting samples for assessment and provision of feedback

Agent Characteristics:
- Assessment and subsequent provision of feedback is the primary responsibility of content experts (Greenberg, 2012)
- Assessments are driven by external stakeholders who define requisite knowledge and skills

Classroom-Based

Task Characteristics:
- Variety of task types employed, including performance-based, essays, and portfolios
- Struggles to balance authenticity with generalizability of outcomes to specific contexts (Wigglesworth, 2008)
- Tasks serve as tools for eliciting samples for assessment and provision of feedback

Agent Characteristics:
- Assessment and provision of feedback is the primary responsibility of task experts (Greenberg, 2012)
- Teachers drive assessment and the type of feedback generated
- Frequently employs peer- and self-assessments

Tasks serve a similar function in both contexts: they are primarily used for eliciting evidence of learning and generating opportunities for feedback. However, the nature of the tasks is not necessarily identical. While workplace-based settings employ primarily performance-based tasks that replicate real life, classroom-based contexts use a variety of tasks but struggle with the authenticity of some task types to real-world settings. Therefore, the delivery of CDF would not necessarily be influenced by the context; rather, the opportunities to provide it could be affected, as there is generally more variety in task types in classroom-based contexts. This variability arguably provides more diversity in the types of activities in which learners are engaged, and thus different opportunities for observing and generating information about learners’ strengths and areas for improvement.

The primary difference between the feedback providers in the two contexts is their knowledge and expertise. In workplace-based contexts, the agents are primarily content experts, while in classroom-based contexts, the agents are more likely to be task experts. Again, while both contexts engage learners in tasks that could be used to generate and deliver CDF, the differences in the agents might affect the content of the feedback and whether emphasis or priority is placed on some facets (based on the agents’ knowledge and expertise).

My transition to a new research context has provided rich opportunities for work, exploration, and investigation of educational issues, including cognitively diagnostic feedback, which extend across contexts. I greatly welcome the opportunity to connect with anyone interested in discussing these topics further. Please email me:


Alderson, J. C. (2005). Diagnosing foreign language proficiency: the interface between learning and assessment. London: Continuum.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford University Press.

Greenberg, I. (2012). ESL Needs Analysis and Assessment in the Workplace. In P. Davidson, B. O’Sullivan, C. Coombe, & S. Stoynoff (Eds.), The Cambridge guide to second language assessment (pp. 178-181). Cambridge University Press.

Hamdy, H. (2009). AMEE Guide Supplements: Workplace-based assessment as an educational tool. Guide supplement 31.1–Viewpoint. Medical Teacher, 31(1), 59-60.

Hartz, S., & Roussos, L. (2008). The fusion model for skills building diagnosis: Blending theory with practicality (Report No. RR-08-71). Princeton, NJ: Educational Testing Service. Retrieved from

Huhta, A. (2010). Diagnostic and formative assessment. In B. Spolsky & F.M. Hult (Eds.), The handbook of educational linguistics (pp. 469-482). Oxford: Wiley-Blackwell

Jang, E. E. (2005). A validity narrative: the effects of cognitive reading skills diagnosis on ESL adult learners’ reading comprehension ability in the context of Next Generation TOEFL. Unpublished doctoral dissertation. University of Illinois at Urbana Champaign.

Jang, E. E., & Wagner, M. (2014). Diagnostic feedback in the classroom. In A.J. Kunnan (Ed.), Companion to Language Assessment, (pp. 693-711). Wiley-Blackwell.

Leighton, J. P., & Gierl, M. J. (Ed.). (2007). Cognitive diagnostic assessment for education: Theory and practices. Cambridge: Cambridge University Press.

Nichols, P. D., Chipman, S. F., & Brennan, R. L. (Ed.). (1995). Cognitively diagnostic assessment. NJ: Lawrence Erlbaum.

Nicol, D.J., & MacFarlane-Dick, D. (2006). Formative assessment and self-regulated learning: A model and seven principles of good feedback practice. Studies in Higher Education, 31(2), 199-218.

Norcini, J., & Burch, V. (2007). Workplace-based assessment as an educational tool: AMEE Guide No. 31. Medical Teacher, 29(9-10), 855-871.

Sadler, D. R. (1998). Formative assessment: Revisiting the territory. Assessment in Education, 5(1), 77-84.

Shute, V. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153-189.

Wagner, M. (2015). The centrality of cognitively diagnostic assessment for advancing secondary school ESL students’ writing: A mixed methods study (Unpublished doctoral dissertation). Ontario Institute for Studies in Education/University of Toronto, Toronto, Ontario, Canada.

Wigglesworth, G. (2008). Task and performance based assessment. In Encyclopedia of language and education (pp. 2251-2262). Springer US.

Metaphors for Learning and Psychometrics

Laine Bradshaw, University of Georgia

What metaphor do you use to describe learning? What role does psychometrics play in building your narrative around this metaphor? In the United States, the dominant discourse on student learning relies on a pervasive metaphor. We use the metaphor of learning as travel in a physical space along a straight path, where moving “forward” and being “ahead” is good progress and staying “back” or being left “behind” is bad (Parks, 2010). The metaphor is so common, we can almost forget we are speaking metaphorically when we describe students using this language.

A number of years back, I listened as a researcher discussed her observations that classroom teachers nearly completely rely on the use of this metaphor to describe their students’ understandings, instead of using content- or concept-specific language to describe what they understand. My immediate observation was that the psychometric modeling framework ubiquitously used in large-scale testing in education is in sync with this narrative.

It made me wonder: Do unidimensional assessments that locate an overall student ability on a line reinforce this narrative? Or is that what is prompting this narrative? Or is it a little bit of both? I still don’t know. But I am working with a team of educators in local school districts to create a through-year, formative assessment system based on diagnostic classification models (DCMs), and I am eager to see how teachers’—and administrators’ and students’—conversations around students’ understandings and progress may change when the multidimensional diagnostic assessment results are readily available to them.

My hope is that the DCM-based design will encourage more of a “toolbox” (Parks, 2010) metaphor to represent student learning, where students develop differently from each other but those differences need not be ordered for comparison. Using this approach, I hope the discourse prompted by the assessments is centered on celebrating the acquisition of shiny new tools and setting clear goals to acquire specific tools students still need. I’m interested to hear about your experiences and ideas around the roles that assessments and psychometrics have played in shaping discourse about student learning in the classrooms and schools where you have worked. Any other metaphors we should be thinking about? I discuss this topic, and provide an introduction to DCMs, in my chapter in the new Handbook on Cognition and Assessment. You can check it out here:

Bradshaw, L. (2016). Diagnostic classification models. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (pp. 297-327). Chichester, West Sussex: Wiley-Blackwell.

How to assess hard-to-measure constructs like creativity?

Valerie Shute

A few examples of hard‐to‐measure constructs that we’ve assessed in our own work lately include creativity (see Kim & Shute, 2015; Shute & Wang, 2016), problem solving (see Shute, Ventura, & Ke, 2015; Shute, Wang, Greiff, Zhao, & Moore, 2016), persistence (see Ventura & Shute, 2013), systems thinking (see Shute, Masduki, & Donmez, 2010), gaming‐the‐system (see Wang, Kim, & Shute, 2013), and design thinking (see Razzouk & Shute, 2012), among others. In this blog, I’d like to describe the approach we used to measure creativity, specifically within the context of a digital game called Physics Playground. My premise is that good games, coupled with evidence‐based embedded assessment, show promise as a means to dynamically assess hard‐to‐measure constructs more accurately and much more engagingly than traditional approaches. This information, then, can be used to close the loop (i.e., use the estimates as the basis for providing targeted support).

Most of us would agree that creativity is a really valuable skill—in school, on the job, and throughout life. But it’s also particularly hard to measure, for various reasons. For instance, there’s no clear and agreed‐upon definition, and it has psychological and statistical multidimensionality. The generality of the construct is also unclear (e.g., is there a single “creativity” variable, or is it solely dependent on the context?). Finally, a common way to measure creativity is through self‐report measures, where data are easy to collect but the measures are unfortunately flawed. That is, self‐report measures are subject to “social desirability effects” that can lead to false reports of the construct being assessed. In addition, people may interpret specific self‐report items differently, leading to unreliability and lower validity.

To accomplish our goal of measuring creativity based on gameplay data, we followed the series of steps outlined and illustrated in Shute, Ke, and Wang (in press): (1) Develop a competency model (CM) of targeted knowledge, skills, or other attributes based on full literature and expert reviews; (2) Determine the game (or learning environment) into which the stealth assessment will be embedded; (3) Compile a full list of relevant gameplay actions/indicators that serve as evidence to inform the CM variables; (4) Create new tasks in the game, if necessary (task model); (5) Create a Q-matrix to link actions/indicators to relevant facets of target competencies; (6) Determine how to score indicators using classification into discrete categories, which comprises the “scoring rules” part of the evidence model (EM); (7) Establish statistical relationships between each indicator and associated levels of competency variables (EM); (8) Pilot test Bayesian Networks (BNs) and modify parameters; (9) Validate the stealth assessment with external measures; and (10) Use the current estimates of a player’s competency states to provide adaptive learning support (e.g., targeted formative feedback, progressively harder levels relative to the player’s abilities, and so on).

In line with this process, the first thing we did once we decided to measure creativity was to conduct an extensive literature review. Based on the literature, we defined creativity as encompassing three main facets: fluency, flexibility, and originality. Fluency refers to the ability to produce a large number of ideas (also known as divergent thinking and brainstorming); flexibility is the ability to synthesize ideas from different domains or categories (i.e., the opposite of functional fixedness); and originality means that ideas are novel and relevant. There are other dispositional constructs that are an important part of creativity (i.e., openness to new experiences, willingness to take risks, and tolerance for ambiguity) but due to the nature of the game we were using as the vehicle for the assessment, we decided to focus on the cognitive skills of creativity. We shared the competency model with two well-known creativity experts, and revised the model accordingly.

Next, we brainstormed indicators (i.e., specific gameplay behaviors) associated with each of the main facets. For instance, flexibility is the opposite of functional fixedness and represents one’s ability to switch things up while solving a problem in the game. The game knows, per level, the simple machines (or “agents,” as they’re called in the game) that are appropriate for a solution. So one flexibility indicator in the game would be the degree to which a player sticks with an inappropriate agent across solution attempts, which is reverse coded. We amassed all variables (creativity node and related facets) and associated indicators in an augmented Q-matrix, which additionally contained all discrimination and difficulty parameters for each indicator in each level (see Almond, 2010; Shute & Wang, 2016). In the basic format of the Q-matrix, the rows represent the indicators relevant for problem solving in each level and the columns represent the main facets of creativity. If an indicator is relevant to a skill, the value of the cell is “1”; otherwise it is “0.” Translating the Q-matrix into Bayes nets (our statistical machinery for accumulating evidence across gameplay) involves using Almond’s CPTtools.
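To make the basic Q-matrix structure concrete, here is a minimal sketch in Python. The indicator names and the 0/1 assignments below are hypothetical, invented purely for illustration; the actual indicators, and the augmented Q-matrix with discrimination and difficulty parameters, are documented in Shute and Wang (2016).

```python
# Toy Q-matrix sketch (illustrative only; indicators are hypothetical).
# Rows: gameplay indicators; columns: the three creativity facets.
# A "1" means the indicator provides evidence about that facet.

FACETS = ["fluency", "flexibility", "originality"]

Q_MATRIX = {
    "num_distinct_solution_attempts": [1, 0, 0],  # fluency
    "stuck_with_inappropriate_agent": [0, 1, 0],  # flexibility (reverse coded)
    "used_uncommon_agent_for_level":  [0, 0, 1],  # originality
}

def facets_informed_by(indicator):
    """Return the facets an observed indicator provides evidence for."""
    return [f for f, flag in zip(FACETS, Q_MATRIX[indicator]) if flag]

print(facets_informed_by("stuck_with_inappropriate_agent"))  # ['flexibility']
```

In a real stealth assessment, each "1" cell would additionally carry the discrimination and difficulty parameters mentioned above, and the mapping would feed the conditional probability tables of the Bayes nets.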

Based on our experiences to date in designing valid assessments of hard-to-measure constructs in game environments, I feel that it’s best to bring together educators, game designers, and assessment experts to work together from the onset. This type of diverse team is a critical part of creating an effective learning ecosystem. Having a shared understanding of educational and gaming goals is key to moving forward with the design of engaging, educational games. The next step in this process (which I’m trying, via multiple research proposals, to get funded), after establishing the validity of the creativity assessment and its particular facets, is determining how we can support creativity’s development.

Shute, V., & Wang, L. (2016). Assessing and supporting hard-to-measure constructs in video games. In A. A. Rupp and J. P. Leighton (Eds.), The Handbook on cognition and assessment: Frameworks, methodologies, & applications (pp. 535-563). West Sussex, UK: Wiley.


Principled Approaches to Assessment Design, Development, and Implementation: Illuminating Examinee Thinking

Steve Ferrara

As assessment theory and practice have evolved in the last 15 years, assessment designers and psychometricians have developed measurement models and practices that help us illuminate examinee processing of test items and other assessment activities. These developments include cognitive-diagnostic modeling (e.g., Rupp, Templin, & Henson, 2010), models of examinee thinking (e.g., Leighton & Gierl, 2007), and principled approaches to assessment design, development, and implementation (e.g., Ferrara, Lai, Reilly, & Nichols, 2016). The common goal of illuminating examinee thinking is to enhance test score interpretation by uncovering why examinees responded successfully or unsuccessfully to items on an assessment, in addition to whether they responded successfully. Test item development is as much art as it is science—maybe a “dark art,” as John Bormuth (1970) called it long ago. Principled approaches to design, development, and implementation shed light on examinee thinking and bring more science into the art of test item development. How so?

The most widely known principled approaches—Evidence-Centered Design, Cognitive Design Systems, Assessment Engineering, the BEAR Assessment System, and Principled Design for Efficacy—share common, foundational elements. These elements focus sharply on examinee thinking, with the goal of enhancing the validity of inferences we make from test scores about what examinees know and can do as well as their degree of achievement and competence. Table 1 from our chapter, Principled Approaches to Assessment Design, Development, and Implementation: Cognition in Score Interpretation and Use, makes the common elements plain.


Table 1

Foundation and Organizing Elements of Principled Approaches to Assessment Design, Development, and Implementation and their Relationship to the Assessment Triangle

Organizing Element
- Ongoing accumulation of evidence to support validity arguments (Assessment Triangle alignment: overall evidentiary reasoning goal)

Foundational Elements
- Clearly defined assessment targets (Cognition)
- Statement of intended score interpretations and uses (Cognition)
- Model of cognition, learning, or performance (Cognition)
- Aligned measurement models and reporting scales (Interpretation)
- Manipulation of assessment activities to align with assessment targets and intended score interpretations and uses (Observation)

From Ferrara et al. (2016). Used with permission.

The dual focus on cognition and gathering evidence to support intended score interpretations and uses enables assessment designers to illuminate examinee thinking. For example, specifying a model of cognition, learning, or performance helps guide design and development of assessment activities and ensures that assessment activities are aligned with the model of thinking and measurement models that undergird a test’s score reporting scale.

Test development organizations are incorporating these principles into standard practice, probably incrementally rather than by tearing down current practices and retooling to implement new ones. Some tools of principled approaches (for example, task models and templates) may be in wide use as a way of providing more detailed item specifications than in the past. My co-authors and I have provided a framework for integrating principled practices into existing practices. We describe the framework, Principled Design for Efficacy (PDE), in our chapter. PDE is not intended as a competitor to or replacement for more widely known models like ECD; rather, we offer it as a framework for making incremental insertions into current practices. Figure 3 illustrates how assessment designers and program managers can build principled practices into existing ones.

Figure 3. Conventional processes (white boxes) and processes based on principled approaches (foundational elements are numbered in the three boxes with grey background) for assessment design, development, and implementation, showing overlap and differences. From Ferrara et al. (2016). Used with permission.

There is general agreement in the testing field that illuminating examinee cognition improves assessment design and development, improves the quality of assessments and the information they provide about examinees, and enhances evidence for validity arguments that support intended score interpretations and uses. The Handbook on Cognition and Assessment shows us how to get there.



Bormuth, J. R. (1970). On the theory of achievement test items. Chicago: The University of Chicago Press.

Ferrara, S., Lai, E., Reilly, A., & Nichols, P. (2016). Principled approaches to assessment design, development, and implementation: Cognition in score interpretation and use. In A. A. Rupp and J. P. Leighton (Eds.), The Handbook on cognition and assessment: Frameworks, methodologies, & applications (pp. 41-74). Malden, MA: Wiley.

Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees’ thinking processes. Educational Measurement: Issues and Practice, 26(2) 3-16.

Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: theory, methods, and applications. New York: Guilford Press.



Deep Learning, Measurement, and Bayesian Networks



José P. González-Brenes

If you have been reading the popular press that reports on technology, you probably have read about the hottest Artificial Intelligence trend—Deep Learning! It’s everywhere in smart computing—self-driving cars, computers that play strategy games; the list goes on and on. But, as some critics have pointed out, Deep Learning can behave like a black box model. This means that although Deep Learning can be extremely predictive, it provides little insight into the phenomenon being modeled. In other words, the model learns something, but the humans who built the model cannot explain what was learned.

This is where the cool and sophisticated uncle of Deep Learning comes into play—Bayesian Networks. Bayesian Networks have been tested time and again in educational applications. You can design them to be shallow, and they are interpretable and rigorous enough to drive the statistics needed to design a high-stakes application. This is an important difference between Bayesian Networks and Deep Learning.

Consider this—a loved one is taking an assessment that will determine whether they are admitted to college. It is unlikely that measurement scientists are going to rely on Deep Learning. High-stakes instruments should not be based on statistical models that nobody can explain. In other words, educational products need accountability. Bayesian Networks allow us to achieve as much interpretability as needed. However, you can also design deep Bayesian Networks, and then they start to resemble the Deep Learning methods that are so en vogue.
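To make the interpretability point concrete, here is a deliberately tiny sketch (not from our chapter; all probabilities are invented for illustration) of the kind of inference a Bayesian Network performs: a single latent mastery node and one observed item response, updated by Bayes’ rule. Every quantity in the computation can be inspected and explained, which is exactly what a black box model does not offer.

```python
# Minimal two-node Bayesian network sketch (illustrative, made-up numbers):
# latent node "mastery" -> observed node "item response correct?"

P_MASTERY = 0.5            # prior P(mastery)
P_CORRECT_IF_MASTER = 0.9  # P(correct | mastery), i.e., 1 - slip
P_CORRECT_IF_NOT = 0.2     # P(correct | no mastery), i.e., guess

def posterior_mastery(correct: bool) -> float:
    """P(mastery | observed response), computed by Bayes' rule."""
    like_m = P_CORRECT_IF_MASTER if correct else 1 - P_CORRECT_IF_MASTER
    like_n = P_CORRECT_IF_NOT if correct else 1 - P_CORRECT_IF_NOT
    numerator = like_m * P_MASTERY
    return numerator / (numerator + like_n * (1 - P_MASTERY))

print(round(posterior_mastery(True), 3))   # 0.818
print(round(posterior_mastery(False), 3))  # 0.111
```

Real assessment networks chain many such nodes, and a deep Bayesian Network stacks layers of latent variables, but the inference at each node remains this transparent.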

If you want to learn more about the role of Bayesian Networks in assessment, you may be interested in checking out our chapter in the Handbook of Cognition and Assessment! In the chapter we discuss the conceptual foundations of Bayesian networks, walk through their basic graphical and formula representations, and discuss their different generalizations as a measurement framework:

González-Brenes, J. P., Behrens, J. T., Mislevy, R. J., Levy, R., & DiCerbo, K. E. (2016). Bayesian networks. In A. A. Rupp & J. P. Leighton (Eds.), The handbook of cognition and assessment: Frameworks, methodologies, and applications (pp. 328-353). Chichester, West Sussex: Wiley-Blackwell.

Moving from a Craft to a Science in Assessment Design

Guest post by Paul Nichols, ACT

This is one of a series of blog posts from chapter authors from the new Handbook of Cognition and Assessment.  See Beginning of a Series: Cognition and Assessment Handbook for more details.

In our chapter in the Handbook, we present and illustrate criteria for evaluating the extent to which theories of learning and cognition, and the associated research, when used within a principled assessment design (PAD) approach, support explicitly connecting and coordinating the three elements comprising the assessment triangle. Those elements are: a theory or set of beliefs about how students think and develop competence in a domain (cognition); the content used to elicit evidence about those aspects of learning and cognition (observation); and the methods used to analyze and make inferences from the evidence (interpretation). Some writers have cautioned that the three elements must be explicitly connected and coordinated during assessment design and development, or the validity of the inferences drawn from the assessment results will be compromised.

We claimed that one criterion for evaluating the fitness of theories of learning and cognition to inform assessment design and development is the extent to which those theories facilitate the identification of content features that accurately and consistently elicit the targeted knowledge and skills at the targeted levels of complexity. PAD approaches attempt to engineer intended interpretations and uses of assessment results through the explicit manipulation of the content features that tend to effectively elicit the targeted knowledge and skills at the targeted complexity levels. From the perspective of PAD, theories of learning and cognition, along with the empirical research associated with them, should inform the identification of those important content features.

The claim from a PAD perspective is that training item writers to intentionally manipulate characteristic and variable content features enables them to systematically manipulate these features when creating items and tasks. Subsequently, items and tasks with these different features will elicit the kind of thinking and problem solving, at the levels of complexity, intended by the item assignment. But I have no scientific evidence supporting this claim. I have only rational arguments (e.g., if item writers understand the critical content features, then they will use them) and anecdotes (e.g., item writers told me they found the training helpful) to support it.

An approach that might help me and other researchers gather evidence with regard to such claims is called design science. Design science is fundamentally a problem-solving paradigm: the scientific study and creation of artefacts as they are developed and used by people, with the goal of solving problems and improving practices in people's lives. In contrast to natural entities, artefacts are objects (such as tests), conceptual entities (such as growth models, scoring algorithms, or PAD), or processes (such as standard-setting methods) created by people to solve practical problems. The goal of design science is to generate and test hypotheses about generic artefacts as solutions to practical problems. Design science research deals with planning, demonstrating, and empirically evaluating those generic artefacts. A main purpose of design science is to support designers and researchers in making knowledge of artefact creation explicit, thereby moving design from craft to science.

ACT is hosting a conference on educational and psychological assessment and design science this summer at their Iowa City, IA, headquarters. A small group of innovators in assessment are coming together to consider the potential of design science to aid assessment designers in designing and developing the next generation of assessments. Look for the findings from that conference at AERA or NCME in 2017.


Beginning of a Series: Cognition and Assessment Handbook

Dear SIG colleagues,

We are incredibly excited to share with you information regarding a new book that we have co-edited with contributions from several members of our SIG as well as other colleagues in the field. It is called the Handbook of Cognition and Assessment: Frameworks, Methodologies, and Applications and will appear in Britain and the US in the fall. It is close to 600 pages long and includes a total of 22 chapters divided into 3 sections (Frameworks – 9 chapters, Methodologies – 6 chapters, and Applications – 7 chapters). In addition to the core invited chapters, we wrote thoughtful introductory and synthesis chapters that situate the book within current developments in the field of cognitively-grounded assessment. We also created a glossary with key terms, including both those that are used repeatedly across chapters and those that are of particular importance to specific chapters, and have worked with all authors to arrive at appropriate consensus definitions. Throughout the book, we highlighted all of these terms by boldfacing them upon first mention in each chapter.

We probably do not have to convince you that the publication of such a handbook is timely and we sincerely hope that it will be considered a useful reference for a diverse audience. To us, this audience includes professors, students, and postdocs, assessment specialists working at testing companies or for governmental agencies, and really anyone who wants to update their knowledge on recent thinking in this area. In order to help reach those audiences we worked closely with authors to keep the level of conceptual and technical complexity at a comparable level across chapters so that interested colleagues have
the ability to read individual chapters, entire sections, or even the full book. The list of contributors in this book is really quite amazing and, by itself, presents a great resource of expertise that you may want to consider the next time you are looking for advice on
a project or for someone on your advisory panel.

In order to get you all even more excited about the handbook we have invited our contributing authors to share their current thoughts via upcoming blog entries for the SIG so look out for those in the coming months! It also goes without saying that our book is only one of many artifacts that helps to synthesize ideas and build conceptual and
practical bridges across communities. Therefore, if you have additional suggestions for us regarding what kinds of efforts we could initiate around the handbook please let us know – we would love to hear from you! Finally, if you like the Handbook, we would
certainly appreciate it if you could share information about it on social media, perhaps for now by sharing the cover image below. Many thanks in advance and thank you very much for your interest!

André A. Rupp and Jacqueline P. Leighton

(Co-editors, Handbook of Cognition and Assessment)

Paper Rubric for 2017 AERA

As a measurement specialist, I’ve always found the AERA evaluation rubric to be a bit minimal. AERA provides the names of the scales, but little information about what goes into them. Some of that is because different divisions and SIGs have very different ideas of what constitutes research (qualitative, quantitative, methodological, literature synthesis). We as a SIG can do better. So please help me out with an experiment in this post.

AERA defines six scales for us (see below). The goal of this post is to provide a first draft of a rubric for those six areas. I’m roughly following a methodology from Mark Wilson’s BEAR system, particularly the construct maps (Wilson, 2004). As I’ve been teaching the method, I take the scale and divide it up into High, Medium, and Low areas, and then think about what kind of evidence I might see that (in this case) a paper was at that level on the indicated criterion. I only define three anchor points, with the idea that a five-point scale can interpolate between them.

In all such cases, it is usually easier to edit a draft than to create such a scale from scratch. So I’ve drafted rubrics for the six criteria that AERA gives us. These are very much drafts, and I hope to get lots of feedback in the comments about things that I left out, should not have included, or put in the wrong place. In many cases the AERA scale labels are deliberately vague so as to not exclude particular kinds of research. In these cases, I’ve often picked the label that would most often apply to Cognition and Assessment papers, with the idea that it would be interpreted liberally in cases where it didn’t quite fit.

Here they are:

1. Objectives and Purposes
High (Critically Significant) Objectives and purposes are clearly stated. Objectives and purposes involve both cognition and assessment. Objectives touch on issues central to the field of Cognition and Assessment.
Medium Objectives and purposes can be inferred from the paper. Objectives and purposes involve either cognition or assessment. Objectives touch on issues somewhat related to the field of Cognition and Assessment.
Low (Insignificant) Objectives and purposes are unclear. Objectives and purposes are only tangentially related to cognition or assessment. Objectives touch on issues unrelated to the field of Cognition and Assessment.
2. Perspective(s) or Theoretical Framework
High (Well Articulated) Both cognitive and measurement frameworks are clearly stated and free from major errors. The cognitive and measurement frameworks are complementary. Framework is supported by appropriate review of the literature.
Medium Only one of the cognitive and measurement perspectives is clearly stated, or both are implicit. If errors are present, they are minor and easily corrected. Fit between cognitive and measurement frameworks is not well justified. Framework is not well supported by the literature review.
Low (Not Articulated) Cognitive and measurement frameworks are unclear or have major substantive errors. There is a lack of fit between the cognitive and measurement models. Literature review clearly misses key references.

Note that implicit in this rubric is the idea that a Cognition and Assessment paper should both have a cognitive framework and a measurement framework.

3. Methods, Techniques or Modes of Inquiry
High (Well Executed) Techniques are clearly and correctly described. Techniques are appropriate for the framework. Appropriate evaluation of the methods is included in the paper.
Medium Techniques are described, but possibly only implicitly, or there are only easily corrected errors in the techniques. Techniques are mostly appropriate for the framework. Minimal evaluation of the methods is included in the paper.
Low (Not Well Executed) Techniques are not clearly described, implicitly or explicitly, or there are significant errors in the methods. Techniques are not appropriate for the framework. No evaluation of the methods is included in the paper.

I’ve sort of interchangeably used techniques and methods to stand for Methods, Techniques or Modes of Inquiry.

4. Data Sources, Evidence Objects, or Materials
High (Appropriate) Data are clearly described. Data are appropriate for the framework and methods. Limitations of the data (e.g., small sample, convenience sample) are clearly acknowledged.
Medium Data source is only partially described. Data are not clearly aligned with the purposes/methods. Limitations of the data are incompletely acknowledged.
Low (Inappropriate) Data description is unclear. Data are clearly inappropriate given the purpose, framework, and/or methods. Data have obvious, unacknowledged limitations.

Here data (note that data is a plural noun) have to be interpreted liberally to incorporate traditional assessment data, simulation results, participant observations, literature reviews, and other evidence to support the claims of the paper.

5. Results and/or substantiated conclusions or warrants for arguments/points of view
High (Well Grounded) Results are clearly presented, with appropriate standard errors or description of expected results is clear. Success criteria are clearly stated. It would be possible for the results to falsify the claims. If results are available, conclusions are appropriate from the results.
Medium Results are somewhat clearly presented or expected results are likely to be appropriate. Success criteria are implicit. There are many researcher degrees of freedom in the analysis, so the results presented are likely to be the ones that most strongly support the claims. If results are available, the conclusions are mostly appropriate.
Low (Ungrounded) Results are unclear, or the expected results are unclear. Success criteria are not stated and (are/could be) determined after the fact. The results are not capable of refuting the claim, or there are so many researcher degrees of freedom that the results can always be made to support the claim. If results are present, the conclusion is inappropriate.

I’ve tried to carefully word this so that it is clear that both papers in which the results are present and in which the results are anticipated are appropriate. There are also two new issues which are not often explicitly stated, but should be. First, the standard of evidence should be fair in that it should be possible to either accept or reject the main claims of the paper on the basis of the evidence. Second, there are often many analytical decisions that an author can use to make the results look better, for example, choosing which covariates to adjust. Andrew Gelman refers to this as the Garden of Forking Paths. I’m trying to encourage both reviewers to look for this and authors to be honest about the data dependent analysis decisions they used, and the corresponding limitations of the results.

6. Scientific or Scholarly Significance of the study or work
High (Highly Original) Objectives and purposes offer either an expansion of the state of the art or important confirming or disconfirming evidence for commonly held beliefs. The proposed framework represents an extension of either the cognitive or measurement part or a novel combination of the two. Methods are either novel in themselves or offer novelty in their application. Results could have a significant impact in practice in the field.
Medium Objectives and purposes are similar to those of many other projects in the field. The proposed framework has been commonly used in other projects. Paper presents a routine application of the method. Results would mostly provide continued support for existing practices.
Low (Routine) Objectives and purposes offer only a routine application of well understood procedures to a problem without much novelty. The framework offers no novelty, or the novelty is a function of inappropriate use. If novelty in methodology is present, it is because the application is inappropriate. Results would mostly be ignored because of flaws in the study or lack of relevance to practice.

When I’m asked to review papers without a specific set of criteria, I always look for the following four elements:

  1. Novelty
  2. Technical Soundness
  3. Appropriateness for the venue
  4. Readability

These don’t map neatly onto the six criteria that AERA uses. I tried to build appropriateness into AERA’s criteria about Objectives and purposes, and to build novelty into AERA’s criteria about Significance. Almost all of AERA’s criteria deal with some aspect of technical soundness.

Readability somehow seems left out. Maybe I need another scale for this one. On the other hand, it has an inhibiting relationship with the other scales: if the paper is not sufficiently readable, then it fails to make its case on the other criteria.

It is also hard to figure out how to weigh the six subscales onto the overall accept/reject decision axis. This is the old problem of collapsing multiple scales onto a single scale. It is a bit harder because the relationship is an interesting one, part conjunctive and part disjunctive.

The conjunctive part comes with the relationship between the Low and Medium levels. Unless all of the criteria are at least at moderate levels, or the flaws causing the paper to get a Low rating on a criterion are easy to fix, there is a fairly strong argument for rejecting the paper as not representing sufficiently high-quality work.

However, to go from the minimally acceptable to high priority for acceptance, the relationship is disjunctive: any one of the criteria being high (especially very high) would move it up.
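The part-conjunctive, part-disjunctive combination described above can be sketched in code. This is purely a hypothetical illustration of the logic, not an actual scoring tool; the 1–5 numeric scale (1 = Low, 3 = Medium, 5 = High), the `overall_rating` name, and the `fixable` flag are my own assumptions.

```python
def overall_rating(scores, fixable=False):
    """Combine criterion ratings into an overall recommendation.

    scores  -- ratings on a 1-5 scale (1 = Low, 3 = Medium, 5 = High)
    fixable -- whether any low-rated flaws are easy for the authors to fix
    """
    # Conjunctive part: any criterion below Medium argues for rejection,
    # unless the flaw is easy to correct before presentation.
    if any(s < 3 for s in scores) and not fixable:
        return "Clear Reject"
    # Disjunctive part: once everything is at least Medium, a single
    # High criterion lifts the paper to a clear accept.
    if all(s >= 3 for s in scores) and any(s >= 5 for s in scores):
        return "Clear Accept"
    return "Borderline Accept"
```

For example, six Medium ratings with one High would come out as a clear accept, while a single unfixable Low would dominate everything else, which matches the conjunctive reading above.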

A bigger problem is what to do with a paper that is mixed: possibly high in some areas, and low in others. Here I think I need to rely on the judgement of the referees. With that said, here is my proposed rubric for overall evaluation.

Overall Evaluation
Clear Accept All criteria are at the Medium level, with at least some criteria at the High level. Research has at least one interesting element in the choice of objectives, framework, methods, data, or results that would make people want to hear the paper/see the poster.
Borderline Accept Most criteria are at the Medium level, with any flaws possible to correct by the authors before presentation. Research may be of interest to at least some members of the SIG.
Clear Reject One or more criteria at the Low level and flaws difficult to fix without fundamentally changing the research. Research will be of little interest to members of the SIG.

The last problem is that we are using short abstracts rather than full papers. In many cases, there may not be enough material in the abstract to judge. What are your feelings about that? The SIG management team has generally liked the abstract review format, as it makes reviewing faster and makes it easier to submit in-progress work. Should we continue with this format? (Too late to change for 2017, but 2018 is still open.)

I’m sure that these rubrics have many more issues than the ones I’ve noticed here. I would encourage you to find all the holes in my work and point them out in the comments. Maybe we can get them fixed before we use this flawed rubric to evaluate your paper.

Edit: I’ve added the official AERA labels for the scales in parentheses, as AERA has FINALLY let me in to see them.

2017 AERA Call for Proposals


Cognition and Assessment SIG 167

You are invited to submit a proposal to the 2017 annual meeting of the AERA Cognition and Assessment SIG. The Cognition and Assessment SIG presents researchers and practitioners with an opportunity for cross-disciplinary research within education. We are a group of learning scientists and researchers who are interested in better assessing cognition and in leveraging cognitive theory and methods in the design and interpretation of assessment tools, including tests. Our research features many different methods, including psychometric simulations, empirical applications, cognitive model development, theoretical rationales, and combinations of all of the above.

The Cognition and Assessment SIG welcomes research proposals covering an array of topics that meet the broad needs and research interests of the SIG. We encourage you to submit 500-word proposals for symposium, paper, poster, or other innovative sessions. If you would like to use the full AERA word limit, however, we will accept that.

Please submit proposals to the Cognition and Assessment SIG through the AERA online program portal by July 22, 2016. Please find the specific proposal guidelines for each of the session types here:

If you have any question about submitting a proposal to the Cognition and Assessment SIG, please contact the SIG Program Chair, Dr. Russell Almond < > or SIG Chair Dr. Laine Bradshaw < >.  

Note:  Discussing and clarifying the rubric used to evaluate the proposals is an important issue related to our SIG.  Look for an upcoming blog post on this topic.