Objectives and Evidence

It’s time to mark my beliefs to market on my earlier statements about Behaviourist and Cognitive perspectives on assessment.

I’m now involved in a small consulting project in which Betsy Becker, some of our students and I are helping an agency review its licensure exam.  Right now we have our students extracting objectives from the requirements documents.  We are trying this with the cognitive approach rather than the behaviourist one, so our students are asking questions about how to write the objectives.  I thought I would share our advice, and possibly get some feedback from the community at large.

I don’t have permission from our client to discuss their project, so let me use something from my introductory statistics class.  Reading the chapter on correlation, I find the following objective:

  • Understand the effect of leverage points and outliers on the correlation coefficient.

The behaviourists wouldn’t like that one, because it uses an unmeasurable verb, understand.  They would prefer me to substitute an observable verb in the statement, something like:

  • Can estimate the effect of a leverage point or outlier on the correlation coefficient.

This is measurable, but it only captures part of what I had in mind in the original objective.  The solution is to return to the original objective, but add some evidence:

  • Understand the effect of leverage points and outliers on the correlation coefficient.
    1. Given a scatterplot, can identify outliers and leverage points.
    2. Given a scatterplot and with a potential leverage point identified, can estimate the effect of removing the leverage point on the correlation coefficient.
    3. Can describe the sensitivity of the correlation coefficient to leverage points in words.
    4. Given a small data set with a leverage point, can estimate the effect of the leverage point on the correlation.
    5. Can add a high leverage point to a data set to change the correlation in a predefined direction.
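
To make evidence items 2, 4 and 5 concrete, here is a minimal Python sketch of the kind of calculation a student would be estimating by eye. The data set and the two added points are invented for illustration; they are assumptions, not material from the statistics class.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical small data set with no linear trend (y is symmetric in x).
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 1, 3]
r_base = pearson_r(xs, ys)

# Evidence items 2 and 4: one point far out in x changes the correlation.
r_up = pearson_r(xs + [20], ys + [15])

# Evidence item 5: choose the leverage point to push r in a chosen direction.
r_down = pearson_r(xs + [20], ys + [-10])

print(f"no leverage point:   r = {r_base:.2f}")
print(f"leverage point high: r = {r_up:.2f}")
print(f"leverage point low:  r = {r_down:.2f}")
```

A single added point far from the rest of the x values is enough to move the correlation from essentially zero to strongly positive or strongly negative, which is the sensitivity that evidence item 3 asks the student to describe in words.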

The list is not exhaustive.  In fact, a strong advantage of the cognitive approach is that it suggests novel ways of thinking about measuring and perhaps then teaching the objective.

Listing at least a few examples of evidence when defining the objective helps make the objective more concrete.  It also ties it to relative difficulty, as we can move these sources of evidence up and down Bloom’s taxonomy to make the assessment harder or easier.  For example:

  1. [Easier] Define leverage point.
  2. [Harder] Generate an example with a high leverage point.

We are instructing our team to write objectives in this way.  Hopefully, I’ll get a chance later to tell you about how well it worked.



Cog & Assessment Poster Session (April 11, 10:00)


This is an open thread for the SIG poster session. Poster presenters, feel free to add more detail (and handouts) about your work.  Others, feel free to comment on and give feedback on the posters.

Cognition & Assessment SIG Poster Session
Mon, April 11, 10:00 to 11:30am,
Convention Center, Level Two, Exhibit Hall D


  • Allocation of Visual Attention When Processing Multiple‐Choice Items With Different Types of Multiple  Representations ‐ Steffani Sass, IPN ‐ Leibniz Institute for Science and Mathematics Education; Kerstin Schütte, IPN;  Marlit Annalena Lindner, IPN ‐ Leibniz Institute for Science and Mathematics Education
  • Definition and Development of a Cognitively Diagnostic Assessment of the Early Concept of Angle Using the Q‐Matrix Theory and the Rule‐Space Model ‐ Elvira Khasanova, University at Buffalo ‐ SUNY
  • In‐Task Assessment Framework: A Framework for Assessing Individual Collaborative Problem‐Solving Skills in an Online Environment ‐ Jessica J Andrews, Educational Testing Service; Deirdre Song Kerr, Educational Testing Service; Paul Horwitz, The Concord Consortium; John Chamberlain, Center for Occupational Research and Development; Al Koon; Cynthia McIntyre, The Concord Consortium; Alina A. Von Davier, ETS
  • Measuring Reading Comprehension: Construction and Validation of a Cognitive Diagnostic Assessment ‐ Junli Wei, University of Illinois at Urbana‐Champaign
  • Assessing Metacognition in the Learning Process (MILP): Development of the MILP Inventory ‐ Inka Hähnlein, University of Passau; Pablo Nicolai Pirnay‐Dummer, University of Halle, Germany
  • Developing Bayes Nets for Modeling Student Cognition in Digital and Nondigital Assessment Environments ‐ Yuning Xu, Arizona State University; Roy Levy, Arizona State University; Kristen E. Dicerbo, Pearson; Emily R. Lai, Pearson; Laura Holland, Pearson Education, Inc.
  • Implementing Diagnostic Classification Modeling in Language Assessment: A Cognitive Model of Second Language  Reading Comprehension ‐ Tugba Elif Toprak, Gazi University
  • Introduction to Truncated Logistic Item Response Theory Model ‐ Jaehwa Choi, The George Washington University
  • Misconceptions in Middle‐Grades Statistics: Preliminary Findings From a Diagnostic Assessment ‐ Jessica Masters, Research Matters, LLC; Lisa Famularo, Measured Progress
  • Using Verbal Protocol Analysis to Detect Test Bias ‐ Megan E. Welsh, University of California ‐ Davis; Sandra M. Chafouleas, University of Connecticut; Gregory Fabiano; T. Chris Riley‐Tillman, University of Missouri ‐ Columbia

Using Multiple Learning Progressions to Support Assessment (April 10, 2:45)


This is an open thread for comments on the upcoming AERA symposium.  Feel free to share thoughts and discuss the papers below.

Sun, April 10, 2:45 to 4:15pm,
Marriott Marquis, Level Four, Independence Salon G
Title: Using Multiple Learning Progressions to Support Assessment

Abstract: To use learning progressions to support students’ conceptual and linguistic development, teachers must elicit samples of student reasoning and discourse. This requires a change in pedagogical approaches, from the traditional teacher‐fronted direct instruction to the facilitation of meaning‐making discourse and student reasoning. Both educators and students need resources to enact these changes, and it is critical that these resources support the full inclusion of English learners. This paper discusses an NSF‐funded project to develop and pilot such resources. Teachers implementing this discourse‐focused instruction reported that it provided them frequent opportunities to track students’ conceptual understanding and to assist them, in the moment, to probe more deeply and to model the academic language that students needed to express themselves effectively.


  • Using a Proportional Reasoning Learning Progression to Develop, Score, and Interpret Student Explanations ‐ E. Caroline Wylie, ETS; Malcolm Bauer, ETS
  • Tandem Learning Progressions Provide a Salient Intersection of Student Mathematics and Language Abilities ‐ Alison L. Bailey, University of California ‐ Los Angeles; Margaret University of Minnesota
  • Simultaneous Assessment of Two Learning Progressions for Mathematical Practices ‐ Gabrielle Cayton; Leslie Nabors Olah, Educational Testing Service; Sarah Ohls, ETS; Allyson J. Kiss, University of Minnesota
  • Professional Development to Support Formative Assessment of Mathematics Constructs and Language in Mathematics Classrooms ‐ Christy Reveles, WIDA at Wisconsin Center for Educational Research; Rita MacDonald, University of Wisconsin ‐ Madison

Chair: E. Caroline Wylie, ETS
Discussant: Guadalupe Carmona, The University of Texas at San Antonio


Cognition & Assessment SIG Panel (April 9, 2:15)


This is an open comment thread for the Panel session.  Feel free to continue the panel discussion off-line.

Sat, April 9, 2:15 to 3:45pm,
Marriott Marquis, Level Four, Independence Salon F
Title: Principled Assessment Design in Action: Best Professional Practices for Digitally
Delivered Learning and Assessment Systems

In this moderated panel session, interdisciplinary experts with both technical and practical expertise discuss best practices for putting a principled design approach for digitally‐delivered learning and assessment systems into practice. They will debate critical issues around successful professional practices within their teams and institutions in order to develop and nurture coherent ways of acting, reflecting, and planning. That is, they will discuss their “lessons learned”, both successful and not‐so‐successful. Panelists will also provide take‐home handouts with key principles for further dissemination.

Panelists and topics:

  • Process and Product Data Capture ‐ Tiago Calico, University of Maryland ‐ College Park
  • Automated Writing Diagnostics and Scoring ‐ Peter W. Foltz, Pearson
  • Diagnostics for Digital Learning Environments ‐ Janice D. Gobert, Worcester Polytechnic Institute
  • Assessment of Professional Skills ‐ Vandhana Mehta, Cisco Systems Inc
  • Research Capacities for Technology‐Rich Assessment ‐ Andreas H. Oranje, Educational Testing Service
  • Stealth Assessment ‐ Valerie J. Shute, Florida State University
  • Computational Psychometrics ‐ Alina A. Von Davier, ETS

Chair: Andre A. Rupp, Educational Testing Service (ETS)


Cognitive Models for Assessment (April 9, 10:35)


This is an open comments section for the upcoming session at AERA.  Feel free to comment and discuss what was learned at the session.

Sat, April 9, 10:35am to 12:05pm,
Marriott Marquis, Level Four, Independence Salon F
Title: Cognitive Models for Assessment

Cognitive modeling has been actively used to understand human cognition in a wide range of educational research.  However, application of cognitive modeling to educational assessment does not have an extensive history. With a growing emphasis on problem solving skills and inquiry practices, interactive games and simulations are becoming a common tool for educational assessments. Compared with traditional models, cognitive modeling offers enhanced capabilities to understand complex process data from game/simulation‐based assessments at lower levels of grain size. This symposium will present a few examples of how cognitive modeling is being used to understand and assess cognitive processes in various game/simulation‐based assessments. The benefits and limitations of the cognitive modeling approach and the implications for future assessment research will also be discussed.


  • Modeling Science Inquiry in an Interactive Simulation Task ‐ Jung Aa Moon, Educational Testing Service
  • Extending the Additive Factors Model to Assess Student Learning Rates ‐ Ran Liu, The University of Pennsylvania; Kenneth R. Koedinger, Carnegie Mellon University
  • Evaluating the Efficacy of Real‐Time Scaffolding for Data Interpretation Skills ‐ Raha Moussavi, Worcester Polytechnic Institute; Michael A. Sao Pedro, Worcester Polytechnic Institute & Apprendis LLC; Janice D. Gobert, Worcester Polytechnic Institute
  • What Are Mental Models of Electronic Circuits? Basing an Assessment on Computational Simulations of Experts ‐ Kurt VanLehn, Arizona State University ‐ Tempe
  • From Artificial Intelligence to Intelligent Assessment ‐ Michelle LaMar, Educational Testing Service

Chairs: Jung Aa Moon, Educational Testing Service; Michelle LaMar, Educational Testing Service
Discussant: Irvin R. Katz, Educational Testing Service

Crowdsourcing an ECD database

In response to the recent AERA call for key findings and results, I wrote a short piece on evidence-centered assessment design (ECD) as a key technology.  I thought you might be interested, so here is a copy:


You may also want to look at the bibliography separately.  Here it is in bibtex format:


As usual, I was doing this at the last minute.  So despite some feedback from Bob Mislevy and Val Shute, I’m sure I missed somebody important or some key reference.  Please use the comment section to tell me who and what I missed.

On a related note, I’m trying to get the page of ECD projects at


up to date.  Email me if you want the editing password.

Cognitive and Behaviorist Approaches to Assessment

Recently, I have been updating the official syllabi for some of my courses, and I have been experiencing some cognitive dissonance in that process. At my university, we are supposed to write out objectives using “action verbs” and we are specifically prohibited from using the words “know” and “understand.” The goal is to make objectives that are measurable. But that is where my dissonance is coming from: I think this approach is conflating the goals and the measurement.

For example, for the intro stat class I would write an objective as

  • understands the mean, median and mode and the strengths and weaknesses of each measure.

Naturally, as the key verb is understands, it cannot be directly measured. That’s okay, though: I can just apply evidence-centered design and ask what kind of observations could provide evidence for this objective. Some examples might be

  • Can calculate the mean, median and mode of a small data set using only a calculator.
  • Can predict the effects of outliers on each of the three measures.
  • Can recognize situations in which the median is a better measure of center than the mean.

Naturally, this list is not exhaustive.
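
The second evidence item (predicting the effect of outliers on each measure) is easy to check numerically. The following is a minimal sketch using Python's standard `statistics` module; the scores and the outlier value are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical small data set, the kind a student could work by calculator.
clean = [2, 3, 3, 4, 5]

# The same data with one extreme value appended.
with_outlier = clean + [40]

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{label:>12}: mean={mean(data):.2f} "
          f"median={median(data)} mode={mode(data)}")
```

With these numbers the mean moves from 3.4 to 9.5, the median only from 3 to 3.5, and the mode stays at 3. That is the pattern a student who understands the three measures should be able to predict before calculating anything.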

One part of my cognitive dissonance arises because the Curriculum Committee, which set up the rules for the syllabus objectives, is coming at assessment from a different perspective than I am. I am thinking of a paper of Bob Mislevy’s in which he lays out four psychological perspectives for educational arguments: a trait perspective, a behaviorist perspective, an information-processing perspective, and a sociocultural perspective. The Curriculum Committee seems to favor a behaviourist perspective, where the focus is on specific behaviors we can observe the students do. I, on the other hand, am coming at the problem of course design from an information-processing perspective. My concern is whether or not the students acquire basic statistical tools they can use to process information in the course of the class.

Even though I can write out a list of behaviours which correspond to the list of knowledge objectives I want to measure, the list of knowledge objectives is more compact. Can I really properly write out everything I want the students to be able to do with the concept of mean? Doing so would result in a list of course objectives that is several pages long. Furthermore, I like the idea of being able to make the list of behaviours open-ended; I’m not sure I could list every way that a student in an introductory stat class could use a mean. On the other hand, the students probably strongly prefer a finite list of objectives, so that they can be sure that they have studied every type of problem that is likely to appear on the test.

Curiously, if I came at this from a sociocultural perspective, I would not have the same difficulty with the action words. One of the objectives for my intro stat class is for the student

  • to be able, working from computer (SPSS) output, to describe the results of the analysis in the style of the results section of a research paper.

The active verb here is “describe”, which is just fine with the Curriculum Committee. However, the phrase “style of the results section of a research paper” hides a multitude of details. In fact, I don’t think I could write out all of the rules for good style (although I have made some attempts); all I can really say is that I know it when I see it. I have a remarkably good agreement rate with other faculty members on this topic, but even senior graduate students helping me provide feedback to students in the stat class miss many of the stylistic details I think are important.

To be fair, I think the Curriculum Committee’s rule is at least moving in the right direction. It forces the course designer to think about how the objectives will be measured, which is a good thing. I’m just not sure that we have a consensus on what the best way to write measurable objectives is.

The Purpose of Cognitive Diagnostic Assessment


“A wise fish never goes anywhere without a porpoise.”
Mock Turtle to Alice, Alice’s Adventures in Wonderland,
Lewis Carroll.

I use this quote to start off my class on test construction because
the purpose of an assessment is the very first and very last thing you
should think about when building a test. It is important to think
about it in the beginning, because unless the design team is all on
the same page about the purpose of the assessment, they will produce
pieces that do not fit together well. The last part of the test
construction process should be a validity study which ensures that the
newly created assessment is actually suitable for the purpose to which
it is to be put. Failing to nail down the purpose early in the design
process invites the nefarious scope to be a part of your design team.

The problem for most assessment design projects is that when the team
gets partway through the design process, somebody will have a
brilliant idea about a second purpose the assessment can be used for.
After all, as long as you are going to all the effort and expense of
building a new assessment, why not …? Stop! This is
potential trouble. At the very least the new purpose will require an
extra validity study as now both purposes have to be validated. At
worst, it can seriously dilute the collection of tasks on the test. A
biblical quote comes to mind:

No one can serve two masters. Either
you will hate the one and love the other, or you will be devoted to
the one and despise the other. (Matthew 6:24, NIV).

A similar principle holds for assessments with multiple purposes: one
of the purposes will benefit at the expense of all the others.

The problem for cognitively diagnostic assessment is that it is often
the second purpose, grafted on to an assessment whose primary purpose
is a high-stakes selection or placement decision. It is natural that
people who do poorly in such an exam would want additional diagnostic
information about where they fell short. So it seems like
retrofitting a diagnostic report onto the high-stakes assessment would
be a natural benefit to examinees.

Here is where the two purposes come into play. If the primary purpose
is a high-stakes selection or placement decision, then the
overwhelming need of the test is high reliability. Usually this is
accomplished by doing some kind of pretest and then looking at the
biserial correlation between the item score and the total test score.
Items with a low correlation are deemed to have low reliability and do
not make it to the final test form.
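
The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response matrix is invented, the cutoff of 0.2 is a hypothetical value rather than a standard, and the sketch uses the point-biserial (an ordinary Pearson correlation with a 0/1 item score) in place of the biserial.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation; with a 0/1 item score this is the point-biserial."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)  # assumes neither variable is constant

# Hypothetical pretest data: one row per examinee, one 0/1 column per item.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in responses]
n_items = len(responses[0])

# Correlate each item's scores with the total test score.
item_total_r = [
    pearson_r([row[j] for row in responses], totals) for j in range(n_items)
]

THRESHOLD = 0.2  # hypothetical screening cutoff
flagged = [j for j, r in enumerate(item_total_r) if r < THRESHOLD]

for j, r in enumerate(item_total_r):
    print(f"item {j}: item-total r = {r:.2f}")
print("items flagged for removal:", flagged)
```

In this toy data the last item is uncorrelated with the total score and is the only one flagged. Operational programs often correlate each item with the total excluding that item, which this sketch does not do.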

But these low-reliability items might be exactly the items that are
best at providing differential diagnosis. In particular, if the
purpose of diagnosis is to determine whether an examinee is lacking
in Skill 1 or Skill 2, and the presence of Skill 1 and the presence
of Skill 2 are moderately to strongly correlated in the population,
items that provide good differential diagnosis between Skill 1 and
Skill 2 are likely to have lower biserial correlations with the
total score. Therefore, the best items for diagnostic assessment get
purged from the assessment by the test construction procedure.
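
A small simulation can illustrate this effect. Everything in the sketch is an assumption made for illustration: the two skills are standard normal with correlation 0.6, ten ordinary items depend on the average of the two skills, and a single deliberately extreme "diagnostic" item depends only on the difference between them.

```python
import random
from math import exp, sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def p_correct(ability, slope=1.5):
    """Logistic probability of a correct response (hypothetical item model)."""
    return 1 / (1 + exp(-slope * ability))

random.seed(0)
N_EXAMINEES = 2000
N_GENERAL = 10
RHO = 0.6  # assumed population correlation between Skill 1 and Skill 2

rows = []
for _ in range(N_EXAMINEES):
    common = random.gauss(0, 1)
    skill1 = sqrt(RHO) * common + sqrt(1 - RHO) * random.gauss(0, 1)
    skill2 = sqrt(RHO) * common + sqrt(1 - RHO) * random.gauss(0, 1)
    # Ordinary items draw on both skills together.
    general = [int(random.random() < p_correct((skill1 + skill2) / 2))
               for _ in range(N_GENERAL)]
    # The diagnostic item responds only to the gap between the skills.
    diagnostic = int(random.random() < p_correct(skill1 - skill2))
    rows.append(general + [diagnostic])

def item_rest_r(j):
    """Correlation of item j with the total score on all other items."""
    item = [row[j] for row in rows]
    rest = [sum(row) - row[j] for row in rows]
    return pearson_r(item, rest)

general_r = [item_rest_r(j) for j in range(N_GENERAL)]
diagnostic_r = item_rest_r(N_GENERAL)

print(f"lowest ordinary item-rest r: {min(general_r):.2f}")
print(f"diagnostic item-rest r:      {diagnostic_r:.2f}")
```

Under these assumptions the diagnostic item is the one a correlation-based screen would discard, even though it is the only item that carries information about which of the two skills is weaker.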

High-stakes assessment also puts demands on test security, which is not
such a stringent requirement for a low-stakes diagnostic assessment.
In particular, for a high-stakes
testing program, the items need to be periodically replaced and new
forms created to keep examinees from studying the specific test items
instead of the general construct measured by the assessment. This
brings about the need for equating the forms. In this case adding
diagnostic feedback to each of the items is very expensive, because the
feedback needs to be reauthored for each new form of the assessment.

An alternative would be a two-stage procedure. The first stage is
the original high-stakes exam. Examinees who are not happy with their
scores at the first stage can then take the second stage diagnostic
exam. As the diagnostic exam is low-stakes (reported only back to the
examinees and possibly the examinees’ instructors), there is no need
to change the items (unless the test specifications change). Also,
this can be done online without proctoring, making the test long
enough to get enough information about each aspect of proficiency that
is important. Also, if the high-stakes stage can be linked to the
diagnostic stage, then the scores from the first stage can be used as
a starting point for the diagnostic analysis (this is straightforward
to do with Bayesian scoring algorithms).
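
As a rough sketch of how a first-stage result can seed the diagnostic analysis, consider a single skill that is either mastered or not. The slip and guess rates, the prior of 0.3, and the response pattern below are all hypothetical values chosen for illustration; a real Bayesian scoring model would usually involve several skills.

```python
def update_mastery(prior, responses, slip=0.1, guess=0.2):
    """Posterior P(skill mastered) given 0/1 responses to items on that skill.

    slip  = P(wrong answer | mastered), guess = P(right answer | not mastered).
    Both rates are hypothetical and assumed equal across items.
    """
    like_master = 1.0
    like_nonmaster = 1.0
    for correct in responses:
        like_master *= (1 - slip) if correct else slip
        like_nonmaster *= guess if correct else (1 - guess)
    numerator = prior * like_master
    return numerator / (numerator + (1 - prior) * like_nonmaster)

# Hypothetical: a low first-stage score suggests mastery is unlikely.
stage_one_prior = 0.3
diagnostic_responses = [1, 1, 0, 1]

posterior = update_mastery(stage_one_prior, diagnostic_responses)
flat_start = update_mastery(0.5, diagnostic_responses)

print(f"starting from stage one: {posterior:.2f}")
print(f"starting from 50/50:     {flat_start:.2f}")
```

The same diagnostic responses lead to a more cautious conclusion when the analysis starts from the first-stage result than when it starts from a 50/50 prior. Because the update is multiplicative, scoring the responses in batches gives the same posterior as scoring them all at once.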

Cognitive Science And Assessment Blog: Welcome

Welcome to the Cognitive Science and Assessment Blog

We are the Cognitive Science and Assessment Special Interest Group
(SIG) of the American Educational Research Association (AERA). The
SIG is made up of people who are interested in the intersection of
cognitive science and assessment, especially as applied to education.
If you are interested in any of those topics, then you are among our
target audience.

The bloggers for this site are mostly professionals in the field along
with some graduate students. We expect that many of the readers will
be graduate students, so we will try to keep discussions at that
level. However, that means there will still likely be a lot of
jargon, especially about measurement (i.e., assessment or testing),
that will be difficult for a lay audience to understand. If you are
interested in the topics in this blog, but are having trouble with
the jargon, we would like to recommend the collection of educational
resources about assessment available at the National Council on
Measurement in Education (NCME).

Why a blog?

The idea came from a conversation I had with SIG President Andre Rupp,
as well as a number of others, about how to better foster a sense of
community among SIG members, particularly between the annual meetings
of the AERA. My idea was that a technical blog, something at about
the level of Andrew Gelman’s Statistical Modeling Blog, something
that would encourage
discussion about topics that were technical but not so technical that
only a few people could understand them, would encourage people to
spend time discussing with each other. Furthermore, by encouraging a
highly literate and technical audience, we would get technically
interesting and useful discussions.

A second reason for starting a blog is that there are a large number of
semi-technical topics that we need to work through in our field:
things which are too well understood to be the subject of papers, but
not well enough worked out to be in the standard textbooks. Take the
issue of how long an assessment should be, especially a
cognitively diagnostic assessment which measures multiple aspects of
proficiency. The [APA/AERA/NCME Joint] Standards state that the
assessment needs to be of a length that is suitable to its purpose.
Anybody who has done assessment design knows that there are a lot of
complex considerations that the designers need to balance when
deciding on the length of the assessment. My goal is not to work through this
issue now (good topic for a future post), but rather to point out that
a lively discussion from multiple experts in the field would benefit
both the community of practitioners and students trying to learn both
the art and science of assessment.

But aren’t blogs dead?

I’ve seen on the internet lately (ironically from blogs I follow)
several posts indicating that blogging is a dead art. I think that
the truth is we are just now discovering what it is that blogs are
best at. Mailing lists work well for calls for papers and post-doc
positions, but they don’t really encourage discussions. After all,
there are many times in my life when I want less rather than more
email. Facebook and LinkedIn work well for keeping friends and
colleagues updated about your life, but again are not the right forum
for technical discussions. Twitter is very good at capturing
reactions to other content, but not at originating the content itself
(especially not anything that would take more than 140 characters to

So what is the niche for a blog? I think the answer is just what we
are proposing to do: short technical articles followed by technical
discussion. Unlike media companies, we don’t need an ever wider
audience to sell to advertisers to pay our salaries. Instead we need
the right audience (if you have read this far I hope that includes
you) to keep the discussion lively and interesting.

Who are the participants?

My initial goal is to have enough entries in the queue to be able to
post a new entry every week. To get there, I would like to find
enough co-bloggers who commit to one article per month (or even less)
that we have a good queue of material to post without anybody going
nuts (in my case, that may be too late). Andre and I have already
reached out to a number of you (although I have forgotten some of the
people I have talked to about this). If you are willing to
contribute, contact Russell Almond or Matthew Madison.

Of course, some of the most important positions in a blog are the
commenters. If you want to join the conversation you are welcome to
contribute below. As long as you have an interest in cognition,
assessment or both and are willing to maintain an appropriate level of
professional courtesy we welcome you. This is not an official
publication of the AERA or the Cognition and Assessment SIG, and
all opinions are those of the bloggers and commenters and not
representative of any particular institution, particularly the
places where they are employed.
However, we are unofficial
representatives of those institutions and we expect that all discussion
will follow the standards of professional behaviour we have come to expect.

What do you think about this idea?

Your comments about the new blog, what its scope and rules should be,
what topics we should cover, &c are welcome below.