Crowdsourcing an ECD database

In response to the recent AERA call for key findings and results, I wrote a short piece on evidence-centered assessment design (ECD) as a key technology.  I thought you might be interested, so here is a copy:

http://ecd.ralmond.net/ECDFindings3.rtf

You may also want to look at the bibliography separately.  Here it is in bibtex format:

http://ecd.ralmond.net/ECDBib.bib

As usual, I was doing this at the last minute.  So despite some feedback from Bob Mislevy and Val Shute, I’m sure I missed somebody important or some key reference.  Please use the comment section to tell me who and what I missed.

On a related note, I’m trying to get the page of ECD projects at

http://ecd.ralmond.net/ecdwiki/ECD/Projects

up to date.  Email me if you want the editing password.

Cognitive and Behaviorist Approaches to Assessment

Recently, I have been updating the official syllabi for some of my courses, and I have been experiencing some cognitive dissonance in that process. At my university, we are supposed to write out objectives using “action verbs” and we are specifically prohibited
from using the words “know” and “understand.” The goal is to make objectives that are measurable. But that is where my dissonance comes from: I think this approach conflates the goals with the measurement.

For example, for the intro stat class I would write an objective as

  • understands the mean, median, and mode and the strengths and weaknesses of each measure.

Naturally, as the key verb is “understands,” it cannot be directly measured. That’s okay, though: I can just apply evidence-centered design and ask what kinds of observations could provide evidence for this objective. Some examples might be

  • Can calculate the mean, median and mode of a small data set using only a calculator.
  • Can predict the effects of outliers on each of the three measures.
  • Can recognize situations in which the median is a better measure of center than the mean.

Naturally, this list is not exhaustive.
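To make the outlier bullet concrete, here is a quick illustrative check in Python (the data values are invented for the example; in the course itself students would work with a calculator or SPSS rather than code):

    # How a single extreme value shifts the mean, median, and mode.
    # The data values below are made up purely for illustration.
    from statistics import mean, median, mode

    data = [4, 5, 5, 6, 7, 8, 9]
    with_outlier = data + [85]   # add one extreme observation

    for label, xs in [("original", data), ("with outlier", with_outlier)]:
        print(f"{label:>13}: mean={mean(xs):.2f}  median={median(xs)}  mode={mode(xs)}")

The mean jumps from about 6.3 to about 16.1, while the median barely moves and the mode does not change at all, which is exactly the kind of prediction the second bullet asks students to make.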

One part of my cognitive dissonance arises because the Curriculum Committee, which set up the rules for the syllabus objectives, is coming at assessment from a different perspective than I am. I am thinking of a paper
of Bob Mislevy’s in which he lays out four psychological perspectives for educational arguments: a trait perspective, a behaviorist perspective, an information-processing perspective, and a sociocultural perspective. The Curriculum Committee seems to favor a behaviorist perspective, where the focus is on specific behaviors we can observe
the students perform. I, on the other hand, am coming at the problem of course design from an information-processing perspective. My concern is whether or not the students acquire basic statistical tools they can use to process information over the course of the class.

Even though I can write out a list of behaviors which correspond to the list of knowledge objectives I want to measure, the list of knowledge objectives is more compact. Can I really write out everything I want the students to be able to do with the concept of a mean? The result is a list of course objectives that is several pages long. Furthermore, although I like the idea of making the list of behaviors open ended, I’m not sure I could list every way a student in an introductory stat class could eventually use a mean. On the other hand, the students probably strongly prefer a finite list of objectives, so that they can be sure they have studied every type of problem that is likely to appear on the test.

Curiously, if I came at this from a sociocultural perspective, I would not have the same difficulty with the action words. One of the objectives for my intro stat class is for the student

  • to be able, working from computer (SPSS) output, to describe the results of the analysis in the style of the results section of a research paper.

The active verb here is “describe”, which is just fine with the Curriculum Committee. However, the phrase “style of the results section of a research paper” hides a multitude of details. In fact, I don’t think I could write out all of the rules for good style
(although I have made some attempts); all I can really say is that I know it when I see it. I have a remarkably good agreement rate with other faculty members on this topic, but even senior graduate students helping me provide feedback to students in the stat class miss many of the stylistic details I think are important.

To be fair, I think the Curriculum Committee’s rule is at least moving in the right direction. It forces the course designer to think about how the objectives will be measured, which is a good thing. I’m just not sure that we have a consensus on what the best way to write measurable objectives is.

The Purpose of Cognitive Diagnostic Assessment

“A wise fish never goes anywhere without a porpoise.”
Mock Turtle to Alice, Alice’s Adventures in Wonderland,
Lewis Carroll.

I use this quote to start off my class on test construction because
the purpose of an assessment is the very first and very last thing you
should think about when building a test. It is important to think
about it in the beginning, because unless the design team is all on
the same page about the purpose of the assessment, they will produce
pieces that do not fit together well. The last part of the test
construction process should be a validity study which ensures that the
newly created assessment is actually suitable for the purpose to which
it is to be put. Failing to nail down the purpose early in the design
process invites the nefarious scope creep to become a part of your
design team.

The problem for most assessment design projects is that when the team
gets partway through the design process, somebody will have a
brilliant idea about a second purpose the assessment can be used for.
After all, as long as you are going to all the effort and expense of
building a new assessment, why not …? Stop! This is
potential trouble. At the very least the new purpose will require an
extra validity study as now both purposes have to be validated. At
worst, it can seriously dilute the collection of tasks on the test. A
biblical quote comes to mind:

No one can serve two masters. Either
you will hate the one and love the other, or you will be devoted to
the one and despise the other. (Matthew 6:24, NIV).

A similar principle holds for assessments with multiple purposes: one
of the purposes will benefit at the expense of all the others.

The problem for cognitively diagnostic assessment is that it is often
the second purpose, grafted on to an assessment whose primary purpose
is a high-stakes selection or placement decision. It is natural that
people who do poorly in such an exam would want additional diagnostic
information about where they fell short. So it seems like
retrofitting a diagnostic report onto the high-stakes assessment would
be a natural benefit to examinees.

Here is where the two purposes come into play. If the primary purpose
is a high-stakes selection or placement decision, then the
overwhelming need of the test is high reliability. Usually this is
accomplished by doing some kind of pretest and then looking at the
biserial correlation between the item score and the total test score.
Items with a low correlation are deemed to have low reliability and do
not make it to the final test form.
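
A minimal sketch of that screening step, assuming a simple matrix of
0/1 item scores, might look like the following. The point-biserial
(the Pearson correlation of the 0/1 item score with the total) stands
in for the biserial correlation mentioned above, and the 0.20 cutoff
and the use of the rest-of-test score are illustrative choices, not a
description of any particular testing program's procedure:

    import numpy as np

    def point_biserial(item, total):
        """Correlation between a 0/1 item score and a total score."""
        return np.corrcoef(item, total)[0, 1]

    def screen_items(responses, cutoff=0.20):
        """responses: (examinees x items) array of 0/1 scores.
        Returns indices of items whose correlation with the rest-of-test
        score falls below the cutoff (candidates for removal)."""
        flagged = []
        for j in range(responses.shape[1]):
            rest = responses.sum(axis=1) - responses[:, j]  # avoid part-whole inflation
            if point_biserial(responses[:, j], rest) < cutoff:
                flagged.append(j)
        return flagged

Items flagged this way are the ones that do not make it to the final
test form.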

But these low-reliability items might be exactly the ones that
provide good differential diagnosis. In particular, if the
purpose of diagnosis is to determine whether an examinee is lacking
in Skill 1 or Skill 2, and the presence
of Skill 1 and the presence of Skill 2 are
moderately to strongly correlated in the population, items that
provide good differential diagnosis between Skill 1
and Skill 2 are likely to have lower biserial
correlations with the total score. Therefore, the best items for
diagnostic assessment get purged from the assessment by the test
construction procedure.
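
A small simulation makes the argument concrete. The logistic item
models below, driven by Skill 1, Skill 2, or the contrast between
them, are assumptions chosen purely for illustration, but they show
how items sensitive to the difference between two correlated skills
end up with item-total correlations near zero:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 5000  # simulated examinees

    # Two skills, moderately to strongly correlated in the population.
    theta = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=n)
    skill1, skill2 = theta[:, 0], theta[:, 1]

    def item(logit):
        """Draw 0/1 responses from a logistic item model."""
        return (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    # Ten items driven by each skill, plus five "differential diagnosis"
    # items that respond to the contrast between the two skills.
    responses = np.column_stack([item(skill1) for _ in range(10)] +
                                [item(skill2) for _ in range(10)] +
                                [item(skill1 - skill2) for _ in range(5)])
    total = responses.sum(axis=1)

    # Item-total correlations, each computed against the rest-of-test score.
    r = [np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
         for j in range(responses.shape[1])]
    print("Skill 1 items:      ", round(float(np.mean(r[:10])), 2))
    print("Skill 2 items:      ", round(float(np.mean(r[10:20])), 2))
    print("Differential items: ", round(float(np.mean(r[20:])), 2))

Under these assumptions the differential items are exactly the ones
the usual item-total screening would purge, even though they carry
most of the information needed to tell the two skills apart.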

High-stakes assessment also puts demands on test security that are not
such stringent requirements for a low-stakes assessment. In particular, for a high-stakes
testing program, the items need to be periodically replaced and new
forms created to keep examinees from studying the specific test items
instead of the general construct measured by the assessment. This
brings about the need for equating the forms. In this case adding
diagnostic feedback to each of the items is very expensive, because the
feedback needs to be reauthored for each new form of the assessment.

An alternative would be a two-stage procedure. The first stage is
the original high-stakes exam. Examinees who are not happy with their
scores at the first stage can then take the second stage diagnostic
exam. As the diagnostic exam is low-stakes (reported only back to the
examinees and possibly the examinees’ instructors), there is no need
to change the items (unless the test specifications change). Also,
this stage can be done online without proctoring, and the test can be
made long enough to get sufficient information about each important
aspect of proficiency. Also, if the high-stakes stage can be linked to the
diagnostic stage, then the scores from the first stage can be used as
a starting point for the diagnostic analysis (this is straightforward
to do with Bayesian scoring algorithms).
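
As a rough sketch of that last point, assume a single binary skill, a
made-up linear mapping from the first-stage scale score to a prior
probability of mastery, and illustrative slip/guess parameters for the
diagnostic items (none of these numbers come from a real program):

    import numpy as np

    def prior_from_stage1(score, lo=200, hi=800):
        """Map a first-stage scale score to a prior probability of mastery.
        The linear mapping and the 200-800 range are illustrative only."""
        return float(np.clip((score - lo) / (hi - lo), 0.05, 0.95))

    def update_mastery(prior, responses, p_if_master=0.8, p_if_nonmaster=0.3):
        """Posterior P(mastery) after 0/1 responses to items tapping the skill."""
        p = prior
        for x in responses:
            like_m = p_if_master if x else 1 - p_if_master
            like_n = p_if_nonmaster if x else 1 - p_if_nonmaster
            p = p * like_m / (p * like_m + (1 - p) * like_n)
        return p

    # Example: a middling first-stage score followed by mostly incorrect
    # diagnostic items pulls the mastery estimate down.
    prior = prior_from_stage1(480)
    posterior = update_mastery(prior, [0, 1, 0, 0, 1, 0])
    print(round(prior, 2), round(posterior, 2))

In a real system the prior would come from the psychometric link
between the two stages rather than an ad hoc mapping, but the Bayesian
update itself works the same way, and a separate mastery variable
could be tracked for each aspect of proficiency being reported.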

Cognitive Science And Assessment Blog: Welcome

Welcome to the Cognitive Science and Assessment Blog

We are the Cognitive Science and Assessment Special Interest Group (SIG)
of the American Educational Research Association (AERA). The SIG is made up of
people who are interested in the intersection of cognitive science
and assessment especially as applied to education. If you are
interested in any of those topics, then you are among our target
readers.

The bloggers for this site are mostly professionals in the field along
with some graduate students. We expect that many of the readers will
be graduate students, so we will try to keep discussions at that
level. However, that still means that there will likely be a lot of
jargon, especially about measurement (i.e., assessment or testing),
that will be difficult to understand for a lay audience. If you are
interested in the topics in this blog but are having trouble with the
jargon, we recommend the collection of educational resources about
assessment available from the National Council on Measurement in
Education (NCME).

Why a blog?

The idea came from a conversation I had with SIG President Andre Rupp,
as well as a number of others, about how to better foster a sense of
community among SIG members, particularly between the annual meetings
of the AERA. My idea was that a technical blog, something at about
the level of Andrew Gelman’s Statistical Modeling Blog, encouraging
discussion about topics that were technical but not so technical that
only a few people could understand them, would get people to spend
time talking with each other. Furthermore, by attracting a
highly literate and technical audience, we would get technically
interesting and useful discussions.

A second reason for starting a blog is that there are a large number
of semi-technical topics that we need to work through in our field:
things which are too well understood to be the subject of papers, but
not well enough worked out to be in the standard textbooks. Take the
issue of how long an assessment should be, especially a
cognitively diagnostic assessment which measures multiple aspects of
proficiency. The [APA/AERA/NCME Joint] Standards state that the
assessment needs to be of a length that is suitable to its purpose.
Anybody who has done assessment design knows that there are a lot of
complex considerations that the designers need to balance when
deciding on the length of the assessment. My goal is not to work
through this issue now (a good topic for a future post), but rather to
point out that
a lively discussion from multiple experts in the field would benefit
both the community of practitioners and students trying to learn both
the art and science of assessment.

But aren’t blogs dead?

I’ve seen on the internet lately (ironically from blogs I follow)
several posts indicating that blogging is a dead art. I think that
the truth is we are just now discovering what it is that blogs are
best at. Mailing lists work well for calls for papers and post-doc
positions, but they don’t really encourage discussions. After all,
there are many times in my life when I want less rather than more
email. Facebook and LinkedIn work well for keeping friends and
colleagues updated about your life, but again are not the right forum
for technical discussions. Twitter is very good at capturing
reactions to other content, but not at originating content itself
(especially not anything that would take more than 140 characters to
explain).

So what is the niche for a blog? I think the answer is just what we
are proposing to do: short technical articles followed by technical
discussion. Unlike media companies, we don’t need an ever wider
audience to sell to advertisers to pay our salaries. Instead we need
the right audience (if you have read this far I hope that includes
you) to keep the discussion lively and interesting.

Who are the participants?

My initial goal is to have enough entries in the queue to be able to
post a new entry every week. To do that, I’d like to get
enough co-bloggers who commit to one article per month (or even less)
that we have a good queue of material to post without anybody going
nuts (in my case, that may be too late). Andre and I have already
reached out to a number of you (although I have forgotten some of the
people I have talked to about this). If you are willing to
contribute, contact Russell Almond or Matthew Madison.

Of course, some of the most important participants in a blog are the
commenters. If you want to join the conversation, you are welcome to
contribute below. As long as you have an interest in cognition,
assessment or both and are willing to maintain an appropriate level of
professional courtesy we welcome you. This is not an official
publication of the AERA or the Cognition and Assessment SIG, and
all opinions are those of the bloggers and commenters and not
representative of any particular institution, particularly the
places where they are employed.
However, we are unofficial
representatives of those institutions and we expect that all discussion
will follow the standards of professional behaviour we have come to
expect.

What do you think about this idea?

Your comments about the new blog, what its scope and rules should be,
what topics we should cover, &c are welcome below.