Paper Rubric for 2017 AERA |

As a measurement specialist, I’ve always found the AERA evaluation rubric to be a bit minimal. AERA provides the names of the scales, but little information about what goes into them. Some of that is a function of the fact that different divisions and SIGs have very different ideas of what constitutes research (qualitative, quantitative, methodological, literature synthesis). We as a SIG can do better. So please help me out with an experiment on this part.

AERA defines six scales for us (see below). The goal of this post is to provide a first draft of a rubric for those six areas. I’m roughly following a methodology from Mark Wilson’s BEAR system, particularly, the construct maps (Wilson, 2004). As I’ve been teaching the method, I take the scale and divide it up into High, Medium, and Low areas, than then think about what kind of evidence I might see that (in this case) a paper was at that level on the indicated criteria. I only define three anchor points, with the idea that a five point scale can interpolate between them.

In all such cases, it is usually easier to edit a draft than to create such a scale from scratch. So I’ve drafted rubrics for the six criteria that AERA gives us. These are very much drafts, and I hope to get lots of feedback in the comments about things that I left out, should not have included, or put in the wrong place. In many cases the AERA scale labels are deliberately vague so as to not exclude particular kinds of research. In these cases, I’ve often picked the label that would most often apply to Cognition and Assessment papers, with the idea that it would be interpreted liberally in cases where it didn’t quite fit.

Here they are:

1. Objectives and Purposes
High (Critically Significant)	Objectives and purposes are clearly stated.	Objectives and purposes involve both cognition and assessment.	Objectives touch on issues central to the field of Cognition and Assessment.
Medium	Objectives and purposes can be inferred from the paper.	Objectives and purposes involve either cognition or assessment.	Objectives touch on issues somewhat related to the field of Cognition and Assessment.
Low (Insignificant)	Objectives and purposes are unclear.	Objectives and purposes re only tagentially related to cognition or assessment.	Objectives touch on issues unrelated to the field of Cognition and Assessment.

2. Perspective(s) or Theoretical Framework
High (Well Articulated)	Both cognitive and measurement frameworks are clearly stated and free from major errors.	The cognitive and measurement frameworks are complementary.	Framework is supported by appropriate review of the literature.
Medium	Only one of the cognitive and measurement perspectives are clearly stated, or both are implicit. If errors are present, they are minor and easily corrected.	Fit between cognitive and measurement frameworks is not well justified.	Framework is not well supported by the literature review.
Low (Not Articulated)	Cognitive and measurement frameworks are unclear or have major substantive errors.	There is a lack of fit between the cognitive and measurement models.	Literature reviews clearly misses key references.

Note that implicit in this rubric is the idea that a Cognition and Assessment paper should both have a cognitive framework and a measurement framework.

3. Methods, Techniques or Modes of Inquiry
High (Now Well Executed)	Techniques are clearly and correctly described.	Techniques are appropriate for the framework.	Appropriate evaluation of the methods is included in the paper.
Medium	Techniques are described, but possibly only implicitly, or there are only easily corrected errors in the techniques.	Techniques are mostly appropriate for the framework.	Minimal evaluation of the methods is included in the paper.
Low (Well Executed)	Techniques are not clearly described, implicitly or explicitly; or there are significant errors in the methods.	Techniques are not appropriate for the framework.	No evaluation of the methods is included in the paper.

I’ve sort of interchangeably used techniques and methods to stand for Methods, Techniques or Modes of Inquiry.

4. Data Sources, Evidence Objects, or Materials
High (Inappropriate)	Data are clearly described.	Data are appropriate for the framework and methods.	Limitations of the data (e.g., small sample, convenience sample) are clearly acknowledged.
Medium	Data source only partially described.	Data not clearly aligned with the purposes/methods.	Limitations of the data incompletely acknowledged.
Low (Appropriate)	Data description is unclear.	Data are clearly inappropriate given purpose, framework and/or methods.	Data have obvious, unacknowledged limitations.

Here data (note that data is a plural noun) has to be interpreted liberally to incorporate traditional assessment data, simulation results, participant observations, literature reviews and other evidence to support the claims of the paper.

5. Results and/or substantiated conclusions or warrants for arguments/points of view
High (Well Grounded)	Results are clearly presented, with appropriate standard errors or description of expected results is clear.	Success criteria are clearly stated.	It would be possible for the results to falsify the claims.	If results are available, conclusions are appropriate from the results.
Medium	Results are somewhat clearly presented or expected results are likely to be appropriate.	Success criteria are implicit.	There are many researcher degrees of freedom in the analysis, so the results presented are likely to be the ones that most strongly support the claims.	If results are available, the conclusions are mostly appropriate.
Low (Ungrounded)	Results are unclear, or the expected results are unclear.	Success criteria are not stated and (are/could be) determined after the fact.	The results are not capable of refuting the claim, or there are too many research degrees of freedom so that can always be made to support the claim.	If results are present, the conclusion is inappropriate.

I’ve tried to carefully word this so that it is clear that both papers in which the results are present and in which the results are anticipated are appropriate. There are also two new issues which are not often explicitly stated, but should be. First, the standard of evidence should be fair in that it should be possible to either accept or reject the main claims of the paper on the basis of the evidence. Second, there are often many analytical decisions that an author can use to make the results look better, for example, choosing which covariates to adjust. Andrew Gelman refers to this as the Garden of Forking Paths. I’m trying to encourage both reviewers to look for this and authors to be honest about the data dependent analysis decisions they used, and the corresponding limitations of the results.

6. Scientific or Scholarly Significance of the study or work
High (Highly Original)	Objectives and purposes offer either an expansion of the state of the art or important confirming or disconfirming evidence for commonly held beliefs.	The proposed framework represents an extension of either the cognitive or measurement part or a novel combination of the two.	Methods are either novel in themselves or offer novelty in their application.	Results could have a significant impact in practice in the field.
Medium	Objectives and purposes are similar to those of many other projects in the field.	The proposed framework has been commonly used in other projects.	Paper presents a routine application of the method.	Results would mostly provide continued support for existing practices.
Low (Routine)	Objectives and purposes offer only a routine application of well understood procedures to a problem without much novelty.	The framework offers no novelty, or the novelty is a function of inappropriate use.	If novelty in methodology is present, it is because the application is inappropriate.	Results would mostly be ignored because of flaws in the study or lack or relevance to practice.

When I’m asked to review papers without a specific set of criteria, I always look for the following four elements:

Novelty
Technical Soundness
Appropriateness for the venue
Readability

These don’t map neatly onto the six criteria that AERA uses. I tried to build appropriateness into AERA’s criteria about Objectives and purposes, and to build novelty into AERA’s criteria about Significance. Almost all of AERA’s criteria deal with some aspect of technical soundness.

Readability somehow seems left out. Maybe I need another scale for this one. On the other hand, it has an inhibitor relationship with the other scales. If the paper is not sufficiently readable, then it fails to make its case for the other criteria.

It is also hard to figure out how to weigh the six subscales onto the overall accept/reject decision axis. This is the old problem of pushing multiple scales onto a single scale. It is a bit harder because this is an interesting relationship, being part conjunctive and part disjunctive.

The conjunctive part comes with the relationship between the Low and Medium levels. Unless all of the criteria are at least at moderate levels, or the flaws causing the paper to get a low rating on that criteria are easy to fix, there is a fairly strong argument for rejecting the paper and not representing
sufficiently high quality work.

However, to go from the minimally acceptable to high priority for acceptance, the relationship is disjunctive: any one of the criteria being high (especially very high) would move it up.

A bigger problem is what to do with a paper that is mixed: possibly high in some areas, and low in others. Here I think I need to rely on the judgement of the referees. With that said, here is my proposed rubric for overall evaluation.

Overall Evaluation
Clear Accept	All criteria are at the Medium level, with at least some criteria at the High level.	Research has at least one interesting element in the choice of objectives, framework, methods, data or results that would make people want to hear the paper/see the poster.
Borderline Accept	Most criteria are at the Medium level, with any flaws possible to correct by the authors before presentation.	Research may be of interest to at least some members of the SIG.
Clear Reject	One or more criteria at the Low level and flaws difficult to fix without fundamentally changing the research.	Research will be of little interest to members of the SIG.

The last problem is that we are using short abstracts rather than full papers. In many cases, there may not be enough material in the paper to judge? What are your feelings about that. The SIG management team has generally like the abstract review format as that makes both the reviewing faster, and it easier to submit in progress work. Should we continue with this format? (Too late to change for 2017, but 2018 is still open.)

I’m sure that these rubrics have many more issues than the ones I’ve noticed here. I would encourage you to find all the holes in my work and point them out in the comments. Maybe we can get them fixed before we use this flawed rubric to evaluation your paper.

Edit: I’ve added the official AERA labels for the scales in parenthesis, as AERA has FINALLY let me in to see them.

3 thoughts on “Paper Rubric for 2017 AERA”

Howard T Everson said:

July 11, 2016 at 11:50 am

I like the way you’ve presented this, and concur that the scoring rubrics need improvement. I am on board on this approach.

LikeLike

Divya Varier said:

August 3, 2016 at 11:10 am

Russell, thank you for providing a thorough description of your rationale and thought process in addition to the components of the 6 AERA criteria. Even though this is for the SIG, the rubric helped me evaluate my own proposals (to other venues within AERA) as i prepared to submit them. Here are some comments that I hope will be helpful-

1. Criteria 2, theoretical framework, row High, column 3: about literature review- what is appropriate would differ based on the methods- qual, quan, mixed as well as on whether or not results are presented in the proposal. For example, authors of proposals that include findings might struggle with the word count if they need to include a review of empirical literature; authors of exploratory qualitative work that presents novelty and a solid framework but no results can use the space/word limit to strengthen their argument; unpacking what is and ‘appropriate’ the literature review might be helpful; I have found that some proposals with minimal empirical literature but references to foundational work come across as strong – especially extensively researched topics (like motivation, for example).

2. As a reviewer I like to go out of my way to see through readability issues as much as possible. At the same time, if I am unable to overcome the issues, then the readability aspect might be covered, as you said, in failure to meet other criteria on account of lack of clarity in communication. That is sufficient, in my opinion. Including an explicit criterion might be unnecessarily punitive and even unfair – especially for papers from regions where English communication style/diction may be different. In this sense, readability depends on the skill ad openness of the reviewer too.

3. I find the overall evaluation criteria to be reasonable, and I appreciate the description that clearly contrasts a reject and accept. The borderline accept criteria provide objective markers within which to consider reviewers’ own judgment/ liking of a proposal.

LikeLike

- rgalmond said:
  
  August 4, 2016 at 11:36 am
  
  1. I see what you mean about the interaction between the method, the submission length and the lit review. I just finished a paper (for another conference) that used the SWEEP operator, an older technique (Beaton, 1964; Dempster, 1969). My first draft contained four pages of detailed matrix algebra that showed why this was the appropriate way of handling missing data (c.f., Little & Rubin, 2002). But I needed to cut it to basically just the references, which I felt would look like hight level hand waving to the reviewers.
  
  One solution would be to make a separate rubric, for different paper types. Another solution (to late for this year, but something to bear in mind for next) might be to alter the CFP so it is clearer that you can use a more extended format for the paper.
  
  2. I think you came to mostly the same place I did on these issues. AERA rules allow us to add categories, but not remove them.
  
  3. Thanks.
  
  LikeLike