“Though failure to replicate presents a serious problem, even highly-replicable results may be consistently and dramatically misinterpreted if dependent measures are not carefully chosen.” This sentence comes from a new paper by Caren Rotello, Evan Heit, and Chad Dubé, to be published in Psychonomic Bulletin & Review.
Replication hurts in such cases because it reinforces artifactual
results. Rotello and colleagues
marshal support for this claim from four disparate domains: eyewitness memory, deductive reasoning,
social psychology, and studies of child welfare. In each of these domains, researchers make the same mistake by relying on the same flawed dependent measure.
Common across these domains is that subjects have to make detection
judgments: was something present or was it not present? For example, subjects in eyewitness
memory experiments decide whether or not the suspect is in a
lineup. There are four possibilities.
Hit: The subject responds
“yes” and the suspect is in the lineup.
False alarm: The
subject responds “yes” but the suspect is not in the lineup.
Miss: The subject responds
“no” but the suspect is in the lineup.
Correct rejection: The subject responds “no” and the suspect is not in the lineup.
To determine decision accuracy, it is sufficient to consider only the positive responses, hits and false alarms (the negative responses are complementary to the positive ones). The question is how we compute accuracy from hits and false alarms, and this is where Rotello and colleagues say the literature has gone astray.
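To make the bookkeeping concrete, here is a minimal sketch (in Python, with invented counts) of how hit and false-alarm rates are computed from the four outcome types:

```python
# Hypothetical counts from a single lineup condition (invented for illustration).
hits = 45                # "yes" when the suspect was in the lineup
misses = 55              # "no" when the suspect was in the lineup
false_alarms = 20        # "yes" when the suspect was not in the lineup
correct_rejections = 80  # "no" when the suspect was not in the lineup

# Only the positive responses are needed; the negative ones are complementary.
hit_rate = hits / (hits + misses)                                      # 0.45
false_alarm_rate = false_alarms / (false_alarms + correct_rejections)  # 0.20

print(f"hit rate = {hit_rate:.2f}, false-alarm rate = {false_alarm_rate:.2f}")
```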
To see why, let’s continue with the lineup example. Lineups can
be presented simultaneously (all faces at the same time) or sequentially (one
face at a time). A meta-analysis of data from 23 labs and 13,143 participants concludes that sequential lineups are superior to simultaneous ones: sequential lineups yield a diagnosticity ratio of 7.72 and simultaneous ones only 5.78; in other words, sequential lineups are 1.34 (7.72/5.78) times more accurate than simultaneous ones. Rotello and colleagues mention that 32% of police precincts in the United States now use sequential lineups. They don’t state explicitly that this is because of the research, but that is what they imply.
The diagnosticity ratio is computed by dividing the hit rate by the false-alarm rate, so the higher the ratio, the better the apparent performance. The notion of sequential superiority therefore rides on the assumption that the diagnosticity ratio is an appropriate measure of diagnosticity. Well, you might think, it has the word diagnosticity in it, so that’s at least a start. But as Rotello and colleagues demonstrate, this may be all that it has going for it.
If you compute the ratio of hits to false alarms (or the difference between them, as is often done), you’re assuming a linear relation. The straight lines in Figure 1 connect all the hypothetical subjects who have the same diagnosticity ratio. The lowest line connects the subjects who are at chance performance and thus have a diagnosticity ratio of 1 (hit rate = false-alarm rate). The important point is that you get this ratio for a conservative responder with 5% hits and 5% false alarms, but also for a liberal responder with 75% hits and 75% false alarms.
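Here is the same point as a quick sketch (the rates are hypothetical; only the 7.72 and 5.78 come from the meta-analysis): the ratio cannot tell a cautious responder from a trigger-happy one.

```python
def diagnosticity_ratio(hit_rate, false_alarm_rate):
    """Hit rate divided by false-alarm rate, as used in the lineup literature."""
    return hit_rate / false_alarm_rate

# A conservative and a liberal responder, both guessing at chance:
print(diagnosticity_ratio(0.05, 0.05))  # 1.0
print(diagnosticity_ratio(0.75, 0.75))  # 1.0, same ratio despite very different response bias

# The meta-analytic values mentioned above:
print(7.72 / 5.78)  # ~1.34, the claimed sequential advantage
```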
Lines like these, which trace how the hit rate and the false-alarm rate change together as the tendency to respond “yes” changes, are called Receiver Operating Characteristics (ROCs). (So now you know what that ROC is doing in the title of this post.) The ROC was developed by engineers in World War II who were trying to improve the detection of enemy objects on the battlefield; the concept was later introduced into psychophysics.
Now let’s look at some real data. The triangles in the figure represent data from an actual experiment (by Laura Mickes, Heather Flowe, and John Wixted) comparing simultaneous (open triangles) and sequential (closed triangles) lineups. The lines that you can fit through these data points are curved; every point on such a line reflects the same accuracy but a different tendency to respond “yes.” Rotello and colleagues note that curved ROCs are consistent with the empirical reality, whereas the straight lines assumed by the diagnosticity ratio are not.
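To see where the curvature comes from, here is a small sketch of my own (assuming a standard equal-variance signal detection model, not the authors’ analysis): sweeping the response criterion from liberal to conservative traces a curved ROC, whereas a constant diagnosticity ratio forces hit and false-alarm rates onto a straight line through the origin.

```python
import numpy as np
from scipy.stats import norm

d_prime = 1.5  # assumed sensitivity (separation of "suspect absent" and "suspect present" distributions)
criteria = np.linspace(-1.0, 3.0, 9)  # response criteria, from liberal to conservative

# Equal-variance signal detection model: each criterion yields one (FA, hit) point on a curved ROC.
false_alarm_rates = norm.cdf(-criteria)    # P("yes" | suspect absent)
hit_rates = norm.cdf(d_prime - criteria)   # P("yes" | suspect present)

# A constant diagnosticity ratio instead predicts hit = ratio * FA (a straight line).
ratio = 5.78
straight_line_hits = np.clip(ratio * false_alarm_rates, 0.0, 1.0)

for fa, hit, straight in zip(false_alarm_rates, hit_rates, straight_line_hits):
    print(f"FA = {fa:.3f}   SDT hit = {hit:.3f}   constant-ratio hit = {straight:.3f}")
```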
Several large-scale studies have used ROCs rather than the diagnosticity ratio and found no evidence whatsoever for a sequential superiority effect
in lineups. In fact, all of these
studies found the opposite pattern: simultaneous was superior to sequential. So
what is the problem with the diagnosticity ratio? As you might have guessed by
now, it is that it does not control for response bias. Witnesses presented with
a sequential lineup are simply less likely to respond “yes, I recognize the suspect” than witnesses presented with a simultaneous lineup. ROCs based on empirical data disentangle accuracy from response bias, and they reveal a simultaneous superiority effect.
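As an aside, here is a toy illustration of what that unconfounding buys you (my own sketch, using d' from the equal-variance signal detection model as the bias-free measure; the lineup studies themselves compare ROC areas). Two hypothetical witnesses with the same underlying accuracy but different willingness to say “yes” end up with very different diagnosticity ratios, yet essentially the same d':

```python
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity in the equal-variance signal detection model: z(H) - z(F)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

def diagnosticity_ratio(hit_rate, false_alarm_rate):
    return hit_rate / false_alarm_rate

# Invented hit/false-alarm rates: both witnesses have a d' of about 1.0.
witnesses = {"liberal": (0.69, 0.31), "conservative": (0.31, 0.07)}

for label, (h, f) in witnesses.items():
    print(f"{label:>12}: diagnosticity ratio = {diagnosticity_ratio(h, f):.2f}, d' = {d_prime(h, f):.2f}")
```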
Rotello and colleagues demonstrate convincingly that this same problem bedevils the other areas of research I mentioned at the beginning of this post, but the broader point is clear. As they put it: “This problem – of dramatically and consistently 'getting it wrong' – is potentially a bigger problem for psychologists than the replication crisis, because the errors can easily go undetected for long periods of time.” Unless we use the proper dependent measure, replications will even aggravate the problem by enshrining artifactual findings in the literature (all the examples discussed in the article are “textbook effects”). To use another military reference: in such cases, massive replication will produce what in polite company is called a Charlie Foxtrot.
Rotello and colleagues conclude by considering the consequences of
their analysis for ongoing replication efforts such as the Reproducibility Project and the
first Registered
Replication Report on verbal overshadowing that we are all so proud of. They
refer to a submitted paper that argues that the basic task in the verbal overshadowing experiment is flawed because it lacks a condition in which the
perpetrator is not in the lineup. I haven’t read this study yet and so can’t
say anything about it, but it sure will make for a great topic for a future
post (although I’m already wondering whether I should start hiding under a ROC).
Rotello and colleagues have produced an illuminating
analysis that invites us once more to consider how
valid our replication attempts are. Last year, I had an enjoyable blog
discussion about this very topic with Dan Simons; it even uses the verbal overshadowing project as an example. Here is a page with links to this diablog.
I thank Evan Heit for
alerting me to the article and for feedback on a previous draft of this post.