Monday, June 3, 2013

How Valid are our Replication Attempts?

Direct replications are very useful, especially given the current state of our field. However, direct replications do have their limitations.

The other day, I was talking with my colleagues Samantha Bouwmeester and Peter Verkoeijen about the logistics of a direct replication that we have signed on to do for Perspectives on Psychological Science. We ended up agreeing that a direct replication informs us about the original finding but not so much about the theory that predicted it. We're obviously not the only ones who are aware of this limitation of direct replications, but here is the gist of our discussion infused with some of my afterthoughts.

We are scheduled to perform a direct replication of Jonathan Schooler’s verbal overshadowing effect. In the original study, subjects were shown a 30-second video clip of a bank robbery. Subsequently, they either wrote down a description of the robber’s face (experimental condition), or they listed the names of the capitals of American states (control condition). Then the subjects solved a crossword puzzle. Finally, they had to pick the bank robber’s face out of a line-up. The subjects in the experimental condition performed significantly worse than those in the control condition, an effect that was attributed to verbal overshadowing.

In this replication project we, along with several other groups, are following a protocol that is modeled after the original study. This makes perfect sense given that we are trying to replicate the original finding. The protocol requires researchers to test subjects between the ages of 18 and 25. They will be shown the same 30-second video clip as was shown in the original study. They will also be shown the same line-up pictures as in the original study. The experiment will be administered in person rather than online.

My colleagues and I wondered how many of these requirements are intrinsic to the theory. For example, the theory does not postulate that verbal overshadowing only occurs in 18- to 25-year-olds. In fact, it would be bordering on the absurd to predict that a 25-year-old will fall prey to verbal overshadowing whereas a 26-year-old will not. Verbal overshadowing is a theory about different types of cognitive representations (verbal and visual) and the conditions under which they interfere with one another. So what do we buy by limiting the sample to a specific age group? It is clear that we are not testing the theory of verbal overshadowing. Rather, we are testing the reproducibility of the original finding, not whether the finding itself says something useful about the theory.

Let’s look at the protocol again. As I just said, the control condition (which, incidentally, was not described in the original study but is described in the protocol) is one in which subjects generate the capitals of American states. The idea behind the control condition evidently is to give the subjects something to do that involves retrieval from memory and language production, which is what they are assumed to do in the experimental condition as well.

But a nitpicker like me might argue that even if you find a difference between the experimental condition and the control condition, which the original study did and which the replication attempts might too, this does not provide evidence of overshadowing. Perhaps it is merely the task of describing something, whatever it is, that is responsible for the effect and not the more specific task of describing the robber’s face. I’m not saying this is true but we won’t be able to rule it out empirically.

A better control condition might be one in which subjects are required to describe some target that was not in the video they just saw. For example, they could describe a famous landmark or the face of a celebrity. After all, the theory is not that describing per se is responsible for the effect. The theory is that describing the face that you’re supposed to recognize later from a line-up is going to interfere with your visual memory for that particular face.

So even if all of our replication attempts nicely converge on the finding that the control condition outperforms the experimental condition (and effect sizes are similar), this does not necessarily mean that we’ve strengthened the support for verbal overshadowing. It is still possible that a third condition in which people describe something other than the bank robber would also perform more poorly than the state capital condition. This would lead to the conclusion that simply describing something, anything really, causes verbal overshadowing.

So the question is what we want to achieve with replications. Replications as they are being (or about to be) performed right now—in Perspectives on Psychological Science (PoPS), the Open Science Framework, or elsewhere (e.g., here and here)—inform us about the reproducibility of specific empirical findings. In other words, they tell us about the reliability of those findings. They don’t tell us much about their validity. Direct replications largely have a meta-function by providing insight into the way we do experiments. It is extremely useful to conduct direct replications and I think the editors of PoPS have done an excellent job in laying out the rules of the direct replication game.

But let’s take a look at this picture that I stole from Wikipedia. Even if all replication attempts reproduce the original finding, we might be in a situation represented by the lower left panel. Sure, all of our experiments show similar effects but none have hit the bull’s eye. The findings are reliable but not valid. Where we want to be is in the bottom right panel where high reliability is coupled with high validity.
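For readers who prefer numbers to targets, the same distinction can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the "true" value, the simulated replication estimates, and the function name `summarize` are all hypothetical and have nothing to do with the actual verbal overshadowing data. Spread across replications stands in for reliability; the distance of their average from the true value stands in for validity.

```python
import random
import statistics


def summarize(estimates, true_value):
    """Return (spread, offset) for a set of replication estimates.

    spread: standard deviation across replications (small = reliable).
    offset: distance of the mean estimate from the true value (small = valid).
    """
    spread = statistics.pstdev(estimates)
    offset = abs(statistics.mean(estimates) - true_value)
    return spread, offset


rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_value = 0.0         # hypothetical bull's eye

# Simulated replications that agree with each other but miss the target.
reliable_not_valid = [rng.gauss(1.0, 0.05) for _ in range(20)]

# Simulated replications that agree with each other and hit the target.
reliable_and_valid = [rng.gauss(0.0, 0.05) for _ in range(20)]

spread_a, offset_a = summarize(reliable_not_valid, true_value)
spread_b, offset_b = summarize(reliable_and_valid, true_value)
```

In both simulated sets the spread is small, so a string of successful direct replications would look the same in either case. Only the offset tells them apart, and the offset is exactly what direct replications cannot see, because in real research the true value is unknown.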

How do we get there? Here is an idea: by extending the protocol-based paradigm. For example, a protocol could be extracted from the work on verbal overshadowing that is consensually viewed as the optimal or most incisive way to test this theory. This protocol might be like a script (of the Schank & Abelson kind) with slots for things like stimuli and subjects. We would then need to specify the criteria for how each slot should be filled.

We’d want the slots to be filled slightly differently across studies; this would prevent the effect from being attributable to quirks of the original stimuli and thus enhance the validity of our findings. To stick with verbal overshadowing, across studies we’d want to use different videos. We’d also want to use different line-ups. By specifying the constraints that stimuli and subjects need to meet, we would end up with a better understanding of what the theory does and does not claim.
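To make the slot idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the slot names, the constraints, and the example studies are hypothetical and are not part of the PoPS protocol or of anyone's statement of the theory. The point is only that each slot carries an explicit constraint, and that two studies can fill the slots differently while both still satisfy the protocol.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Slot:
    """One slot in a script-like protocol, with a constraint on its filler."""
    name: str
    description: str
    is_allowed: Callable[[Dict[str, Any]], bool]


def check_study(slots: List[Slot], study: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of slots whose filler violates the slot's constraint."""
    return [s.name for s in slots if not s.is_allowed(study[s.name])]


# Hypothetical constraints; a real protocol would have to argue for each one.
slots = [
    Slot("video", "event containing the to-be-recognized face",
         lambda v: v.get("shows_target_face", False)),
    Slot("lineup", "set of faces that includes the target",
         lambda l: l.get("contains_target", False)),
    Slot("subjects", "adult participants, no upper age limit",
         lambda s: s.get("min_age", 0) >= 18),
]

# Two hypothetical studies that fill the slots differently.
study_a = {
    "video": {"title": "bank robbery", "shows_target_face": True},
    "lineup": {"size": 8, "contains_target": True},
    "subjects": {"min_age": 18},
}
study_b = {
    "video": {"title": "street theft", "shows_target_face": True},
    "lineup": {"size": 6, "contains_target": False},
    "subjects": {"min_age": 30},
}

violations_a = check_study(slots, study_a)
violations_b = check_study(slots, study_b)
```

Here `study_a` satisfies every constraint, whereas `study_b` fails the line-up constraint because its line-up does not contain the target. Writing the constraints down this way forces the question of which requirements the theory actually imposes (an age range of 18 to 25 presumably would not be one of them).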

So while I am fully supportive of (and engaged in) direct replication efforts, I think it is also time to start thinking a bit more about validity in addition to reliability. In the end, we’re primarily interested in having strong theories.


  1. Great subject, stimulating post. I'm really worried about blind replication for fMRI experiments, and for similar reasons as you've stated. You have already nicely introduced the control condition and whether it's the most appropriate way to do it. Then there are the manifold ways an fMRI experiment can be conducted sub-optimally. When an experiment runs north of $500/hr it doesn't seem entirely reasonable not to make the replication attempt the best possible experiment. Would we really want to introduce a sub-optimal parameter, say, specifically to perform a facsimile of the original experiment? Perhaps, but only with a lot of careful thought, e.g. if we were under the impression that a systematic acquisition error was the cause of the prior (questionable) result.

    I think your focus on validity is excellent, in particular when expensive or time-consuming methods are involved.

  2. Thanks. You make a great point. The costs associated with fMRI experiments may be prohibitive for doing direct replications. Interesting how you call them "blind replications" by the way. I hadn't thought of it this way but I can see how you might want to call them that in the context of expensive fMRI experiments. Of course, it makes you wonder whether such suboptimal and hugely expensive experiments should have been run in the first place.

  3. "So while I am fully supportive of (and engaged in) direct replication efforts, I think it is also time to start thinking a bit more about validity in addition to reliability. In the end, we’re primarily interested in having strong theories."
    Oh God, yes please.

    Actually this is a nice example of psychology's fascination with phenomena. We are a science of effects, things that happened, and we will make no real progress until we mature into a science of mechanism, with theories about what should happen. The fact that, say, social priming is so easily disrupted (leading to a mixed bag of successful and unsuccessful replications) is just an enormous hint that we as yet know nothing about why something like social priming might possibly happen.

    I went on about this sort of problem here; it's been bugging me for years.

    1. I agree that it looks like we are a science of effects (the Stroop effect, the Simon effect, the Deese effect, the bystander effect, the spacing effect). Perhaps the lack of interest in mechanisms is the most obvious in social priming; see my earlier posts on this topic.

      It looks like things are beginning to change though.

  4. Interesting post. I agree that testing the validity of prior findings is theoretically more important than testing the reliability of prior findings. However, without reliability, there's no replicable substance to quibble about!

    In your PoPS replication effort, why not just add the better control condition you describe as a completely separate third condition (n=50) in addition to the original two conditions described in the protocol (n=50 each, for a total N=150)??? That way, you could simultaneously contribute to both the reliability and validity questions!!

    1. I'd thought of this as well. It's a great way to kill two birds with one stone. For this particular project, though, it might not be feasible for us given the size of our subject pool. We were going for n=120; I don't think we can run 180 subjects in the time given.