Friday, February 2, 2018

How to Avoid More Train Wrecks

Update February 3: I added a Twitter response from the first author. The comments section contains a comment by the second author.

I just submitted my review of the manuscript Experimental Design and the Reliability of Priming Effects: Reconsidering the "Train Wreck" by Rivers and Sherman. Here it is.

The authors start with two important observations. First, semantic priming experiments yield robust effects, whereas “social priming” (I’m following the authors’ convention of using quotation marks here) experiments do not. Second, semantic priming experiments use within-subjects designs, whereas “social priming” experiments use between-subjects designs. The authors are right in pointing out that this latter fact has not received sufficient attention.

The authors’ goal is to demonstrate that the second fact is the cause of the first. Here is how they summarize their results in the abstract: “These results indicate that the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect, rather than the content domain in which the effect has been demonstrated.”

This is not what the results are telling us. What the authors have done is take existing well-designed experiments (not all of which are priming experiments, by the way, as was already pointed out on social media) and then demolish them to create, I’m sorry to say, more train wrecks of experiments in which only a single trial for each subject is retained. By thus getting rid of the vast majority of trials, the authors end up with an “experiment” that no one in their right mind would design. Unsurprisingly, they find that in each of the cases the effect is no longer significant.

Does this show that “the key difference between priming effects identified as more and less reliable is the type of experimental design used to demonstrate the effect”? Of course not. The authors imply that having a within-subjects design is sufficient for finding robust priming effects, of whatever kind. But they have not even demonstrated that a within-subjects design is necessary for priming effects to occur. For example, based on the data in this manuscript, it cannot be ruled out that a sufficiently powered between-subjects semantic priming effect would, in fact, yield a significant result. We already know from replication studies that between-subjects “social priming” experiments do not yield significant effects, even with large power.

More importantly, the crucial experiment, one showing that a within-subjects design is sufficient to yield “social priming” effects, is absent from the paper. Without such an experiment, any claims about the design being the key difference between semantic and “social priming” are unsupported.

So where does this leave us? The authors have made an important initial step in identifying differences between semantic and “social priming” studies. However, to draw causal conclusions of the type the authors want to draw in this paper, two experiments are needed.

First, an appropriately powered single-trial between-subjects semantic priming experiment. To support the authors’ view, this experiment should yield a null result. This should of course be tested using the appropriate statistics. Rather than using response times, the authors might consider using a word-stem completion task. Contrary to what the authors would have to predict, I predict a significant effect here. If I’m correct, it would invalidate the authors’ claim about a causal relation between design and effect robustness.

Second, the authors should conduct a within-subjects “social priming” experiment (one that is close to the ones that they describe in the introduction). Whether or not this is possible, I cannot determine.

If the authors are willing to conduct these experiments, and omit the uninformative ones they report in the current manuscript, then they would make a truly major contribution to the literature. As it stands, they merely add more train wrecks to the literature. I therefore sincerely hope they are willing to undertake the necessary work.

Smaller points

p. 8. “In this approach, each participant is randomized to one level of the experimental design based on the first experimental trial to which they are exposed. The effect of priming is then analyzed using fully between-subjects tests.” But the order in which the stimuli were presented was randomized, right? So this means that this analysis actually compares different items. Given that there typically is variability in response times across items (see Herb Clark’s 1973 paper on the “language-as-fixed-effect fallacy”), this unnecessarily introduces noise into the analysis. Because there usually also is a serial position effect, this problem cannot be solved by taking the same item. One would have to take the same item in the same position. Therefore, it is impossible to take a single trial without losing experimental control over item and order effects. This is another reason why the “experiments” reported in this paper are uninformative.
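To make the cost of the single-trial analysis concrete, here is a rough sketch that compares the standard error of the priming effect under the two designs. It assumes a simple additive model (subject, item, and trial noise are independent), ignores item variance in the within-subjects case on the assumption that items are counterbalanced across conditions, and uses made-up variance components. The function names and all numbers are hypothetical and are not taken from the manuscript.

```python
import math

def se_within(trial_sd, n_subjects, trials_per_condition):
    """SE of the mean within-subject difference (subject intercepts cancel)."""
    # Each condition mean has variance trial_sd^2 / k; the difference doubles it.
    sd_of_difference = math.sqrt(2 * trial_sd ** 2 / trials_per_condition)
    return sd_of_difference / math.sqrt(n_subjects)

def se_between_single_trial(trial_sd, subject_sd, item_sd, n_subjects):
    """SE of the group difference when only one trial per subject is kept."""
    # Nothing cancels: subject, item, and trial noise all stay in the comparison.
    sd_per_observation = math.sqrt(trial_sd ** 2 + subject_sd ** 2 + item_sd ** 2)
    n_per_group = n_subjects / 2
    return sd_per_observation * math.sqrt(2 / n_per_group)

# Hypothetical values in milliseconds, chosen only for illustration.
within = se_within(trial_sd=150, n_subjects=40, trials_per_condition=50)
between = se_between_single_trial(trial_sd=150, subject_sd=100, item_sd=50,
                                  n_subjects=40)
print(f"within-subjects SE:  {within:.1f} ms")
print(f"single-trial SE:     {between:.1f} ms")
```

Under these assumed numbers the single-trial comparison is more than ten times noisier, so a null result from it says little about whether the effect exists.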

p. 9. The Stroop task is not really a priming task, as the authors point out in a footnote. Why not use a real priming task?

p. 15. “It is not our intention to suggest that failures to replicate priming effects can be solely attributed to research design.” Maybe not, but by stating that design is “the key difference,” the authors are claiming it has a causal role.

p. 16. “We anticipate that some critics will not be satisfied that we have examined ‘social priming’.” I’m with the critics on this one.

p. 17. “We would note that there is nothing inherently “social” about either of these features of priming tasks. For example, it is not clear what is particularly “social” about walking down a hallway.” Agreed. Maybe call it behavioral priming then?

p. 18. “Unfortunately, it is not possible to ask subjects to walk down the same hallway 300 times after exposure to different primes.” Sure, but with a little flair, it should be possible to come up with a dependent measure that would allow for a within-subjects design.

p. 19. “We also hope that this research, for once and for all, eliminates content area as an explanation for the robustness of priming effects.” Without experiments such as the ones proposed in this review, this hope is futile.

Wednesday, January 31, 2018

A Replication with a Wrinkle

A number of years ago, my colleagues Peter Verkoeijen, Katinka Dijkstra, several undergraduate students, and I conducted a replication of Experiment 5 of Kidd & Castano (2013). In that study, published in Science, participants were exposed to an excerpt from either literary fiction or from non-literary fiction.

Kidd and Castano hypothesized that brief exposure to literary fiction as opposed to non-literary fiction would enhance empathy in people because of the greater psychological finesse in literary novels than in non-literary novels. Anyone who has read, say, Proust as well as Michael Crichton will probably intuit what Kidd and Castano were getting at.

Their results showed indeed that people who had been briefly exposed to the literary excerpt showed more empathy in Theory of Mind (ToM) tests than participants who had been briefly exposed to the non-literary excerpt.

Because the study touches on some of our own interests (text comprehension, literature, and empathy) and because of a number of reasons detailed in the article, we decided to replicate one of Kidd & Castano’s experiments, namely their Experiment 5. Unlike Kidd and Castano, we found no significant effect of text condition on ToM. We wrote that study up for publication in the Dutch journal De Psycholoog, a journal targeted at a broad audience of practitioners and scientists.

Because researchers from other countries kept asking us about the results of our replication attempt, we decided to make them more broadly available by writing an English version of the article with a more detailed methods and results section than was possible in the Dutch journal. This work was spearheaded by first author Iris van Kuijk, who was an undergraduate student when the study was conducted. A preprint of the article can be found here. An attentive reader who is familiar with the Dutch version and now reads the English version will be surprised. In the Dutch version the effect was not replicated but in the English version it was. What gives?

And this brings us to the wrinkle mentioned in the title. The experiment relies on subjects having read the excerpt. However, as any psychologist knows, there are always people who don’t follow instructions. To pinpoint such subjects and later exclude their data, it is useful to know whether they’ve actually read the texts. In both experiments, reading times per excerpt were collected.

We originally reasoned that it would be impossible for someone to read and understand a page in under 30 seconds. So we excluded subjects who had one or more reading times < 30 seconds per page. This ensured that our sample included only subjects who had spent at least a reasonable amount of time on each page. This would give the manipulation, reading a literary vs. non-literary excerpt, the best chance to work.

Upon reanalyzing the data for the English version, my co-authors noticed that Kidd and Castano had used a different criterion for excluding outliers. They had used a criterion that was less stringent than ours. They had excluded subjects whose average reading times were < 30 seconds. This potentially includes subjects who may have had long reading times for one page but may have skimmed another.
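The difference between the two criteria is easy to see in a small sketch. The 30-second cutoff comes from the text above; the function names and the reading times are hypothetical and serve only as an illustration, not as the code used in either study.

```python
from statistics import mean

THRESHOLD_S = 30  # seconds per page, the cutoff used by both criteria

def excluded_per_page(page_times):
    """Stricter rule: exclude if ANY single page was read in under 30 s."""
    return any(t < THRESHOLD_S for t in page_times)

def excluded_on_average(page_times):
    """More lenient rule: exclude only if the MEAN time per page is under 30 s."""
    return mean(page_times) < THRESHOLD_S

# Hypothetical subject who skimmed one page (12 s) but lingered on the others.
skimmer = [95.0, 12.0, 80.0]
print(excluded_per_page(skimmer))    # True: one page was under 30 s
print(excluded_on_average(skimmer))  # False: the mean is about 62 s
```

A subject like this one is dropped under the per-page rule but retained under the average rule.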

Our original approach ensured that people had at least spent a sufficient amount of time on each page. This still does not guarantee that they actually comprehended the excerpts, of course. For this, it would have been better to include comprehension questions, such that subjects with low comprehension scores could have been excluded, as is common in text comprehension research. 

Because we intended to conduct a direct replication, we decided to adopt the exclusion used by Kidd and Castano, even though we thought our own was better. And then something surprising happened: the effect appeared!

What to make of this? On the one hand, you could say that our direct replication reproduced the original effect (very closely indeed). On the other hand, we cannot come up with a theoretically sound reason why the effect would appear with a less stringent exclusion criterion, which gives the manipulation less chance to impact ToM responses, and disappear with a more stringent criterion.

Nevertheless, if we want to be true to the doctrine of direct replication, which we do, then we should count this as a replication of the original effect but with a qualification. As we say in the paper:
“Taken together, it seems that replicating the results of Kidd and Castano (2013) hinges on choosing a particular set of exclusion criteria that a priori seem not better than alternatives. In fact, […] one could argue that a more stringent criterion regarding reading times (i.e., smaller than 30s per page rather than smaller than 30s per page on average) is to be preferred because participants who spent less than 30 seconds on a page did not adhere to the task instruction of reading the entire text carefully.”

The article also includes a mini meta-analysis of four studies, including the original study and our replication. The meta-analytic effect is not significant but there is significant heterogeneity among the studies.

In other words, there still are some wrinkles to be ironed out.

Tuesday, December 19, 2017

My Cattle

A while back, Lorne Campbell wrote a blog post listing the preregistered publications from his lab. This is a great idea. It is easy to talk the talk, but it’s harder to walk the walk.

So under the notion that we don't want to be all hat and no cattle, I rounded up some replications and preregistered original papers that I co-authored.

First the replications.

I find performing replications very insightful. My role in two of the RRRs listed below (verbal overshadowing and facial feedback) was rather minor but the 2016 RRR and the issues surrounding it, on which I've blogged before, felt like an onslaught. The 2012 replication study was used to iron out an inconsistency in the literature. An additional replication study is close to getting accepted and will be added to the list in an update.

These days I use direct replications primarily when I want to build on work by others. As per Richard Feynman, before we move on we first need to attempt a direct replication of the effect we want to build on. We first need to know if we can reproduce it in our own lab.

Zwaan, R.A., Pecher, D. (2012). Revisiting Mental Simulation in Language Comprehension: Six Replication Attempts. PLoS ONE 7(12): e51382.

Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., ... Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.

Eerland, A., Sherrill, A.M., Magliano, J.P., Zwaan, R.A., Arnal, J.D., Aucoin, P., … Prenoveau, J.M. (2016). Registered replication report: Hart & Albarracín (2011). Perspectives on Psychological Science, 11, 158-171. 

Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., Jr., . . . Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917–928.

Zwaan, R. A., Pecher, D., Paolacci, G., Bouwmeester, S., Verkoeijen, P., Dijkstra, K., & Zeelenberg, R. (in press). Participant nonnaiveté and the reproducibility of cognitive psychology. Psychonomic Bulletin & Review.

Next the original preregistered studies.

I started preregistering experiments several years ago. All in all, I find it an extremely important practice, quite possibly the most important thing we can do to improve the field. After a while preregistration becomes second nature and it becomes odd not to do it.

I have no experience yet with reviewed preregistrations (other than the three RRRs that I’ve participated in). My co-authors and I submitted one over three months ago but we haven’t gotten the reviews yet.

I should add that I've co-authored quite a few additional empirical papers during this period that were not preregistered. This is mainly because the experiments in those papers were conducted years ago, before preregistration was a thing.

Eerland, A., Sherrill, A.M., Magliano, J.P., Zwaan, R.A. (2017). The Blame Game: An investigation of Grammatical Aspect and Blame Judgments. Collabra: Psychology, 3(1): 29, 1–12.
         Only Experiments 3-5 were preregistered. Experiments 1&2 were conducted in 2012.

Eerland, A., Engelen, J.A.A., Zwaan, R.A. (2013). The influence of direct and indirect speech on mental representations. PloS ONE 8(6):  e65480.

Hoeben-Mannaert, L., Dijkstra, K., & Zwaan, R.A. (2017). Is color an integral part of a rich mental simulation? Memory & Cognition, 45, 974–982.

Pouw, W.J.T.L., van Gog, T., Zwaan, R.A., Agostinho. S., & Paas, F. (in press). Co-thought gestures in children’s mental problem solving: Prevalence and effects on subsequent performance. Applied Cognitive Psychology.

Sherrill A.M., Eerland A., Zwaan R.A., & Magliano J.P. (2015). Understanding how grammatical aspect influences legal judgment. PLoS ONE 10(10): e0141181.

And finally, to show that of course I also wear a stetson, here is a theoretical paper on replication. Yeehaw!

Zwaan, R.A., Etz, A., Lucas, R.E., & Donnellan, M.B. (in press). Making replication mainstream. Behavioral and Brain Sciences.

Thursday, December 7, 2017

The Long and Winding Road of our Latest Grammatical Aspect Article

A short blog post that strings together 8 tweets that I sent out today about our new paper.

Today our latest paper on grammatical aspect appeared in Collabra: Psychology. The article reflects the times we psychologists are living in. It does so not from the lofty perspective of the methodologist or statistician, but from the work floor on which the actual scientist (**ducks**) operates.

Our first two experiments were inspired by Hart & Albarracín (2011). This research itself was inspired by some of our own work but took it from cognition into the realm of social psychology, as I described in this blog post.

As the paper explains, these experiments were run in 2012, which is why they were not preregistered. Nobody was doing preregistration at the time. We were planning to build on Hart and Albarracín (H&A) in what some would call a conceptual replication but which is better thought of as an extension.

For the life of us, we couldn’t get an effect like that of H&A. Then we got down to business and started a registered replication project in which we performed a direct replication of H&A. Along with 11 other labs, we found no effect.

We were sidetracked by the replication project, especially because there were some troubling issues with the initial response to our RRR, as I describe here. We were sidetracked to the point that I’d completely forgotten about our 2012 experiments.

Luckily my co-authors had not and we decided to pick up the pieces of our study. It was clear that our research could no longer be driven by our H&A-inspired hypothesis, so we took a slightly different tack.

We conducted three more experiments, now all pre-registered, which yielded some interesting new findings, which you can read about in our paper. As usual per Collabra, the data are available and the reviews are open.

Monday, August 7, 2017

Publishing an Unsuccessful Self-replication: Double-dipping or Correcting the Record?

Collabra: Psychology has a submission option called streamlined review. Authors can submit papers that were previously rejected by another journal for reasons other than a lack of scientific, methodological, or ethical rigor. Authors request permission from the original journal and then submit their revised manuscript with the original action letters and reviews. Editors like me then make a decision about the revised manuscript. This decision can be based on the ported reviews or we can solicit further reviews.

One recent streamlined submission had previously been rejected by an APA journal. It is a failed self-replication. In the original experiment, the authors had found that a certain form of semantic priming, forward priming, can be eliminated by working-memory load, which suggests that forward semantic priming is not automatic. This is informative because it contradicts theories of automatic semantic priming. When they tried to follow up on this work for a new paper, however, the researchers were unable to obtain this elimination effect in two experiments. Rather than relegating the study to the file drawer, they decided to submit it to the journal that had also published their first paper on the topic. Their submission was rejected. It is now out in Collabra: Psychology. The reviews can be found here.

[Side note: I recently conducted a little poll on Twitter asking whether or not journals should publish self-nonreplications. A staggering 97% of the respondents said journals should indeed publish self-nonreplications. However, if anything, this is evidence of the Twitter bubble I’m in. Reality is more recalcitrant.]

I thought the other journal’s reviews were thoughtful. Nevertheless, I reached a different conclusion than the original editor. A big criticism in the reviews was the concern about “double-dipping.” If an author publishes a paper with a significant finding, it is unfair to let that same author then publish a paper that reports a nonsignificant finding, as this gives the researcher two bites at the apple.

I understand the point. What drives this perception of unfairness is our current incentive system. People are (still) rewarded for the number of articles they publish, so letting someone first publish a finding and then a nonreplication of this finding is unfair. It is as if in football (the real football, where you use your feet to propel the ball) you get a point for scoring a goal and then an additional point for missing a shot from the same position.

However understandable, this idea loses its persuasive power once we take the scientific record into account. As scientists, we want to understand the world and lay a foundation for further research. It is therefore important to have good estimates of effect sizes and the confidence we should have in them. A nonreplication serves to correct the scientific record. It tells us that the effect is less robust than we initially thought. This is useful information for meta-analysts, who can now include both findings in their collection. And even more importantly, it is very useful for researchers who want to build on this research. They now know that the finding is less reliable than they previously thought. It might prevent them from wandering into a potential blind alley.

As with anything in science, allowing the publication of self-nonreplications opens the door to gaming the system. People could p-hack their way to a significant finding, publish it and then fail to “replicate” the finding in a second paper. As an added bonus, the self-nonreplication will also give them the aura of earnest, self-critical, and ethical researchers. Moreover, the self-nonreplication pretty much inoculates the finding from “outside” replication efforts. Why try to replicate something that even the authors themselves could not replicate?

That’s not two, not three, but four birds with one stone! You might think that I’m making up the inoculation motive for dramatic effect. I’m not. A researcher I know actually suspects another researcher of using the inoculation strategy.

How worried should we be about the misuse of self-nonreplications? I’m not sure. One potential safeguard is to have the authors explain why they performed the replication. Did they think there was something wrong with the original finding or were they just trying to build on it and were surprised to discover they couldn’t reproduce the original finding? And if a researcher makes a habit of publishing self-nonreplications, I’m sure people would be on to them in no time and questions would be asked.

So I think we should publish self-nonreplications. (1) They help to make the scientific record more accurate. (2) They are likely to prevent other researchers from ending up in a cul-de-sac.

The concern about double-dipping is only a concern given our current incentive system, which is one more indication that this system is detrimental to good science. But that’s a topic for a different post.

Wednesday, July 26, 2017

Defending .05: It’s Not Enough to be Suggestive

Today another guest post. In this post, Fernanda Ferreira and John Henderson respond to the recent and instantly (in)famous multi-authored proposal to lower the level of statistical significance to .005. If you want to discuss this post, Twitter is the medium for you. The authors' handles are @fernandaedi and @JhendersonIMB.

Fernanda Ferreira
John M. Henderson

Department of Psychology and Center for Mind and Brain
University of California, Davis

The paper “Redefine Statistical Significance” (henceforth, the “.005 paper”), written by a consortium of 72 authors, has already made quite a splash even though it has yet to appear in Nature Human Behaviour. The call for a redefinition of statistical significance from .05 to .005 would have profound consequences across psychology, and it is not clear to us that the broad implications across the field have been thoroughly considered. As cognitive psychologists, we have major concerns about the advice and the rationale for this severe prescription.

In cognitive psychology we test theories motivated by a body of established findings, and the hypotheses we test are derived from those theories. It is therefore rarely true that any experimental outcome will be treated as equally likely. Our field is not effects-driven—we’re in the business of building and testing functional theories of the mind and brain, and effects are always connected back to those theories.

Standard practice in our subfield of psychology has always been based on replication. This has been extensively discussed in the literature and in social media, but it seems helpful to repeat the point: All of us were trained to design and conduct a theoretically motivated experiment, then design and conduct follow-ups that replicate and extend the theoretically important findings, often using converging operations to show that the patterns are robust across measures. This is why the stereotype emerged that cognitive psychology papers were typically three experiments and a model, where “model” is the subpart of the theory tested and elaborated in this piece of research.

Standard practice is also to motivate new research projects from theory and existing literature; the idea for a study doesn’t come out of the blue. And the first step when starting a new project is to make sure the finding or phenomenon to be built upon replicates. Then the investigator goes on to tweak it, play with it, push it, etc., all in response to refined hypotheses and predictions that fall out of the theory under investigation.*

Now, at this point, even if you agree with us, you might be thinking, “Well what would be the harm in going to a more conservative statistical criterion? Requiring .005 would only have benefits, because then we guard against Type I error and we avoid cluttering up the literature with non-results.” Unfortunately, as many have pointed out in informal discussions concerning the .005 paper, and as the .005 paper acknowledges as well, there are tradeoffs.

First, if you do research on captive undergraduates or you use M-Turk samples, then Ns in the hundreds might be no big deal. In the article, the authors estimate that a shift to .005 will necessitate at least a 70% increase in sample sizes, and they suggest this is not too high a price to pay. But setting aside the issue of non-convenience samples, this estimate is for simple effects, and we’re rarely looking for simple effects. In our business it’s all about statistical interactions, and for those, this recommendation can lead to much larger increases in sample size. And if your field requires you to test non-convenience samples such as heritage language learners, or people with any type of neuropsychological condition such as aphasia, or people with autism, dyslexia, or ADHD, or even just typically developing children, then these Ns might be unattainable. Testing such participants also requires trained, expensive staff. And yet the research might be theoretically and practically important. So if you work with these non-convenience samples, subject testing is costly. It probably requires real money to pay those subjects and the research assistants doing the testing, and the money is almost always going to come from research grants. And we all know what the situation is with respect to research funding—there’s very little of it. But even if you had the money, and you didn’t care that it came at the expense of the funding of maybe some other scientist’s project, where would you find the large numbers of people that this shift in alpha level would require? What this means in practice is that some potentially important research will not get done.
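The 70% figure can be roughly recovered with a back-of-the-envelope calculation. The sketch below assumes a two-sided z-test and 80% power, both of which are assumptions made here for illustration. Under a normal approximation the required sample size is proportional to the squared sum of the two critical z-values, so the effect size cancels out of the ratio. It covers only the simple-effect case and says nothing about interactions.

```python
from statistics import NormalDist

def relative_n(alpha, power=0.80):
    """Quantity proportional to the required n for a two-sided z-test."""
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    return (z(1 - alpha / 2) + z(power)) ** 2

# Ratio of required sample sizes when alpha moves from .05 to .005.
ratio = relative_n(0.005) / relative_n(0.05)
print(f"n must grow by about {100 * (ratio - 1):.0f}%")  # about 70% here
```

With a different assumed power the percentage shifts somewhat, so the number should be read as an order-of-magnitude check rather than a planning tool.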

Let’s turn now to Type II error. The authors of the .005 piece, to their credit, discuss the tradeoff between settling for Type I versus Type II error, and they come down on the side that Type I is costlier. But this can’t be true as a blanket statement. Missing a potential effect because you’ve set the false positive rate so conservatively could have major implications for theory development as well as for practical interventions. A false positive is a thing that a researcher might follow up and discover to be illusory; but a false negative is not a thing and therefore is likely to be ignored and never followed up, which means that a potentially important discovery will be missed.

Some have noted that the negative reaction to the .005 article has been surprisingly strong. A response we’ve heard to the kinds of concerns we’ve expressed is that the advocates of the .005 paper are not urging .005 as a publication standard, but merely as the alpha level that permits the use of the word “significant” to describe results. However, it is easy to foresee a world in which (if these recommendations are adopted) editors and reviewers start demanding .005 for significance and use it as a publication standard. After all, the goal of the piece presumably isn’t just to fiddle with terminology.

We think the strong reaction against .005 is also in part because the nature of common practice in different areas of psychology is not well represented by those advocating for major changes to research practice like the .005 proposal. Relatedly, we think it’s unfortunate that, today, in the popular media, one frequently sees references to “the crisis in psychology”, when those of us inside psychology know that the entire field is not in crisis. The response from these advocates might be to say that we’re in denial, but we’re not – as we outlined earlier, the approach to theory building, testing, replication, and cumulative evidence that’s standard in cognitive psychology (and other subareas of psychology) makes it unlikely that a cute but illusory effect will survive.

So our frustration is real. We would like to see the conversation in psychology about scientific integrity broadened to include other subfields such as ours, and many others.

*When we say these are standard practices in cognitive psychology, we don’t intend to imply that these practices are not standard in other areas; we’re simply talking about cognitive psychology because it’s the area with which we’re most familiar.

Tuesday, May 16, 2017

Sometimes You Can Step into the Same River Twice

A recurring theme in the replication debate is the argument that certain findings don’t replicate or cannot be expected to replicate because the context in which the replication is carried out differs from the one in which the original study was performed. This argument is usually made after a failed replication.

In most such cases, the original study did not provide a set of conditions under which the effect was predicted to hold, although the original paper often did make grandiose claims about the effect’s relevance to a variety of contexts including industry, politics, education, and beyond. If you fail to replicate this effect, it's a bit like you've just bought a car that was touted by the salesman as an "all-terrain vehicle," only to have the wheels come off as soon as you drive it off the lot.*

As this automotive analogy suggests, the field has two problems: many effects (1) do not replicate and (2) are grandiosely oversold. Dan Simons, Yuichi Shoda, and Steve Lindsay have recently made a proposal that provides a practical solution to the overselling problem: researchers need to include in their paper a statement that explicitly identifies and justifies the target populations for the reported findings, a constraints on generality (COG) statement. Researchers also need to state whether they think the results are specific to the stimuli that were used and to the time and location of the experiment. Requiring authors to be specific about the constraints on generality is a good idea. You probably wouldn't have bought the car if the salesman had told you its performance did not extend beyond the lot.

A converging idea is to systematically examine which contextual changes might impact which (types of) findings. Here is one example. We always assume that subjects are completely naïve with regard to an experiment, but how can we be sure? On the surface, this is primarily a problem that vexes on-line research using databases such as Mechanical Turk, which has forums on which subjects discuss experiments. But even with the good old lab experiment we cannot always be sure that our subjects are naïve to the experiment, especially when we try to replicate a famous experiment. If subjects are not blank slates with regard to an experiment, a variation of population has occurred relative to the original experiment. We've gone from sampling from a population of completely naïve subjects to sampling from one with an unknown percentage of repeat-subjects.

Jesse Chandler and colleagues recently examined whether prior participation in experiments affects effect sizes. They tested subjects in a number of behavioral economics tasks (such as sunk cost and anchoring and adjustment) and then retested these same individuals a few days later. Chandler et al. found an estimated 25% reduction in effect size, suggesting that the subjects’ prior experience with the experiment did indeed affect their performance in the second wave. A typical characteristic of these experiments is that they require reasoning, which is a controlled process. How about tasks that tap more into automatic processing?

To examine this question, my colleagues and I examined nine well-known effects in cognitive psychology, three from the domain of perception/action, three from memory, and three from language. We tested our subjects in two waves, the second wave three days later than the first one. In addition, we used either the exact same stimulus set or a different set (with the same characteristics, of course).

As we expected, all effects replicated easily in an online environment. More importantly, in contrast to Chandler and colleagues' findings, repeated participation did not lead to a reduction in effect size in our experiments. Also, it did not make a difference if the exact same stimuli were used or a different set.

Maybe you think that this is not a surprising set of findings. All I can say is that before running the experiments, our preregistered prediction was that we would obtain a small reduction of effect sizes (smaller than the 25% of Chandler et al.). So we at least were a little surprised to find no reduction.

A couple of questions are worth considering. First, do the results indicate that the initial participation left no impression whatsoever on the subjects? No, we cannot say this. In some of the response-time experiments, for example, we obtained faster responses in wave 2 than in wave 1. However, because the responses also became less variable, the effect size did not change appreciably. A simple way to put it would be to say that the subjects became better at performing the task (as they perceived it) but remained equally sensitive to the manipulation. In other cases, such as the simple perception/action tasks, responses did not speed up, presumably because subjects were already performing at asymptote level.
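The point that faster but less variable responses can leave the effect size intact can be illustrated with a toy calculation. It assumes the effect size is a standardized one such as Cohen's d (the mean difference divided by the standard deviation); the response times below are invented for illustration and are not data from our study.

```python
def cohens_d(mean_a, mean_b, sd):
    """Standardized mean difference: raw difference divided by the SD."""
    return (mean_a - mean_b) / sd

# Hypothetical response times in ms: (slow condition, fast condition, SD).
wave1 = cohens_d(650.0, 600.0, 100.0)  # raw effect 50 ms, SD 100 ms
wave2 = cohens_d(585.0, 540.0, 90.0)   # faster overall, raw effect 45 ms, SD 90 ms

print(wave1, wave2)  # 0.5 0.5
```

Responses speed up and the raw difference shrinks, but because the variability shrinks in proportion, the standardized effect stays the same.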

Second, how non-naïve were our subjects in wave 1? We have no guarantee that the subjects in wave 1 were completely naïve with regard to our experiments. What our data do show, though, is that the 9 effects replicate in an online environment (wave 1) and that repeating the experiment a mere few days later (wave 2) by the same research group does not reduce the effect size.

So, in this sense, you can step into the same river twice.

* Automotive metaphors are popular in the replication debate, see also this opinion piece in Collabra: Psychology by Simine Vazire.