"I still think replication is an extremely important part of science, and I think that’s one of the really great things about this project," Reinhard said. But he's also come to a more nuanced view of replication, that sometimes the replications themselves can be wrong, too, for any number of reasons.
"The replication is just another sort of data point that there is when it comes to the effect but it’s not the definitive answer," he added. "We need a better understanding of what a replication does and doesn’t say."
This is really the take-home point, and I wish the author of the article had spent more time emphasizing it.
A lot of things fail to replicate, as we've learned from this project and from similar replication projects in other fields. But the story is much more complicated than that. One large part of the story is unpublished replications.
The final experiment presented in a paper is almost certainly not the only time that experiment was run in the lab. Just as an example, I've now run an almost identical experiment three times with slightly different parameters, and I need to run at least one more version before I feel confident it'll be publishable. But that last iteration is the only thing that will appear in the paper, because you generally don't report earlier iterations of experiments (they're for fine-tuning parameters). The conclusion from the first three experiments is nearly identical -- in other words, the finding has already replicated twice before publication. If someone then tried to replicate my experiment and failed, it wouldn't be a case of an effect in one instance and no effect in the other; it would be an effect in three instances and no effect in just one. But we have no real way of knowing the "real" replication rate, since these earlier iterations are never published.
On the other hand, there are cases where many earlier iterations of an experiment don't "work", and then finally one does -- and that's the one that gets published. The internal replication rate for such a study is already terribly low, and a failure to replicate by another lab would be its death knell. But again, we don't have access to that data.
Another type of unpublished replication comes from experiments that build on a paper's original finding, fail, and are never published. For example, say a new paper reports that seeing happy facial expressions before doing some task makes people better at the task than seeing sad facial expressions. I'm curious whether this extends to other expressions, so I run an experiment with happy, sad, surprised, and angry faces. But when I do the experiment, I fail to replicate the original happy-vs-sad difference. I can't be sure whether that's because the effect isn't real or because I set something up wrong in my experiment. Either way, null results generally aren't publishable, so this study never sees the light of day. Now imagine a bunch of other people trying this and getting the same result. No one ever finds out about these failed studies, because they weren't published, and so it's assumed that the original finding stands.
This is somewhat of a tangent, but I think it's interesting to ask why there are so many failed replications. These were studies that were published in top psychology journals, meaning that the experimental design is very high quality. If the experimental design was so good, then why are they getting false positives? One potential answer is...
Unfortunately, there's rampant statistical illiteracy among scientists. You would think there isn't, but there is. Just go to /r/askstatistics and see all the grad students asking how to do a t-test... it's horrifying. And inappropriate statistical methods can show up in the original research, in the replication, or both. So you can't treat a replication as the end-all be-all for a finding; it could be the replicators who used an inappropriate method.
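To make the "inappropriate method" point concrete, here's a toy sketch (all numbers invented) of the same simulated before/after data analyzed two ways: a paired test, which is appropriate when the measurements are linked within subjects, and an unpaired test, which throws that linkage away. P-values use a normal approximation from the standard library to keep the snippet dependency-free; with real data you'd use proper t-tests (e.g. scipy.stats.ttest_rel / ttest_ind).

```python
# Hypothetical within-subject data: each subject measured twice,
# with a small true improvement of 0.3 between the measurements.
import math
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)
n = 50
baseline = [rng.gauss(0, 1) for _ in range(n)]           # per-subject baseline
after = [b + 0.3 + rng.gauss(0, 0.3) for b in baseline]  # small true effect

def two_sided_p(z):
    """Two-sided p-value from a z statistic (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Appropriate analysis: paired test on the per-subject differences.
diffs = [a - b for a, b in zip(after, baseline)]
z_paired = mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Inappropriate analysis here: unpaired test, which ignores the pairing
# and so buries the small effect in the large between-subject variation.
se = math.sqrt(stdev(baseline) ** 2 / n + stdev(after) ** 2 / n)
z_unpaired = (mean(after) - mean(baseline)) / se

print(f"paired   p = {two_sided_p(z_paired):.4f}")    # clearly significant
print(f"unpaired p = {two_sided_p(z_unpaired):.4f}")  # much larger p
```

Same data, same true effect -- but the wrong test can fail to detect it, which is exactly why a single analysis (original or replication) shouldn't be taken as definitive.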
Beyond the choice of analysis, even decisions like how many subjects and how many items to use are crucial, and a lot of researchers don't know how to make them. Most scientists have no idea how to do a basic power analysis to determine how many subjects they'd need to detect an effect of a given size. If you expect your effect size to be small, you're going to need a lot of subjects to reliably detect it; and if the real effect size is 0 (i.e., no effect), you also need a lot of subjects to reliably establish the non-difference. So if you don't have an appropriate number of subjects, you can very easily inflate your false positive rate. Couple this with the fact that a lot of researchers keep recruiting subjects until their analysis comes out significant, and you have a trainwreck of false positives all over science.
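Both points can be sketched with nothing but the standard library. Part 1 is the usual normal-approximation sample-size formula, n ≈ 2(z_{α/2} + z_power)² / d² per group. Part 2 simulates the "keep recruiting until it's significant" strategy when the true effect is zero, using a simple z-test with known variance to keep the code short (a simplification; a real analysis would use a t-test). All parameters here are invented for illustration.

```python
import math
import random
from statistics import NormalDist

norm = NormalDist()

# --- Part 1: basic power analysis (normal approximation) ---
def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate subjects per group to detect effect size d."""
    z_a = norm.inv_cdf(1 - alpha / 2)
    z_b = norm.inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_group(0.8))  # "large" effect: a few dozen per group
print(n_per_group(0.2))  # "small" effect: hundreds per group

# --- Part 2: optional stopping inflates false positives ---
def p_value(xs):
    """Two-sided p for H0: mean = 0, known sd = 1 (z-test)."""
    z = abs(sum(xs)) / math.sqrt(len(xs))
    return 2 * (1 - norm.cdf(z))

def run_experiment(rng, n_start=10, n_max=100, step=5, peek=True):
    """One simulated study under the null (true effect is zero)."""
    xs = [rng.gauss(0, 1) for _ in range(n_start)]
    while True:
        if peek and p_value(xs) < 0.05:
            return True                # "significant" -- stop and publish
        if len(xs) >= n_max:
            return p_value(xs) < 0.05  # final test at the planned n
        xs += [rng.gauss(0, 1) for _ in range(step)]

sims = 2000
rng = random.Random(1)
fp_peeking = sum(run_experiment(rng) for _ in range(sims)) / sims
rng = random.Random(1)
fp_fixed_n = sum(run_experiment(rng, peek=False) for _ in range(sims)) / sims

print(f"false positive rate, fixed n:     {fp_fixed_n:.3f}")  # near nominal 0.05
print(f"false positive rate, peeking:     {fp_peeking:.3f}")  # well above 0.05
```

The fixed-n analysis produces false positives at roughly the nominal 5% rate, while checking for significance every 5 subjects and stopping as soon as p dips below .05 produces several times that -- with a true effect of exactly zero.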