Jason Mitchell’s essay

As of yesterday I thought the debate about replication in psychology was converging on consensus in at least one respect. While there was still some disagreement about tone, basically everyone agreed that there was value in failed replications. But then this morning, Jason Mitchell posted this essay, in which he describes his belief that failed replication attempts can contain errors and therefore “cannot contribute to a cumulative understanding of scientific phenomena”. It’s hard to know where to begin when someone comes from a worldview so different from one’s own. Since there’s clearly a communication problem here, I’ll just give two examples to illustrate how I think about science.

  • Example 1. A rigorous lab conducts an experiment using a measurement device that requires special care. The effect size is d=0.5. Later, a different lab with no experience using the device tries to quickly replicate the experiment and computes an effect size of d=0.0.
  • Example 2. A small sample experiment in a field with a history of p-hacking shows an effect size of d=0.5. Another lab tries to replicate the study with a much larger sample and computes an effect size of d=0.0.

In both cases, I’d have subjective beliefs about the true effect size. For the first example, my posterior distribution might peak around d=0.4. For the second example, my posterior distribution might peak around d=0.1. In both cases, the replication would influence my posterior, but to varying degrees. In the first example, it would cause a small shift. In the second, it would cause a big shift. Reasonable people can disagree on the exact positions of the posteriors, but basically everyone ought to agree that our posteriors should incrementally adjust as we acquire new information, and that the size of these shifts should depend on a variety of factors, including the possibility of errors in either the original experiment or in the replication attempt. Maybe it’s because I’m stuck in a worldview, but none of this even seems very hard to understand. 
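To make the arithmetic behind those two posteriors concrete, here is a minimal sketch of one way to do the updating: a precision-weighted (normal-normal) combination of the two studies under a flat prior. The standard errors below are assumptions I picked for illustration, not numbers from any real study. A smaller standard error stands for a study I trust more, whether because of a larger sample or more careful measurement, and `combine` is a hypothetical helper name.

```python
import math


def combine(estimates):
    """Combine (effect_size, std_error) pairs by precision weighting.

    Under a flat prior and normal likelihoods, the posterior mean is the
    precision-weighted average of the estimates, and the posterior
    precision is the sum of the individual precisions.
    """
    precisions = [1.0 / se ** 2 for _, se in estimates]
    total_precision = sum(precisions)
    mean = sum(p * d for p, (d, _) in zip(precisions, estimates)) / total_precision
    return mean, math.sqrt(1.0 / total_precision)


# Example 1 (assumed numbers): careful original, hasty replication.
# The original gets the smaller standard error, so it dominates.
example_1 = combine([(0.5, 0.1), (0.0, 0.2)])

# Example 2 (assumed numbers): small, possibly p-hacked original,
# large replication. Now the replication dominates.
example_2 = combine([(0.5, 0.2), (0.0, 0.1)])

print(f"Example 1 posterior mean: {example_1[0]:.2f}")
print(f"Example 2 posterior mean: {example_2[0]:.2f}")
```

With these assumed standard errors the posterior means land near 0.4 and 0.1. The same failed replication moves the estimate a little in one case and a lot in the other, and in neither case does it count for nothing.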

Jason Mitchell sees things differently. For him, all failed replications contain “no meaningful evidentiary value” and “do not constitute scientific output”. I don’t doubt the sincerity of his beliefs, but I suspect that most scientists and nonscientists alike will find these assertions to be pretty bizarre. NHST (null hypothesis significance testing) isn’t the only thing causing the crisis in psychology, but it’s pretty clear that this is what happens when people get too immersed in it.

11 thoughts on “Jason Mitchell’s essay”

  1. The oddest thing about Mitchell’s essay is his discussion of black swans:

    “Suppose I assert the existence of some phenomenon, and you deny it; for example, I claim that some non-white swans exist, and you claim that none do (i.e., that no swans exist that are any color other than white). . . . A single positive example is sufficient to falsify the assertion that something does not exist; one colorful swan is all it takes to rule out the impossibility that swans come in more than one color. In contrast, negative examples can never establish the nonexistence of a phenomenon, because the next instance might always turn up a counterexample. . . . a single positive finding (of a non-white swan) had more evidentiary value than millennia of negative observations. What more, it is clear that the null claim cannot be reinstated by additional negative observations: rounding up trumpet after trumpet of white swans does not rescue the claim that no non-white swans exists. This is because positive evidence has, in a literal sense, infinitely more evidentiary value than negative evidence.”

    This is a bad analogy to many scientific inquiries, particularly in social psychology. In the black swan case, if the hypothesis is merely “not 100% of swans are white” or “at least one black swan exists somewhere in the world,” then seeing one black swan is enough.

    But that’s hardly the sort of question that most social psychologists seem to be interested in. They are interested in generalizable findings about how human beings think. The better analogy would be as if someone said, “All or nearly all swans in Australia are black.” If that is the claim, then it is indeed highly relevant if other people look at larger numbers of Australian swans and find that the sampled swans are mostly or all white. Yes, it could be that the “replicators” here made some mistake or that they are studying a completely non-representative sample of Australian swans, but it would be absurd to claim that their findings can never be taken as counter-evidence to a claim that Australian swans are generally black.

    So, to take the recent controversy from the Social Psychology issue (disclaimer: I helped fund that issue), Simone Schnall wasn’t merely trying to say that she had found at least one person in the world who became less morally judgmental after a cleanliness prime. She was describing an effect that could be true of all human beings. But if that effect isn’t replicated in a larger sample of human beings from elsewhere, then either the replicators did something wrong, or not all human beings demonstrate that effect, or both.

    I don’t know how anyone can claim to have a priori knowledge that such a non-replication is everywhere and always because of experimenter error and therefore has no evidential value. And it stacks the deck considerably for Mitchell to suggest that positive findings can be disputed only if someone comes up with independent proof that the original study was flawed apart from the mere fact that it wasn’t replicated.

  3. Just to echo Stuart’s comment: I heard a similar argument to the ‘black swan’ argument by Lee Ross recently. What Mitchell and Ross seem to be forgetting is sampling error. Existence proofs work for things that are black-and-white (like swans) and can be measured without sampling error. As soon as your result is probabilistic and rests on measurements that contain error, there is no such thing as an existence proof. It kind of blows my mind that this even needs to be said. But apparently it does, over and over again… (Thanks to Chris and Sanjay for contributing more substantially to this discussion).

  4. @Stuart and @simine: Totally agreed that his black swan thing didn’t hold up. And @simine, yeah it blows my mind too that this even needs to be said.

  6. @Stuart and @simine – yeah I also thought the black swan thing was totally off. Wot I rote:
    ‘But most experiments do not give us a “positive” result in this sense – they tell us the probability of a result given that the data were generated by a null distribution, not about the truth of our hypothesis. “Positive” experimental studies cannot be reasoned about in the same way as this illustration of the limits of induction.’ I have a few more thoughts on his arguments here (apologies for shameless blog pimping)- http://autap.se/8

  8. I mean, just imagine: there are Harvard professors who believe that a p-value below 0.05 does the same to their null hypothesis as a black swan does to the all-swans-are-white hypothesis! I can hardly think of a clearer sign of the need to reform the training of future scientists.

    Here are my two cents on Mitchell’s bizarre piece, from the vantage point of mathematical statistics: http://haggstrom.blogspot.se/2014/07/on-value-of-replications-jason-mitchell.html

