Failed Replications and “Emptiness”

Recently, a Jason Mitchell of no less than Harvard University published a piece of writing entitled “On the emptiness of failed replications”, within which Mitchell decries the focus on replications within Social Psychology, and (to, I hope, a lesser degree) within science as a whole. I found it an interesting read, and an excellent example of how references/citations can serve the purpose of signalling (i.e. name-dropping) rather than adding anything substantive to a paper. Invoking Quine and Kuhn certainly signals that one has a passing familiarity with Philosophy of Science, but the rest of the essay quickly highlights how mistaken that impression is.

While I think it’s important for popular science sites to highlight this kind of thing, as Annalee Newitz at io9 did, the article just abuses Mitchell: there’s no explanation of how he’s wrong. This is, I think, a common problem within skeptic/atheist/science-enthusiast circles whereby “let’s point and laugh” is often substituted for understanding the problems. Here’s my breakdown of Mitchell’s article.

Mitchell’s opening two paragraphs are right on the money with regards to the practices of modern scientists: as Lakatos observed, in many research projects there is a core hypothesis and then there are several auxiliary hypotheses. These auxiliary hypotheses include “the equipment is working correctly”, “an fMRI is the best way to generate the data I want”, “clothing has no effect on the fMRI machine” and so on. So when an expected result is not found, it’s the case that these auxiliary hypotheses are the first to be heavily inspected. It’s not the case that the core hypothesis is immediately thrown out (or, as Mitchell says, “that logic and mathematics suffer some fatal flaw”).

As an example of this (which I’m stealing from Massimo Pigliucci, as discussed on the Rationally Speaking Podcast #29), let’s talk about the discovery of Neptune. At the time, Uranus was the outermost known planet, but its orbit did not match the orbit predicted by Newtonian mechanics. For years astronomers resisted rejecting Newton’s theories, and instead hypothesized that there was a planet beyond Uranus. Why? Because one of the starting auxiliary hypotheses was “there are no planets beyond Uranus”, and that was discarded in favour of additional study, which led to the discovery of Neptune.

Of course, when Mercury’s orbit was discovered to be a mismatch for the expected orbit (as per Newtonian mechanics), they likewise hypothesized an additional planet and they were wrong. Turns out that they actually had to throw out Newtonian mechanics to explain the orbit of Mercury.

So Mitchell’s opening paragraphs, at least, aren’t wrong; they just reflect an extremely superficial understanding of one of the main problems within Philosophy of Science: at what point do we know that the core hypothesis of a research program is simply wrong? I think where Mitchell goes off the rails is with the following claim:

To put a fine point on this: if a replication effort were to be capable of identifying empirically questionable results, it would have to employ flawless experimenters.

He brings this up more than once, and this is a clear example of a false dichotomy: either we employ flawless experimenters, and then a failure to replicate indicates that a core hypothesis is wrong, or we don’t employ flawless experimenters, in which case a failure to replicate never indicates that a core hypothesis is wrong. Mitchell’s article hammers on this point over and over again.

And here is the rub: if the most likely explanation for a failed experiment is simply a mundane slip-up, and the replicators are themselves not immune to making such mistakes, then the replication efforts have no meaningful evidentiary value outside of the very local (and uninteresting) fact that Professor So-and-So’s lab was incapable of producing an effect.

When Mitchell says these things, he’s presenting a hopelessly naive understanding of science in general.

  1. Just because it’s unlikely that we can eliminate all routine errors from an experimental run does not mean that it’s impossible.
  2. Replication A may have failed due to a mundane slip-up, as did Replication B, but those slip-ups may be different, meaning that we can learn about different slip-ups.
  3. A large body of failed replications requires us to adjust our Bayesian prior probabilities: what are the odds that every single lab failed to replicate the experiment, vs. the original lab made an error that generated the result?
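To make the third point concrete, here is a rough sketch of that Bayesian adjustment in Python. Every number in it is an assumption of mine for illustration, not something taken from Mitchell’s essay or from any real study: a 50/50 prior that the effect is real, replications with 80% statistical power, a 5% false-positive rate, and replicators who botch one run in five through a mundane slip-up that hides a real effect. The function name and parameters are hypothetical.

```python
def posterior_effect_is_real(prior, n_failures, power=0.8, alpha=0.05, slip_rate=0.2):
    """Posterior probability that an effect is real after one original
    positive result followed by n_failures independent failed replications.

    All parameter values are illustrative assumptions:
      prior      - belief that the effect is real before seeing any data
      power      - chance a carefully run study detects a real effect
      alpha      - chance a study reports an effect that is not there
      slip_rate  - chance a lab makes a mundane slip-up that hides a real effect
    """
    # A fallible lab finds a real effect only if it avoids a slip-up
    # AND the study has enough power to detect the effect.
    p_pos_if_real = (1 - slip_rate) * power
    # If there is no effect, a positive result is a false positive.
    p_pos_if_null = alpha

    # Likelihood of the full record (1 positive, n_failures negatives)
    # under each hypothesis, treating the labs as independent.
    like_real = p_pos_if_real * (1 - p_pos_if_real) ** n_failures
    like_null = p_pos_if_null * (1 - p_pos_if_null) ** n_failures

    # Bayes' rule.
    evidence = prior * like_real + (1 - prior) * like_null
    return prior * like_real / evidence


if __name__ == "__main__":
    for n in range(0, 7):
        print(n, round(posterior_effect_is_real(0.5, n), 3))
```

Under these made-up numbers, the original positive result on its own pushes the probability that the effect is real to about 0.93, and five failed replications by admittedly fallible labs pull it down to about 0.09. The exact figures depend entirely on the assumptions, but the direction does not: imperfect replicators still produce evidence.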

The Perfect is Not the Enemy of the Good

We don’t have flawless experimenters, so any argument that asserts we must have them in order to progress science is nonsense. What we do have is a large quantity of skilled (to various degrees) experimenters, and we should trust the results of a given lab based on the known skill of those particular experimenters. Learning the “very local … fact that Professor So-and-So’s lab was incapable of producing an effect” is not uninteresting, especially if they demonstrate a history of making the same, or similar, mundane errors: we learn that we should not trust results from that lab (which means, in practice, that we should apply extra scrutiny to whatever results they actually produce). This seems like a somewhat critical result, given that science is little more than a network of trusted experimenters and results. What happens when an experimenter is known to have committed fraud? Every paper they produced is reviewed, because the trust has been violated.

Our inability to produce flawless experimenters does not mean that we are unable to produce good experimenters, which means that the people within a field will get a sense of which experimenters are doing good, detail-orientated, meticulous work, and which experimenters are more inclined to rushed and sloppy work. And, of course, it’s entirely possible (and likely) that even the best experimenters will generate some errors, but the odds of them generating systemic errors that bias the entire experiment are reduced when we do more and more replications. At some point, as I mentioned above, we need to stop the replications and note that while one (1) lab produced results, five or six labs failed to do so, and while false negatives are absolutely a problem, so are false positives. Mitchell seems to be under the impression that they simply don’t exist (or, at least, that’s the impression that his article gives).

A Scientific Experiment is Not Reducible to a Recipe in a Cookbook

Mitchell’s responses to what he considers the most likely rejoinders are painful. His first rejoinder is that replicating someone else’s experiment is akin to repeating the recipe in a cookbook: if one fails to produce results, then it’s completely the fault of the inept cook, and not the recipe. This response fails to recognise the difference in complexity between baking cookies, and determining if people can score differently in math tests when primed to think of themselves in terms of race as opposed to gender. The two situations are qualitatively different in significant ways: attempting to evoke a chemical reaction in some easily prepared food mix is radically different from asking a large group of people to think in precise ways, which largely assumes that a significant proportion of the room has internalized certain stereotypes about race and gender. I find it hard to believe that his claim here is made in good faith.

While it’s certainly true that there are a number of unstated components to the art that is ‘doing science’, Mitchell’s response that ‘well, I guess just everyone else sucks at it, so stop trying to replicate my findings’ seems both puerile and inappropriate. This argument seems to express a basic failure to understand the nature of P-values, which are generally highly prized in Social Psychology. To have discovered an effect at P < 0.05 is to claim that, if there were no real effect, a result at least this extreme would turn up by chance less than 5% of the time: how is Mitchell guaranteeing that the results that he is so enamoured of are not the result of mere chance? This is the very thing that replication helps us to determine!
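A small simulation shows why replication matters here. This is a sketch under assumptions of my own, not a model of any particular study: each simulated “experiment” draws 30 values from a standard normal distribution with no real effect at all, and is analysed with a two-sided z-test that treats the standard deviation as known. The function names are hypothetical, and only the Python standard library is used.

```python
import random
from statistics import NormalDist, fmean


def null_experiment_p_value(rng, n=30):
    """Run one experiment in which there is NO real effect and return its
    two-sided p-value from a z-test (standard deviation assumed known = 1)."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = fmean(sample) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(trials=20000, alpha=0.05, seed=0):
    """Estimate how often chance alone yields a 'significant' result:
    once for a single study, and once when an independent replication
    must also come out significant."""
    rng = random.Random(seed)
    single = 0
    replicated = 0
    for _ in range(trials):
        original = null_experiment_p_value(rng) < alpha
        replication = null_experiment_p_value(rng) < alpha
        single += original
        replicated += original and replication
    return single / trials, replicated / trials


if __name__ == "__main__":
    single_rate, replicated_rate = false_positive_rates()
    print("significant by chance, single study:", single_rate)
    print("significant by chance, study plus replication:", replicated_rate)
```

With nothing to find, roughly one study in twenty still comes out “significant”, which is what P < 0.05 means. Requiring a single independent replication to agree cuts that to roughly alpha squared, a small fraction of a percent.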

Flimsy Effects are Definitely Worth Studying

Interestingly, I agree with Mitchell when he rails against those who claim that effects that are not robust, or easily reproducible, should not be studied (although I’m not sure how prevalent that view actually is), but I find a deep and bizarre conflict in this section, as compared to his earlier writing. Has he not spent several paragraphs all but decrying those who cannot reproduce findings as incompetent fools who should be drummed out of the profession? Has Mitchell not spent several paragraphs telling us that (and I quote) “the replication efforts have no meaningful evidentiary value outside of the very local (and uninteresting) fact that Professor So-and-So’s lab was incapable of producing an effect”?

How does one square this circle? I mean it’s pretty clear that Mitchell seems to believe that people who cannot produce results are incompetent, yet here he is proclaiming that just because you don’t find a result doesn’t mean you should stop searching. If one is (quote) “incapable” of producing an effect, shouldn’t one pack up and focus on latte art instead? Mitchell seems to be of two minds about this…

The Asymmetry Between Positive and Negative Evidence

I think Mitchell’s third response to his imaginary critics is quite the excellent piece of rhetoric, insofar as he has moved the goalposts far, far away from Social Psychology in order to pretend that he’s talking about solid physical objects which can be clearly seen. The ‘black swan’ concept has long been a favourite in Philosophy of Science, as it’s quite the striking example. However, Mitchell is making a Category Error in his complaint here. A few moments ago we were discussing implicit prejudice, and now we’re suddenly discussing the existence of black swans.

The clever rhetorical trick that he’s pulled here is that when it comes to implicit prejudice, we’re primarily interested in how prevalent it is in society, not whether it simply exists. Or, to put it another way, if only 100 people in the world experienced implicit prejudice, it would be an effect that has little to no impact on the world. So knowing whether it affects 100 people or 7,000,000,000 people is important. Analogously, that means that yes, it’s important to note all the white swans: doing so would not tell us that the black swans have magically stopped existing, but it would tell us how rare they are.

An issue within Social Psychology is not only that effects are flimsy (i.e. difficult to replicate), but also that the effects may themselves be rare. Given that most Social Psychology studies are done on a very specific population, it’s necessary to replicate those experiments with different populations to check whether the results truly can be generalised to the rest of the world, or are restricted to the early-20s, white, American university population they are typically conducted upon.

Mitchell consistently misunderstands (one hopes not willfully) that these replications are not done purely to cast aspersions on the original experimenters, but to allow us to place the original experiment in the correct context, which helps us to understand whether the original experiment represents a rare and unusual phenomenon, or a common phenomenon, or (at worst) a phenomenon created through fraud. Mitchell claims to be against this last possibility, yet offers no alternative solution to replication for weeding out fraud (which is more common than one would hope).

The rest of Mitchell’s diatribe consists of little more than casting aspersions on people who understand science, so I’m not going to address it, but I’d like to point out a serious issue with Mitchell’s complaint: at no point does he restrict his complaints to the replication of Social Psychology. His complaints, by invoking black swans, noble gases and the like, are against replication in science as a whole. I, for one, am staggered by this bizarre notion. Certainly, restricting his complaint to Social Psychology would do little to strengthen his argument, but at least he would have reasonable grounds to suggest why replication is so difficult (i.e. the complexity of humans as compared to the interaction of two known chemical compounds). But his criticism is broad and applies to science as a whole.

In that regard, Mitchell displays profound ignorance of the contemporary and serious issues in modern scientific practice. While he touches on the ‘file drawer effect’ in his ‘asymmetry’ section, his dismissal is as ignorant as any conspiracy nut’s dismissal of the facts. It is a fact that pharmaceutical companies produce vast quantities of research that show no results, until finally that ‘by chance’ study shows up, and they publish that in support of their new wonder drug (that really isn’t any better than anything already on the market). Pfizer is on the record as doing this exact thing, and there are many, many more. How would we tell that these drugs (generally speaking) are actually useless without attempting to replicate the results? Are we to conclude that all the scientists who failed to replicate the by-chance result that Pfizer published are actually incompetent hacks, as Mitchell suggests? This seems to be a wholly inadequate response.
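The arithmetic behind the file drawer effect is simple enough to sketch. The snippet below is an illustration under my own assumptions, not data from any company: it supposes a treatment with no real effect, independent studies, and the conventional 5% significance threshold, and asks how likely it is that at least one study comes out “significant” by chance. The function name is hypothetical.

```python
def chance_of_publishable_fluke(n_studies, alpha=0.05):
    """Probability that at least one of n_studies independent studies of a
    useless treatment reaches significance purely by chance."""
    # Each study avoids a false positive with probability (1 - alpha);
    # all n_studies avoid one with probability (1 - alpha) ** n_studies.
    return 1 - (1 - alpha) ** n_studies


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(chance_of_publishable_fluke(n), 3))
```

Run twenty such studies and the chance of getting at least one publishable fluke is about 64%. If the other nineteen stay in the file drawer, the only way for outsiders to find out is to try to replicate the one that was published.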

Replication is, to anyone with a basic understanding of the process of science, a critical part of science. Peer-review doesn’t end when the article is published: attempts to replicate are peer-review. And yes, when the results are not replicated that may (and should) bring a spot-light down on both labs to ensure that best practices are in play. Complaining about people attempting (and failing) to replicate your work is certainly NOT the act of a cry-baby (who are these unnamed bullies?), but it is absolutely the act of someone who has less than a passing understanding of the scientific process.

I guess Mitchell is in good company.

Follow Brian on Twitter!
