This is going to be one of those annoying posts where I tell you to first go and read something else before coming back.
Sometimes, when research results are too good to be true, people start thinking there might be something fishy going on. Jens Foerster, for instance, was called out on the excessive linearity of his effects. He’d measure some cognitive variable, then apply one experimental manipulation that should decrease the expression of this variable, and another manipulation that should increase it. Too many times, the increase and decrease were just about equal in magnitude. What the nasty data police then tested was how likely it is to get such a beautifully linear effect from his small samples, even under the assumption that the effect in the population truly is linear. And the chances of getting such clean results, so many times in a row, are virtually nil. Real data are noisy; these were too good to be true.
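For the curious, here is a minimal sketch of that kind of check: a toy simulation with made-up sample sizes and effect sizes, not the actual analysis that was run on Foerster’s data. Even if we assume the population effect really is perfectly symmetric, small noisy samples rarely produce an observed decrease and increase that match almost exactly, and the probability of that happening many times in a row collapses towards zero.

```python
import numpy as np

rng = np.random.default_rng(0)

n_per_group = 20     # small sample per condition (made up)
true_effect = 0.5    # truly symmetric shift, in SD units (made up)
tolerance = 0.05     # how closely the two observed shifts must match to look "perfectly linear"
n_sims = 100_000

def looks_perfectly_linear():
    control  = rng.normal(0.0, 1.0, n_per_group)
    decrease = rng.normal(-true_effect, 1.0, n_per_group)
    increase = rng.normal(+true_effect, 1.0, n_per_group)
    down = control.mean() - decrease.mean()   # observed size of the decrease
    up   = increase.mean() - control.mean()   # observed size of the increase
    return abs(up - down) < tolerance

p_one = np.mean([looks_perfectly_linear() for _ in range(n_sims)])
print(f"one study coming out this clean: p ~ {p_one:.3f}")
print(f"ten studies in a row:            p ~ {p_one ** 10:.1e}")
```

With numbers like these, even a single near-perfect result is uncommon, and a whole series of them is exactly what made people suspicious.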
Yesterday I came across a blog post that’s too good to be true in a different way. It outlined problematic research practices in such a neat manner that I wondered whether it might be satire. But it’s not. I’ll do a quick recap, but really, do go and read it.
What it describes is the winning strategy for staying in science, through the contrasting behaviour of two lab members. The protagonist is charmingly called The Turkish Woman (it must be satire!), a person who throughout the text complies with her supervisor’s instructions. As her reward, she publishes five papers. These are referenced at the bottom of the post.
What she is initially introduced to is a ‘failed study which had null results’, a rich dataset where the hypothesis was not confirmed. She is told these data were expensive and time-consuming to acquire – and that there must be something in there that can be salvaged. (It’s going to be a cautionary tale about incentivising researchers to p-hack!) Every day, they pore over the data, reanalyse them in new ways, and come up with a different set of hypotheses. (It’s going to be about hindsight bias!) This goes on until a variety of discoveries are made while ‘digging through the data’, and a set of papers gets published. (Read it!)
The other lab member is a postdoc who was not interested in being involved in this. They publish less, they leave academia, and their main role in the post is to provide a contrast to the winning strategy.
It is a post that aims to accentuate hard work, efficiency, capitalizing on opportunities, a collaborative spirit, and dedication. It ends up highlighting questionable research practices, the misrepresentation of exploratory research as confirmatory, and a lack of understanding of why null results are important.
I took a peek at three papers from this series. None of them mention that other publications have come from the same dataset. They all report only a selection of the tested variables, not all of them, as if only a few things were measured each time. These are linked to hypotheses that are made to look like they existed before the data were collected.
The post even outlines how to data mine. You have a hypothesis, but maybe it didn’t work. But did it work at lunch, if not at dinner? Did it work with small groups and not large groups? One can go on like this, creating new combinations of variables until a result shines through. Statistical tests are a messy business: our criteria are not stringent, the samples are small, and something is bound to come up as significant if we look hard enough. To be clear, the result is significant whether we went looking for it or not – testing the data in many different ways is not the problem in itself. The problem is not reporting all the other variables that were collected and all the other tests that were carried out. Because if we knew this was one result out of, say, 200 tests, then we would be less likely to give it much credence. Especially compared to a situation where it’s exactly the thing that the researchers had a hypothesis about, and lo and behold, it turned out to be true!
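To make that concrete, here is a minimal sketch (with made-up numbers, not anything taken from the papers themselves) of what happens when you run lots of small, lax tests on data that contain no real effect at all:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_tests = 200       # e.g. outcomes x subgroups x occasions (lunch vs dinner, small vs large groups, ...)
n_per_group = 15    # small samples (made up)
alpha = 0.05

significant = 0
for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, n_per_group)   # pure noise: no true effect anywhere
    b = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(a, b)
    significant += p < alpha

print(f"{significant} of {n_tests} tests on pure noise came out 'significant'")
```

Roughly alpha times n_tests, so about ten of the two hundred, will clear the bar by chance alone. Report only those ten, without the other hundred and ninety, and they read like confirmed hypotheses.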
Critically, when a result like that becomes part of the published record, there has to be a way of disconfirming it. Treating an experiment as a failure because the hypothesis was not confirmed means that null findings won’t get published. What we end up with in the literature is a whole lot of random noise, which nobody gets to correct.
The author of the post is a professor at Cornell, well cited and successful. The post stings because there is truth to it – the person producing such noise folded into compelling narratives might well be more likely to make it in academia than the person who would publish the null result instead. I was not the only one to think it must be satire. But it’s certainly educational. Too good to be true, such a description of the driving forces of the replication crisis laid bare, for all eyes to see.
Andrew Gelman
Hi, it’s actually not so clear that the story happened exactly as Wansink described it (see P.P.P.S. here: http://andrewgelman.com/2016/12/15/hark-hark-p-value-heavens-gate-sings/), which points to a paper claiming that The Turkish Woman collected the data herself, which contradicts the claim in the blog post that she came into the study after the experiment had been performed.
To me, the worst thing about the entire episode (even beyond the Bruno-Frey-like move of writing four papers that manage to cite zillions of other Wansink articles without citing each other) is that Wansink never gets around to stating his Plan A that motivated the experiment but which, in his telling, “failed.”
That should be news, no? If you really care about your research and you think you have a great idea, and then it doesn’t work, I’d think you’d want to share that with the world, as this is one lead that nobody needs to follow up. It seems like poor research practice to keep one’s failures a secret. If you have the energy to publish 4 separate papers about bad ideas that seemed to have worked, why not publish 1 paper about the good idea that failed? It’s not like he’d even need to write a new paper: he could’ve just added it to the existing paper, explaining that the study was motivated by a certain attractive hypothesis that was not borne out by the data.
This is supposed to be science, right? Not “Freakonomics.”
Ana
That’s odd about mentioning her as someone who collected the data.
My impression is that his intentions are entirely good – he’s spreading the word on how to make it in science, and doesn’t realise how problematic these research practices are. His next post was on how to storify your findings, how to make them appealing. I almost feel sorry for him for ending up in the line of fire.
Of course the sting for me is that he is also describing what goes on in cognitive neuroscience all the time. It’s sensible to look at the data as you go along, but then you’re biasing yourself, and we have no good set of standards for reporting these (perfectly valid) exploratory behaviours.
As for his post, the biggest problem is the general attitude towards nulls of course. Not just the ‘failed’ experiment, but the other tests and variables he doesn’t report on. If it wasn’t significant, it doesn’t matter, it shouldn’t be mentioned. Just change that and the entire approach would change.