A crisis of (p) values

I remember a sense of defeat when I started learning about research methodology. I enrolled in psychology brimming with questions, but instead of answers I got a statistics course that seemed to be just caveat after caveat after caveat about what we’re allowed to conclude from data. You should have a representative sample, but you have to know what makes your sample representative for your research question, which you want to answer using your sample. You figure out an elegant design, run an ANOVA, conclude something about the factors governing your effect. But there might be some additional factor you didn’t think of that, if included in the design, would lead to a different conclusion about how your original factors behave. You run a test that gives you a number which is supposed to tell you whether your effect is true or false, but that number is trustworthy to different degrees depending on whether the effect is true or false, which is exactly what you are trying to find out from the number in the first place. Could we ever conclude anything with certainty?

Things started looking a lot less bleak once I got to learn about actual research. There was some initial set of facts to go on, and these could be carefully probed further. It was still clear that we can never know if what we’re uncovering is the whole truth, but we can decide something about the role of new information in relation to old information, or in relation to a theory, or to a research question. There seemed to be a series of short-cuts out of uncertainty, a number of leaps of faith over the many caveats, and as long as we agree that this is what they are, we can ask questions and get reasonable answers.

Or can we? Psychology has been going through a similar crisis lately, one of looking for reliable anchors when deciding on trustworthy methods and results. Some of the statistical caveats were perhaps not as appreciated as they should have been. Social psychology, in particular, has been taking some heavy blows, and sometimes I wonder whether part of the reason is that it has been enthusiastically relying on methods that work well for cognitive psychology, which has a more natural fit to laboratory setups. The replicability problem in psychology has to do with scientific reasoning in general, though, often with the step from analysing data to describing the results. Some problematic findings get caught when the step from the results to the theoretical conclusions starts looking fishy, but researchers have also started to look at results that don’t immediately raise red flags.

After a growing body of research showed that psychology, given the usual experimental success criterion of obtaining a p-value below 0.05, has lots of experiments that fail to meet that criterion once redone, Brian Nosek and colleagues set out to see how serious the problem was. They attempted to replicate one hundred psychology experiments published in high-profile journals, and the outcome wasn’t exactly stellar. If p-values are the criterion of success, you could take any psychology finding, flip a coin to decide whether it’s reliable or not, and you still wouldn’t be guessing well, because your coin is fair whereas the actual replication rate was worse than chance. However, some newer analyses of these same data led to different conclusions, reopening the very questions that the reproducibility project was trying to close. First, Alexander Etz and Joachim Vandekerckhove published a Bayesian re-analysis of a portion of the same results, concluding that about 30% of the replication studies provided inconclusive evidence for either the null or the alternative hypothesis, and that none of them provided strong evidence for the null. (And that a lot of this weakness of evidence has to do with sample size.) Then Daniel Gilbert and colleagues re-assessed the replications by asking whether the new results fall within the confidence intervals of the old results, and concluded that most studies do, in fact, replicate.
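
To get a feel for how much sample size alone can do here, consider a toy simulation (my own illustration, not an analysis from any of the papers above): suppose every original effect is perfectly real but modest, say a standardised difference of d = 0.4, and every replication uses 30 participants per group. Even in this best-case world, most replications miss the p < 0.05 criterion simply because the studies are underpowered.

```python
import numpy as np
from scipy import stats

# Toy illustration: every effect is real (d = 0.4), yet with n = 30 per group
# most "replications" still miss the p < 0.05 bar, purely through low power.
rng = np.random.default_rng(42)
d, n, n_sims = 0.4, 30, 10_000

successes = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)        # true effect of 0.4 standard deviations
    _, p = stats.ttest_ind(treatment, control)
    successes += p < 0.05

print(f"Share of 'successful' replications: {successes / n_sims:.2f}")  # roughly a third
```

None of this settles who is right about the hundred studies, but it does show why the sample size caveat keeps coming back.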

Rather unfortunately, the Gilbert et al. paper relied on a wrong definition of a confidence interval (these are in fact routinely misinterpreted), which led to such an outcry in the statistics community that I was reassured that at least one social psychology finding replicates quite well: we tend to overestimate how widely known the things we know are (and to underestimate how widely known the things we don’t know are). Even more unfortunately, some of their descriptions of the original Nosek et al. paper seem downright deceitful. Still, the call to assess evidence based on whether new data fall into some range of uncertainty around old data is not a new idea. So we’re back to square one, trying to decide whether scientific methods can reliably tell us something about the truth regarding replicability.
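
For the record, the correct reading is about the procedure, not about any single interval: if you reran the experiment many times and computed a 95% confidence interval each time, about 95% of those intervals would contain the true value. It says nothing direct about the probability that this particular interval has caught it, or about where a replication’s estimate should land. A small simulation (my own sketch, with made-up numbers) shows the long-run coverage property in action.

```python
import numpy as np
from scipy import stats

# What a 95% CI actually guarantees: long-run coverage of the procedure,
# not a 95% chance that any particular interval is the right one.
rng = np.random.default_rng(1)
true_mean, n, n_experiments = 0.5, 30, 10_000

covered = 0
for _ in range(n_experiments):
    sample = rng.normal(true_mean, 1.0, n)
    low, high = stats.t.interval(0.95, df=n - 1,
                                 loc=sample.mean(), scale=stats.sem(sample))
    covered += low <= true_mean <= high

print(f"Intervals containing the true mean: {covered / n_experiments:.3f}")  # about 0.95
```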

On top of all that commotion, the American Statistical Association just issued a statement reminding the scientific world that “p-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone”, that “scientific conclusions (…) should not be based only on whether a p-value passes a specific threshold”, and that “by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis”. A great way to assess evidence, right? The problem of how to extract the truth from data remains uncomfortably open.
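
The first point deserves to be felt rather than just quoted. Here is a quick simulation (my own toy numbers, assuming 30 participants per group, effects of d = 0.4 when they exist, and that only 20% of tested hypotheses are actually true): even among findings that clear p < 0.05, a sizeable share come from null effects, and nothing in the p-value itself tells you which is which.

```python
import numpy as np
from scipy import stats

# Toy numbers: 20% of tested effects are real (d = 0.4), the rest are null.
# Among results with p < 0.05, the p-value alone doesn't say which are which.
rng = np.random.default_rng(7)
n, n_studies, base_rate, d = 30, 10_000, 0.2, 0.4

real_hits, null_hits = 0, 0
for real in rng.random(n_studies) < base_rate:
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d if real else 0.0, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        real_hits += real
        null_hits += not real

print(f"Significant results coming from real effects: {real_hits / (real_hits + null_hits):.2f}")
```

With these made-up numbers, only around two thirds of the “significant” results reflect a real effect – which is precisely the gap between “p < 0.05” and “probably true” that the statement is pointing at.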

The problem of extracting meaning from data is partly a problem of theory, but sometimes theory also comes to the rescue. As Andrew Gelman writes (emphasis mine): “I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes were determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior (…). If the replication rate were high, then that would be cause to worry, because it would imply that much of what we know about the world would be wrong.” But let’s imagine that all our empirical work were to reliably and conclusively show that political attitudes or moral decisions are rife with irrational processes. Would this mean that attitudes are largely irrational? Not necessarily – it might also mean that we created experiments that specifically probe these processes, not that the (self-evidently) rational aspects of judgment and reasoning are absent from the grand picture. The fact that these grand conclusions about irrationality raised the eyebrows that then drew attention to the (crappy) results simply shows that our views on human nature, and our research instincts, do play a role in how much we trust our findings – p-values notwithstanding.

The problem of extracting meaning from data is also a problem of consensus. We’ll never know for sure if I see what you see when we both say we see blue, but if we both adjust the colour of a dot on a computer screen to the same wavelength when asked to make it blue, we can call that good enough. How do we arrive at a good enough way to assess evidence? The main advantage of p-values is simply that they are familiar. Ever looked at a p<0.05 finding in a paper and thought “yeah, right…”? It happens all the time, doesn’t it? We’ve built a feeling for the reliability of research outcomes that is not only p-value based, even if the publication and promotion system forces us to put p-values and broad conclusions ahead of our research instincts. But some instincts are still there, and they evolved around these p-values. A new instinct on the research scene, a rather good one, is people saying “interesting finding, let’s see if it replicates”. But we do need to move away from p-values, so what are the alternatives?

Some of science is moving in the direction of being predictive rather than explanatory, where the goal is to apply machine learning algorithms to try to predict future outcomes by letting computers trawl through past data and look for connections. For those of us who instead like to understand the functional mechanisms underlying these connections, it looks like Bayesian statistics might be the way to go. Their main advantage is one of epistemology: the underlying assumptions about statistical inference are closer to our own theoretical assumptions. Still, a few weeks ago when I was banging my head against the wall trying to figure out the specifics of how a Gibbs sampler works from a paper that promised to be a “gentle introduction”, I realized that this might take a while.
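
For anyone stuck at the same wall: the core idea is less forbidding than it first looks. A Gibbs sampler draws each unknown in turn from its conditional distribution given the current values of all the others, and the sequence of draws eventually behaves like samples from the joint distribution you actually care about. Here is a bare-bones sketch (a toy example of my own, not anything from that paper) for a bivariate standard normal with correlation 0.8, where both conditionals are known in closed form.

```python
import numpy as np

# Toy Gibbs sampler for a bivariate standard normal with correlation rho.
# Each step draws one coordinate from its conditional given the other:
# x | y is Normal with mean rho * y and variance 1 - rho^2, and likewise for y | x.
rng = np.random.default_rng(0)
rho, n_draws, burn_in = 0.8, 20_000, 1_000

x, y = 0.0, 0.0
draws = np.empty((n_draws, 2))
for i in range(n_draws):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    draws[i] = x, y

samples = draws[burn_in:]                    # drop the early, not-yet-converged draws
print("Sample correlation:", np.corrcoef(samples.T)[0, 1])   # close to 0.8
```

Real models just swap messier conditionals into the same loop.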

We need to step into unfamiliar territory, without a good feeling for the inevitable quirks that new methods will have, and it won’t happen overnight. But it is happening already, and even though it’s uncomfortable to be moving through a new space where our instincts need to be readjusted, at least we know that we are moving in the direction of something better. Statistical methods are not, and never will be, a direct link to the truth. On their own, they will also never be adequate descriptors of any data. But they are a neat thing to have – the final touch, the cherry on the cake of good research, and improving them can only be a good thing.

Other than that, increase your sample sizes. Come on, just do it.
