A couple of years ago, Brian Nosek and colleagues re-ran 100 psychological studies published in top journals during 2008 (source). The question was: will the results replicate?
It turns out that’s a nontrivial question. Sure, you might count any study where the p-value fell above 0.05 as a study that failed to find the effect, but that’s a very crude measure. It’s quite possible to have two studies with very similar results, where one falls above and one below the 0.05 line.
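To get a feel for how easily that happens, here is a small simulation of my own (not from the paper): when the true effect is modest and power hovers around 50%, two identically run studies will disagree about significance roughly half the time. The effect size, sample size, and seed below are arbitrary illustrative choices.

```python
# Toy simulation (my illustration, not from the Nosek et al. paper):
# two identically designed studies of the same true effect often land on
# opposite sides of the 0.05 line when power is only ~50%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, alpha = 0.4, 45, 0.05  # same true effect (in SD units) and sample size in both studies

def p_value():
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    return stats.ttest_ind(treatment, control)[1]

pairs = [(p_value(), p_value()) for _ in range(2000)]
disagree = sum((p1 < alpha) != (p2 < alpha) for p1, p2 in pairs)
print(f"pairs of studies disagreeing about p < {alpha}: {disagree / len(pairs):.0%}")
```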
What came out of thinking about this problem was the realisation that definitions of reproducibility vary. Nosek and colleagues acknowledged this too, and additionally compared the effect sizes obtained in the original studies with those in the replications. An effect size tells us how strongly the experimental manipulation of interest was related to the variation in results, and some people argue that this is a more sensible measure of experimental success. After all, if you discovered a drug that is quite certain (p<0.001) to reduce patients’ symptoms by 1%, you’d probably bin it despite the statistical significance.
Effect sizes have a clear relationship with sample sizes – the smaller the true effect (in the population), the larger the sample size needed to detect it. This is why sample size is so often discussed in terms of ‘power’ – the probability that a study will detect an effect if it is really there. If a very small sample detects a very large effect, eyebrows sometimes get raised and people wait for a replication before believing it.
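As a rough, hand-rolled illustration of that relationship (my own sketch, not material from the talk), the simulation below estimates power by brute force for a simple two-group comparison. The effect is a difference in means expressed in standard-deviation units (Cohen’s d), and the particular effect and sample sizes are arbitrary examples.

```python
# Minimal sketch: estimate power by simulation for a two-group design.
# Power = probability that a study of a given size detects a given true effect.
import numpy as np
from scipy import stats

def simulated_power(effect_size, n_per_group, alpha=0.05, n_sims=5000, seed=0):
    """Fraction of simulated studies reaching p < alpha (two-sample t-test)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect_size, 1.0, n_per_group)  # effect in SD units (Cohen's d)
        if stats.ttest_ind(treatment, control)[1] < alpha:
            hits += 1
    return hits / n_sims

# The smaller the true effect, the larger the sample needed for the same power.
for d in (0.2, 0.5, 0.8):            # conventionally 'small', 'medium', 'large'
    for n in (20, 50, 200):
        print(f"d = {d}, n = {n} per group: power ≈ {simulated_power(d, n):.2f}")
```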
So what if you power up your replication studies? Richard Morey, in his talk ‘The importance of power for cumulative science’, describes this as a task of comparing two pictures. Is it the same person in both pictures? I can’t do this task, you say; both pictures are very blurry. Very well, says the experimenter, here is one of the pictures in high resolution. Does that help? But, you say, the other picture is still very blurry, so how can I compare them?
Richard turns the question about individual studies into a question about accumulating evidence across multiple studies, and argues that we should rethink our approach to power analyses. If we want cumulative science, we need high-resolution pictures of each bit of research.
The discussions following the publication of Nosek’s paper exposed this problem. You want to say whether an effect is real. You have one study, then you run a replication. The additional study should increase your confidence about whether the effect exists – after all, you’ve just doubled your number of data points. But if both studies are underpowered, you can’t learn much at all, because each of them is just as unreliable as the other.
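To make that concrete (again my own back-of-the-envelope sketch, not the talk’s), the simulation below pools an underpowered ‘original’ with an equally underpowered ‘replication’ and compares the spread of the pooled effect estimates with that from a single larger study. The true effect, sample sizes, and seed are arbitrary assumptions.

```python
# Sketch: two small studies pooled together still give a very imprecise
# estimate of the effect, compared with one adequately sized study.
import numpy as np

rng = np.random.default_rng(1)
true_d, n_small, n_large = 0.3, 20, 200   # arbitrary illustrative values

def estimated_effect(n):
    """Estimated effect (difference in group means, unit-SD groups) from one study."""
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    return treatment.mean() - control.mean()

# Average an 'original' and a same-sized 'replication', many times over.
pooled = np.array([(estimated_effect(n_small) + estimated_effect(n_small)) / 2
                   for _ in range(5000)])
single_large = np.array([estimated_effect(n_large) for _ in range(5000)])

print("two small studies pooled, 95% of estimates fall in:",
      np.round(np.percentile(pooled, [2.5, 97.5]), 2))
print("one large study,          95% of estimates fall in:",
      np.round(np.percentile(single_large, [2.5, 97.5]), 2))
```

With 20 participants per group per study, even the pooled estimate ranges from roughly no effect to an effect more than twice the true size; the single large study pins it down far more tightly.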
And very often, as soon as we’ve established an effect using an experiment, we’ll want to know whether subtle manipulations can modulate it. This gives us a nuanced view of what works and what doesn’t, and it helps drive theory forward. However, if power is low, it’s hard to say why one study (where some slight difference was introduced) gave a different result from another – maybe it was just statistical noise? We end up speculating, and if at least one of the studies wasn’t adequately powered, we run the risk of drawing conclusions from nothing but noise.
So how about this? Instead of powering each study to detect an effect in isolation, power a study so that a similarly sized follow-up has a high probability of finding a meaningful difference from your result, if such a difference exists.
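Here is one way that idea could be cashed out numerically – my own sketch under simplifying assumptions (unit-variance groups, a chosen ‘meaningful’ difference in effect size, and a normal approximation for the test of the difference), not a procedure taken from the talk. The point it illustrates is that detecting a difference between two studies’ effects takes a much larger sample than detecting the original effect itself.

```python
# Sketch: power to detect a *difference* between the effects found by two
# same-sized studies (e.g. original d = 0.5 vs. modulated d = 0.2).
import numpy as np
from scipy import stats

def power_for_difference(d_original, d_followup, n_per_group,
                         alpha=0.05, n_sims=5000, seed=2):
    """Simulated power to detect that two same-sized, unit-variance studies
    found different effects; the difference-in-differences has SE sqrt(4/n)."""
    rng = np.random.default_rng(seed)
    se = np.sqrt(4.0 / n_per_group)
    hits = 0
    for _ in range(n_sims):
        eff1 = (rng.normal(d_original, 1.0, n_per_group).mean()
                - rng.normal(0.0, 1.0, n_per_group).mean())
        eff2 = (rng.normal(d_followup, 1.0, n_per_group).mean()
                - rng.normal(0.0, 1.0, n_per_group).mean())
        if 2 * stats.norm.sf(abs(eff1 - eff2) / se) < alpha:
            hits += 1
    return hits / n_sims

# ~64 per group already gives ~80% power to detect d = 0.5 on its own,
# but detecting the 0.5 vs 0.2 modulation across two studies needs far more.
for n in (64, 150, 350):
    print(f"n = {n} per group: power for the difference ≈ "
          f"{power_for_difference(0.5, 0.2, n):.2f}")
```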
Listen to Richard’s talk to get a closer view of power, effects, and cumulative science:
(This blog post is a preview of the themes in Richard Morey’s talk at the Oxford Reproducibility School, held on September 27, 2017. I occasionally use my own words to describe the contents, but the presented ideas do not deviate from the talk.)