This text came to life with the purpose of kicking off a discussion in Kia Nobre’s Brain and Cognition lab, about best research practices. For a few years now, I’ve been following the debate on the replication crisis in psychology from the sidelines. Here, I will present what I do to integrate that knowledge into how I read cognitive neuroscience papers and handle my MEG data. The text ends with a list of suggestions for my future self, which I hope others will find useful as well.
There are some important differences between neuroscientific and behavioural approaches to experimentation. Neuroimaging experiments are significantly more expensive than behavioural studies. Neuroscience is also incredibly data-driven: a lot of the time we just like to measure and see what happens, and theory will follow. Finally, neuroimaging datasets are large enough that there is always that extra little thing to look at. Taken together, this translates into small sample sizes, weak hypotheses, and many opportunities for fishing for p-values. Uh-oh. On the other hand, we often want to draw conclusions about the thing we measured (the brain), rather than using it for defining some complex latent structure (like personality). We’re more relaxed when it comes to handling data. And we can plot them in an intuitive way. This means we can communicate about patterns and meanings of our data with relative ease. These are important advantages.
The roots of the replication crisis
The replication crisis is largely a consequence of selective publication, which itself is largely a consequence of the type of statistics that are most prevalent in the behavioural and life sciences. When we look at the reliability of an experimental effect using these methods, we can conclude something about the likelihood of its presence in the population, but we can’t say much about the likelihood of its absence. We then publish only those findings that we can make some firm conclusions about – where an effect is present. On the face of it this seems logical, but it comes with a price.
Part of the price comes from not finding out which ideas don’t work. Removing these from the pool of knowledge is like covertly removing those subjects from your sample who don’t show the desired effect. And part of the price comes from conflating theoretical and statistical inference, where a p-value of a certain magnitude (minitude!) becomes the cut-off for truth.
That there is a statistical consensus on what constitutes a truth is not the problem; that this cut-off varies across disciplines is not the problem; the problem is that p-values are highly volatile. For the sample sizes we normally have, for the types of effects we normally investigate, p-values come with very large errors of measurement. So quite apart from the measurement noise that we normally put up with, we’re also confronted with conclusion noise. Add to that an unfortunate trend of thirsting for counterintuitive single-study discoveries, and what we have in the literature becomes a distorted version of the truth.
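That volatility is easy to demonstrate with a quick simulation. Here is a minimal sketch in Python; the sample size of 20 and effect size of 0.5 are illustrative assumptions, not numbers from any particular study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate the same modestly powered experiment many times:
# n = 20 subjects, a true effect of d = 0.5 (illustrative values).
n_subjects, effect_size, n_experiments = 20, 0.5, 1000
p_values = []
for _ in range(n_experiments):
    sample = rng.normal(loc=effect_size, scale=1.0, size=n_subjects)
    p_values.append(stats.ttest_1samp(sample, 0.0).pvalue)
p_values = np.array(p_values)

# The exact same underlying effect yields p-values spread over orders
# of magnitude -- "conclusion noise" on top of measurement noise.
print(f"median p = {np.median(p_values):.3f}")
print(f"5th-95th percentile: {np.percentile(p_values, [5, 95])}")
print(f"fraction significant at 0.05: {np.mean(p_values < 0.05):.2f}")
```

Identical experiments on an identical true effect land on opposite sides of the significance threshold nearly half the time at this sample size.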
The only thing that really needs to change here is that experimental ‘failures’ need to become accessible to everyone.

The multiple comparisons problem, and when it’s not
Part of the dialogue about the replication crisis revolves around the questionable research practice of selectively reporting a few significant results among dozens of tests. The standard suggestion is to lower the p-threshold for significance, depending on the number of tests carried out. While a more rigorous p-value isn’t a bad thing, I have problems with this logic, and I think the only real solution is to report on all the tests. In fact, I would like to argue that multiple comparisons are often less of an issue than we assume.
To begin with, this multiple comparisons logic conflates problems within science with problems within a single research report. If we wanted to accept a more rigorous threshold for significance, we might as well lower it as a function of all statistical tests ever carried out. But we don’t do that: we accept 0.05 as a reasonable threshold within our discipline.
The likelihood of my next finding being true does not depend on the number of tests I personally carried out. It’s not Schrödinger’s p-value, collapsing into a true effect or a false positive by the act of me taking a look. The more tests we carry out, the more false positive findings we – humanity – will accumulate, but their percentage won’t change.
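A small simulation makes this concrete (the `false_positive_rate` helper and all numbers are illustrative): under the null, the *rate* of false positives sticks to alpha no matter how many tests one person runs; only the absolute count grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(n_tests, n_subjects=20):
    """Run n_tests independent null experiments (no true effect)
    and return the fraction that reach p < 0.05."""
    data = rng.normal(size=(n_tests, n_subjects))
    p = stats.ttest_1samp(data, 0.0, axis=1).pvalue
    return np.mean(p < 0.05)

# Whether one researcher runs 10 tests or 10,000, the rate of
# false positives hovers around alpha = 0.05.
for n in (10, 100, 10_000):
    print(n, false_positive_rate(n))
```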
The second reason why multiple comparisons aren’t such a cause for fear is that we usually test mutually dependent hypotheses. Confirming multiple mutually dependent hypotheses increases, rather than decreases, their likelihood of being true. What really matters is that the next test is guided by a combination of earlier scientific findings and educated guesses, and not by randomly throwing variables together.
Using multiple comparisons to our advantage
One situation in which we test mutually dependent hypotheses is when we are faced with replicating an effect. Finding the same effect twice makes the conclusion more, not less, likely – the opposite of what the Bonferroni logic would falsely lead us to assume.
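The arithmetic behind this can be sketched with a toy Bayesian calculation. All numbers here – the prior of 0.2, the power of 0.6 – are made-up illustrations, not estimates from any study:

```python
# Toy calculation: suppose a prior probability of 0.2 that a hypothesis
# is true, power of 0.6, and alpha = 0.05 (all illustrative numbers).
prior, power, alpha = 0.2, 0.6, 0.05

def posterior_after_significant(prior, power, alpha, n_replications):
    """P(effect is real | n independent significant results)."""
    true_path = prior * power ** n_replications
    false_path = (1 - prior) * alpha ** n_replications
    return true_path / (true_path + false_path)

print(posterior_after_significant(prior, power, alpha, 1))  # one finding
print(posterior_after_significant(prior, power, alpha, 2))  # replicated
```

One significant result takes the hypothesis from 20% to about 75% credibility; an independent replication takes it above 97%. Each additional confirmation strengthens, rather than penalises, the conclusion.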
Replications are important. Given how unreliable statistics are for getting at the truth, replications are crucial. Because we normally have small samples, someone else needs to find our effect even for ourselves to believe it. But it’s unrealistic to expect that we’ll use huge chunks of funding and months’ worth of time on our very short contracts, only to replicate someone else’s findings. It just won’t happen. A good way around this (and cognitive science already does this quite often) is to build replications into our own designs – and then build in additional experimental manipulations.
For example, in the world of attention research, we know what a replication would consist of: when the subject pays attention to some aspect of the sensorium, neural activity related to it should selectively increase, relative to the situation when they’re ignoring it. It works both ways: if we check for that effect, we’ve added a replication to the world of science, and we’ve used this finding as a sanity check for our own data. Manipulations in control experiments often encapsulate the features of built-in replications that we should strive for. We can then go on to probe some more specific thing about how attention works in the brain, something new and yet untested. We should also try to replicate newer, less well established effects, either of our own or those found by other researchers. So that’s my first piece of advice: replicate, and then go on. Don’t worry that you’re carrying out an additional test that wasn’t crucial to your hypothesis. It won’t hurt your findings at all.
The other place where hypotheses are mutually dependent, and therefore additional testing does not present a problem, is when the logic of how the brain works dictates that certain measures of neural activity should be mutually interconnected. This leads me to my second suggestion, something I call mid-way hypotheses.
It’s an excellent idea to start out with crystal-clear hypotheses about which type of neural activity should underlie an experimental effect, when exactly it should emerge and in what part of the brain. It’s also unlikely that most of us will be able to predict that reasonably well. And even for those who can make clear predictions, the analysis phase is a delicate dance between testing their prior beliefs and allowing the data to speak for itself.
I am usually able to come up with some very general, broadly stated hypotheses about large-scale neural activity and possible behavioural correlates. But there is so much information in my data, a whole cascade of neural events, that once I’ve taken a look at these initial effects, I like to stop (mid way) and reassess. I plot with reckless abandon here. Similar statistics and the same average plots can emerge from very different underlying patterns, so I don’t really know my data until I’ve looked at the effect in every individual subject.
If the effect is clear – great. Happened to me once. If the effect is noisy, or if I don’t know how to make sense of it, I ask myself: if this measure of neural activity behaves in this way, what other measure of neural activity should behave in a related way? For example, imagine you got a significant alpha band effect that (for whatever reason) you don’t entirely trust. Is this type of effect, when present, usually phase-coupled to gamma power? Test that! You didn’t predict it initially, but if you test and find that gamma modulation, your alpha finding will be more – not less – likely to be true. In my experience, by the time I’m in this phase of the data analysis, I can come up with pretty nice, mutually exclusive hypotheses about the type of neural activity underlying my effects. I can never do this well at the outset, there are just too many possibilities, so this is my way out.
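As a sketch of what such a follow-up test might look like, here is a minimal phase-amplitude coupling estimate (the mean-vector-length method). The frequency bands, sampling rate, and synthetic signals are all illustrative assumptions, not a prescription:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def bandpass(x, low, high, fs, order=4):
    sos = butter(order, [low, high], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def phase_amplitude_coupling(signal, fs, phase_band=(8, 12), amp_band=(40, 80)):
    """Mean-vector-length estimate of coupling between the phase of a
    slow band (e.g. alpha) and the amplitude envelope of a fast band
    (e.g. gamma). Values near 0 mean no coupling."""
    phase = np.angle(hilbert(bandpass(signal, *phase_band, fs)))
    amp = np.abs(hilbert(bandpass(signal, *amp_band, fs)))
    return np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)

# Synthetic demo: gamma bursts locked to a fixed alpha phase.
fs = 1000
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(2)
alpha = np.sin(2 * np.pi * 10 * t)
gamma = (1 + np.cos(2 * np.pi * 10 * t + np.pi)) * np.sin(2 * np.pi * 60 * t)
coupled = alpha + 0.3 * gamma + 0.1 * rng.standard_normal(t.size)
uncoupled = alpha + 0.3 * np.sin(2 * np.pi * 60 * t) + 0.1 * rng.standard_normal(t.size)

print(phase_amplitude_coupling(coupled, fs))    # noticeably above zero
print(phase_amplitude_coupling(uncoupled, fs))  # near zero
```

In practice you would of course run this on real sensor data and back it with an appropriate permutation test, but the logic is the same: an unpredicted gamma modulation that lines up with your alpha effect corroborates it.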
Recalibrate! (or not…)
When I’m really unsure about my data, I do one additional thing. This is not a suggestion, it’s more of a confession. It started when I found two effects: one I could make sense of and was excited about, and one I couldn’t understand at all. I couldn’t do much about the low number of subjects I had, but I could check whether the significance I found was just a fluke, a random consequence of all those semi-arbitrary parameters I set before running the statistical test. I tried to make the effects go away.
How? By moving the baseline a little bit, then checking the stats again. By adding or removing small numbers of sensors to the region of interest. By re-calculating my spectral data using slightly different techniques. By trying to find the effect again using a whole-brain analysis. By running the cluster test many times and checking the dispersion of the outcomes. By – please don’t do this – removing subjects one by one to see if my p-value might suddenly increase when one of them goes away. I ended up concluding that one of my effects was way too volatile to be anything other than a false positive, while the other was robust. I bet you already guessed it wasn’t the one I was hoping for.
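The subject-removal check, for instance, can be sketched as a leave-one-out loop. With the caveat above firmly attached: this is a diagnostic for your own peace of mind, not a way to chase significance, and the per-subject numbers below are invented for illustration:

```python
import numpy as np
from scipy import stats

def leave_one_out_pvalues(subject_effects):
    """Re-run the group test once per subject with that subject held
    out. A robust effect keeps similar p-values across all drops; a
    fragile one swings wildly when a single subject goes away."""
    effects = np.asarray(subject_effects)
    return np.array([
        stats.ttest_1samp(np.delete(effects, i), 0.0).pvalue
        for i in range(len(effects))
    ])

# Hypothetical per-subject effect sizes (illustrative numbers only).
robust = np.array([0.8, 1.1, 0.9, 1.3, 0.7, 1.0, 0.6, 1.2, 0.9, 1.1])
fragile = np.array([0.1, -0.2, 0.0, 3.5, -0.1, 0.2, -0.3, 0.1, 0.0, 0.2])

print(leave_one_out_pvalues(robust))   # stays tiny across all drops
print(leave_one_out_pvalues(fragile))  # driven by a single subject
```

A tight cluster of leave-one-out p-values is reassuring; a wide spread tells you the conclusion hinges on one person.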
If you hear a strange slapping sound right about now, it’s the pre-registration people, facepalming. They’re probably right to. But the fact of the matter is, we don’t have a consensus about protocols on how to clean and then calculate our data prior to using statistics. This leaves a lot of room for random testing, for better or worse. I think it can work in our favour, if we primarily use statistics to convince ourselves of our results.
Telling a story
Sometimes, there are good reasons for not publishing a result. Failures of technical equipment, artifacts, excessive measurement noise, weak experimental manipulations, belatedly discovered confounds. A p-value of 0.18 should never be a reason for non-publication. It should also never, ever be a reason for a weakly supervised, poorly written, poorly structured manuscript.
But when we do publish, we need to start presenting data in a way that’s comparable across experiments. Far too often, neuroscience papers contain plots of only t-maps, model fits, contrasts between conditions or hemispheres, or some complex derived index – but not the actual raw data for each individual condition. This means that the information others get to see is filtered through wobbly p-values. Plot some measure of central tendency, and some measure of dispersion. This will allow your readers to think more deeply about the nature of the effects you found.
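A minimal sketch of such a plot, with hypothetical condition names and simulated values (matplotlib assumed):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripting
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Hypothetical per-subject values for two raw conditions
# (illustrative numbers only).
attended = rng.normal(1.0, 0.4, size=20)
ignored = rng.normal(1.4, 0.4, size=20)

fig, ax = plt.subplots()
for i, (label, values) in enumerate([("attended", attended), ("ignored", ignored)]):
    # Every individual subject...
    ax.scatter(np.full(values.size, i), values, alpha=0.4, color="grey")
    # ...plus a measure of central tendency and one of dispersion.
    ax.errorbar(i, values.mean(), yerr=values.std(ddof=1), fmt="o",
                color="black", capsize=4)
ax.set_xticks([0, 1])
ax.set_xticklabels(["attended", "ignored"])
ax.set_ylabel("raw alpha power (a.u.)")
fig.savefig("raw_conditions.png")
```

Raw condition values per subject, with the mean and standard deviation overlaid, travel across papers in a way that a lone t-map never will.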
It would be even better, of course, if you made the raw data and analysis scripts publicly available, but my own attempts to do this were tinged with technical frustration. We need a lot more fast, funded server space for this to become the norm.
And know that, when you tell your story, your p-value is only a small part of it. Maybe you had an elegant design – this is worth sharing. Maybe your hypothesis makes perfect sense in light of the current literature, but it didn’t end up working out – this is also worth sharing. Maybe your failed experiment made you reconsider your beliefs about the field. These thoughts are also worth sharing. Please tell your story, even for p>0.05.