Scientific methods and their associated research questions form a variety of fits. Take the reproducibility crisis in behavioural psychology. In social psychology the crisis is quite prominent, with textbook findings being actively questioned. In cognitive psychology (inflated effect sizes notwithstanding) there is less of a feeling of crisis, and yet the two fields use similar experimental methods. The same methods, therefore, when applied to different questions, might lead to outcomes that are not equally reliable. One might say that social psychology is about subtler phenomena and is therefore more difficult to capture. What often goes unmentioned, however, is that the psychology of individual differences, with its equally subtle questions about personality, does not have a reproducibility problem. The critical difference, I believe, is that it carved out its own research methods early on, working out a good fit with the questions it asks.
In the same vein, we should wonder whether all the lessons from behavioural research apply to cognitive neuroscience – whether measures to increase research robustness are a one-size-fits-all package – or whether we might benefit from carving out a set of practices that fit the kind of data we deal with. In my talk, Dilemmas of an early career researcher, I focused on the reproducibility problems I encountered in my own research.
One of the obvious points of departure is how we think about multiple comparisons. In the behavioural literature, we hear about the garden of forking paths – going down numerous data analysis routes but reporting only one outcome while not mentioning the remaining tests. We hear about correcting for all the comparisons involved. In cognitive neuroscience the focus is shifted from counting up the total number of tests performed, towards assessing the degree of independence between the tests. If many adjacent voxels or time points show significance, we will trust this result more (not less). But if the significant results are scattered randomly throughout the brain, or appear at disjointed moments, we will doubt them. So when faced with the notion of performing multiple statistical tests, as cognitive neuroscientists we will tend to think about quantifying the degree of dependence between their associated hypotheses, and might be uneasy about treating them as mutually independent by default.
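To make the “adjacent versus scattered” intuition concrete, here is a toy illustration in Python with synthetic data (not a proper cluster-based permutation test, and the effect, thresholds and dimensions are all made up): run a t-test at every time point and look at how the significant points group into contiguous runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_times = 40, 200
data = rng.normal(size=(n_trials, n_times))
data[:, 80:110] += 0.6                      # hypothetical sustained effect

# one t-test per time point, across trials
pvals = stats.ttest_1samp(data, popmean=0, axis=0).pvalue
sig = pvals < 0.05

# find the longest contiguous run of significant time points
longest, run = 0, 0
for s in sig:
    run = run + 1 if s else 0
    longest = max(longest, run)

# a long run (here driven by the injected effect) is more believable than the
# same number of isolated points scattered across time
print(f"{sig.sum()} significant time points, longest contiguous run: {longest}")
```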
But the main point of departure is that heavy data preprocessing happens before any statistical analyses are run. Preprocessing entails various semi-arbitrary decisions, such as choosing filter cut-offs. Then there are individual judgement calls about which trials are noisy enough to discard; the data might be further smoothed; they might be decomposed into frequency bands using one of several different methods; a baseline might be applied using a given time window (or a different time window), or sometimes not at all. None of these steps is driven by the cognitive question itself, and a variety of parameter ranges might work just fine, without solid theoretical reasons for one set of numbers over another. Yet if someone else took our raw data and tried to answer our research question, they might well end up making a different, equally valid set of decisions. We expect our data to yield the same answers to our research questions regardless of these preprocessing steps. What I have found when fiddling with parameters is that this is not always a valid assumption.
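To show how many of these knobs there are, here is a minimal Python sketch of a generic preprocessing function on synthetic data. The function name, the filter cut-offs, the rejection threshold and the baseline windows are all invented for illustration; a second analyst could defensibly choose different ones.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess(trials, sfreq, band=(1.0, 40.0), baseline=(-0.2, 0.0),
               reject_z=4.0, t0=0.5):
    """trials: (n_trials, n_samples); t0 = stimulus onset in seconds."""
    # band-pass filter: the cut-offs are a judgement call
    sos = butter(4, band, btype="bandpass", fs=sfreq, output="sos")
    filtered = sosfiltfilt(sos, trials, axis=1)
    # reject trials whose peak amplitude is an outlier: so is the threshold
    peaks = np.abs(filtered).max(axis=1)
    filtered = filtered[np.abs(peaks - peaks.mean()) < reject_z * peaks.std()]
    # baseline-correct using the chosen window: so is the window
    i0, i1 = (int((t0 + t) * sfreq) for t in baseline)
    return filtered - filtered[:, i0:i1].mean(axis=1, keepdims=True)

sfreq = 250
raw = np.random.default_rng(1).normal(size=(60, 500))
pipeline_a = preprocess(raw, sfreq, band=(1, 40), baseline=(-0.2, 0.0))
pipeline_b = preprocess(raw, sfreq, band=(2, 30), baseline=(-0.3, -0.1))
# both are defensible; the question is whether downstream statistics agree
```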
What to do when that happens? The usual answer in the reproducibility literature is to constrain all the steps in advance, documenting every decision about how the data are handled. And indeed, this would allow someone else to do exactly what I have done and get the same result once they run the statistical test. An alternative is to select parameters based on how they affect data quality (e.g. signal-to-noise ratio) in a part of the data on which we will not be running the statistical tests. But to my mind both approaches miss the point, which is that I am assuming my result is driven not by baselines and filtering, but by the cognitive phenomenon that I am manipulating. If small, reasonable changes in preprocessing decisions change the cognitive answer I can give, then, to my mind, my result fails to be reproducible in a very basic sense. The two approaches outlined here simply prevent me from finding out whether this is the case.
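For concreteness, here is a toy Python sketch of the second approach: tune a parameter by data quality on a held-out portion, then run the statistics only on the untouched portion. The split, the SNR definition and the candidate baseline windows are all illustrative.

```python
import numpy as np
from scipy import stats

sfreq, t0 = 250, 0.5
rng = np.random.default_rng(2)
trials = rng.normal(size=(60, 500))
trials[:, 150:200] += 0.3                      # hypothetical evoked effect
tune, test = trials[::2], trials[1::2]         # interleaved split

def baseline_correct(epochs, window):
    i0, i1 = (int((t0 + t) * sfreq) for t in window)
    return epochs - epochs[:, i0:i1].mean(axis=1, keepdims=True)

def snr(epochs):
    onset = int(t0 * sfreq)
    return abs(epochs[:, onset:].mean()) / epochs[:, :onset].std()

# choose the baseline window purely on SNR of the tuning half...
candidates = [(-0.3, -0.1), (-0.2, 0.0), (-0.1, 0.0)]
best = max(candidates, key=lambda w: snr(baseline_correct(tune, w)))

# ...then run the statistical test only on the untouched half
clean = baseline_correct(test, best)
result = stats.ttest_1samp(clean[:, 150:200].mean(axis=1), 0)
print(best, result.pvalue)
```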
But how does one tinker with parameters? There is a large space to explore and a variety of possible outcomes. We don’t have the habit of reporting summary measures for such a series of highly related statistics, and besides, we might not want to turn all our papers into methodological inquiries. Moreover, while data exploration is often described as a wild search where anything is blindly correlated with anything else, the reality of exploratory research is that tests are performed in steps, and between these steps we stop, think, and constrain the space of further tests based on what we have already observed. However, once I started thinking about how to approach this problem of the relationship between preprocessing and study outcomes, I realized I was sorely missing guidelines for exploratory research.
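I don’t have those guidelines, but one way the “explore in constrained steps” idea could look in code is a toy example like the one below: keep an explicit log of every exploratory test, stop and look at it, and only then decide on a narrower next round. The stand-in analysis and the parameter values are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
trials = rng.normal(size=(60, 500))
trials[:, 150:200] += 0.3                      # hypothetical effect

def effect_p(smoothing):
    """Stand-in analysis: smooth each trial, then t-test the effect window."""
    kernel = np.ones(smoothing) / smoothing
    smoothed = np.apply_along_axis(np.convolve, 1, trials, kernel, mode="same")
    return stats.ttest_1samp(smoothed[:, 150:200].mean(axis=1), 0).pvalue

exploration_log = {}

# step 1: a deliberately coarse scan of the smoothing parameter
for k in (1, 5, 25, 125):
    exploration_log[k] = effect_p(k)

# ...stop, look at the log, think...

# step 2: a finer scan, restricted to the region that looked stable above
for k in (3, 7, 10, 15, 20):
    exploration_log[k] = effect_p(k)

print(exploration_log)   # the whole exploration history, not just the final test
```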
Then again, our large datasets come with some advantages. We can re-run our entire analyses in parts of the brain where we don’t expect to see an effect (e.g. in the ventricles) or at moments when nothing experimentally relevant is happening. If these results come up as significant, we know we have a problem. Expected null results can, therefore, serve as a useful sanity check. This, again, is not something immediately obvious in the way the reproducibility crisis is tackled in the behavioural literature.
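A toy version of that sanity check on synthetic data (the windows and the threshold are placeholders): run exactly the same test in a window where nothing should be happening, and worry if it comes out significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
trials = rng.normal(size=(60, 500))            # (trials, time), onset at sample 125
trials[:, 300:350] += 0.4                      # hypothetical post-stimulus effect

def window_test(epochs, start, stop):
    return stats.ttest_1samp(epochs[:, start:stop].mean(axis=1), 0).pvalue

p_effect = window_test(trials, 300, 350)       # where we expect the effect
p_control = window_test(trials, 0, 50)         # pre-stimulus: should be null

if p_control < 0.05:
    print("control window significant - something is off with the pipeline")
print(f"effect p={p_effect:.4f}, control p={p_control:.4f}")
```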
If you would like to hear how I found my own way out of these dilemmas, click on my talk below:
(This blog post is a preview of the themes in my talk at the Oxford Reproducibility School, held on September 27, 2017.)
Below are some comments on the post.
Raymundo Neto
Hi Anna,
I really enjoyed the post and lecture. It shows much of the uncertainty people working on MEG/EEG/fMRI feel about all the preprocessing steps, and also about generating clear hypotheses from the outset of a research inquiry.
One thing that you mentioned, and that had never occurred to me, is that if a phenomenon is stable enough not to change after tinkering with preprocessing parameters, this should give us more confidence in the presence of that phenomenon, whereas phenomena whose effect sizes change wildly across different preprocessing parameters should be viewed with more skepticism. I really liked this idea, but it got me wondering: why do we need to adjust preprocessing steps at all? The first thing that occurred to me is that we might want to find the best estimate of the effect size we are measuring, so the preprocessing steps can be used to improve the quality of the data. I’m just not sure if that is a good reason. What do you think?
I believe, though, that there is a caveat to the preprocessing tinkering process: if a particular effect is unstable across all the possible preprocessing parameters, it doesn’t mean it isn’t an effect worthy of further investigation. The “correct” preprocessing parameters (or range of parameters) might be necessary to reveal the effect. I’m unsure how to solve this problem. Any thoughts?
Thanks for making all that material available!
Ana
Hi Raymundo, thanks for commenting.
What’s important here is that in the world of MEG you’re nearly always writing your own preprocessing pipeline. Often there is no strong reason to choose one set of parameters over some other set. In some cases this is done by default for you, by some software. But FieldTrip (an excellent and widely used toolbox for MEG data analysis) is purposefully made so that you, the user, have to decide on numerous parameters. The reasoning is that you should understand what you’re doing if you’re analyzing such precious data, and shouldn’t be deciding blindly on things like filters, which could very well artifactually alter your results if you don’t understand how they work. The outcome is that everybody tinkers, because that’s how you learn what you are doing.
What’s important here is that in the world of MEG you’re nearly always writing your own preprocessing pipeline, and often there is no strong reason to choose one set of parameters over another. In some cases software makes these decisions for you by default. But FieldTrip (an excellent and widely used toolbox for MEG data analysis) is purposefully designed so that you, the user, have to decide on numerous parameters. The reasoning is that you should understand what you’re doing if you’re analyzing such precious data, and shouldn’t be deciding blindly on things like filters, which could very well artifactually alter your results if you don’t understand how they work. The outcome is that everybody tinkers, because that’s how you learn what you are doing.

Then one day you don’t tinker anymore, because you are happy with your own preprocessing pipeline and you keep it as is. What happens to me regularly is that once I’ve got some data, I present it within the institution where I’m working, and then people start asking methods questions and inevitably suggesting that I do something differently. “Why don’t you try this?” is an easy question to ask, and since there are no strong reasons against doing things a little bit differently, people will always nod along. And then, well, you asked for feedback, so it makes sense to make use of it…
Or let’s say you’re learning something new. I was learning about decoding last year, and it was not a straightforward process where I first read all the relevant papers, understood all the theory, and only then went to mess around with the data. Instead, as I was collecting data, I was trying things out: first getting the scripts to work, then understanding what each line means, then changing things around so that the analysis does what I want it to, then changing more to test the limits of what it can do… And along the way I kept coming across suggestions to smooth the data, to discard the majority of ICA components, and so on. So I would go back and start from the beginning, now with one extra step. Eventually I had a set of steps that I was happy with, and I continued preprocessing the last third of the subject pool with fixed parameters. I don’t think I could have understood it well enough without trying things out on real data.
(Of course ideally I’d have one full exploratory dataset and one confirmatory, about 60 subjects in total. That’s not gonna happen soon in the world of neuroimaging.)
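To give a flavour of the kind of decoding step I mean (this is a generic scikit-learn sketch on made-up data, not the actual MEG pipeline from the study): a cross-validated classifier on trial-wise features, with the learned preprocessing, here just scaling, fitted inside the cross-validation folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 50))                 # 120 trials x 50 features
y = rng.integers(0, 2, size=120)               # two experimental conditions
X[y == 1] += 0.2                               # hypothetical class difference

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)      # 5-fold cross-validation
print(f"decoding accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```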
So tinkering is just part of the process. And since it is, the question is what to do with what it gives you. You could use it to p-hack, or you could use it to strengthen your confidence in the results, or you could keep to parameters others have used before you, or you could ignore it all and consider the last set of steps the ‘true’ ones.
I think there are two importantly different types of tinkering. One is steps that clearly improve your ability to look at the data. Source localization is superior to sensor-level analyses, for example, and there is no doubt about which to trust more. You still might want to do a sensor level sanity check before going through the complicated steps of source localization, but I don’t think anyone would consider it wrong to report just the source level results in the end. The other type of tinkering is steps where there is no good reason why one thing would be better than another. In my case I had wildly different gamma band effects depending on small shifts of the baseline, and there was no reason I could see for this. Note that gamma is very faint and super noisy anyway.
I’ve seen Andrew Gelman suggest – in behavioural data – that one should just analyze the data in every possible way and present all the p-values in a graph. So instead of one t-test you run 150 t-tests where something small is changed prior to clicking the button to run the test. And then you take a look at your effect and get a feeling for its robustness. I think our data are too complex and it would be more like running 1500 t-tests for each one, which is where the exploratory constraining of the space of reasonable tests comes in.
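A toy version of that idea on made-up data: nudge the baseline window and the analysis window a little, collect every p-value from the same underlying test, and look at the whole distribution rather than reporting a single number. The windows, shifts and threshold below are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sfreq, t0 = 250, 0.5
rng = np.random.default_rng(6)
power = rng.normal(size=(60, 500)) + 0.15      # e.g. noisy band-limited power
power[:, 150:250] += 0.2                       # hypothetical post-stimulus effect

pvals = []
for b_start in np.arange(-0.30, -0.05, 0.05):  # small shifts of the baseline window
    for win in [(150, 200), (175, 225), (200, 250)]:
        i0 = int((t0 + b_start) * sfreq)
        i1 = int((t0 + b_start + 0.1) * sfreq)
        corrected = power - power[:, i0:i1].mean(axis=1, keepdims=True)
        p = stats.ttest_1samp(corrected[:, win[0]:win[1]].mean(axis=1), 0).pvalue
        pvals.append(p)

plt.hist(pvals, bins=20)
plt.xlabel("p-value across analysis variants")
plt.show()
```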
Of course you can also just look at a part of the data that is not related to the tests you will run, and just adjust parameters to get that bit as crisp as possible, and then go on to stats from there. I like that as well. But I find it hard to not check whether my results are robust to tinkering. I would really like to be sure that someone else would get my result by looking at my data.
(N.b. Actually much later I came up with a cognitive hypothesis why my gamma-baseline thing was happening, and have been looking into it in a new experiment which confirmed my hunch – there was very likely some anticipatory pre-activation going on in the auditory cortex. But still, if I hadn’t tinkered, I wouldn’t have seen it. I think in the end you are bound to tinker at least a little bit and what you do with it will boil down to research integrity.)