Scientific methods and the research questions they are applied to fit each other to varying degrees. Take the reproducibility crisis in behavioural psychology. In social psychology the crisis is quite prominent, with textbook findings being actively questioned. In cognitive psychology (inflated effect sizes notwithstanding) there is less of a feeling of crisis, and yet the two fields use similar experimental methods. The same methods, therefore, when applied to different questions, might lead to outcomes that are not equally reliable. One might say that social psychology deals with subtler phenomena, which are therefore more difficult to capture. What often goes unmentioned, however, is that the psychology of individual differences, with its equally subtle questions about personality, does not have a reproducibility problem. The critical difference, I believe, is that it carved out its own research methods early on, working out a good fit with the questions it asks.
In the same vein, we should wonder whether all the lessons from behavioural research apply to cognitive neuroscience, that is, whether measures to increase research robustness are a one-size-fits-all package, or whether we might benefit from carving out a set of practices that fit the kind of data we deal with. In my talk, Dilemmas of an early career researcher, I focused on the reproducibility problems I encountered in my own research.
One of the obvious points of departure is how we think about multiple comparisons. In the behavioural literature, we hear about the garden of forking paths: going down numerous data analysis routes but reporting only one outcome while not mentioning the remaining tests. We hear about correcting for all the comparisons involved. In cognitive neuroscience the focus shifts from counting up the total number of tests performed towards assessing the degree of independence between the tests. If many adjacent voxels or time points show significance, we will trust this result more (not less). But if the significant results are scattered randomly throughout the brain, or appear at disjoint moments, we will doubt them. So when faced with the prospect of performing multiple statistical tests, as cognitive neuroscientists we will tend to think about quantifying the degree of dependence between their associated hypotheses, and might be uneasy about treating them as mutually independent by default.
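To make the contrast between clustered and scattered results concrete, here is a minimal sketch that groups significant tests into runs of adjacent time points. The p-values are made up for illustration, the function names are hypothetical, and this covers only the run-finding step; it is not a full cluster-based permutation test.

```python
import numpy as np

def significant_runs(p_values, alpha=0.05):
    """Return (start, stop) pairs for each run of adjacent significant tests.

    `stop` is exclusive, as in Python slicing.
    """
    sig = np.asarray(p_values) < alpha
    # Pad with False so every run has a rising and a falling edge.
    padded = np.concatenate(([False], sig, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)
    stops = np.flatnonzero(edges == -1)
    return list(zip(starts.tolist(), stops.tolist()))

def longest_run(p_values, alpha=0.05):
    """Length of the longest run of adjacent significant tests (0 if none)."""
    runs = significant_runs(p_values, alpha)
    return max((stop - start for start, stop in runs), default=0)

# Hypothetical p-values for 12 consecutive time points.
# Both series contain exactly four tests below alpha = 0.05.
clustered = [0.50, 0.04, 0.01, 0.02, 0.03, 0.60, 0.70, 0.80, 0.90, 0.40, 0.30, 0.20]
scattered = [0.04, 0.50, 0.60, 0.01, 0.70, 0.80, 0.02, 0.90, 0.40, 0.03, 0.30, 0.20]

print(significant_runs(clustered))  # [(1, 5)]
print(longest_run(clustered))       # 4
print(longest_run(scattered))       # 1
```

A plain count of significant tests treats the two series identically, whereas the run structure separates them, which is the intuition behind trusting adjacent effects more than isolated ones.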
But the main point of departure is that heavy data preprocessing happens before any statistical analyses are run. Preprocessing entails various semi-arbitrary decisions, such as choosing filter cut-offs. Then there are subjective decisions about which trials are noisy enough to discard. The data might be further smoothed; they might be decomposed into frequency bands using one of several different methods; a baseline might be applied using a given time window (or a different time window), or sometimes not at all. None of these steps is driven by considerations related to cognitive questions, and a variety of parameter ranges might work just fine, without solid theoretical reasons for preferring one set of numbers over another. Consequently, if someone else takes our raw data and tries to answer our research question, they might well end up making a different set of equally valid decisions. We expect our data to yield the same answers to our research questions regardless of these preprocessing choices. What I have found when fiddling with parameters is that this is not always a valid assumption.
What to do when that happens? The usual answer in the reproducibility literature is to constrain all the steps in advance, making note of all the decisions regarding handling the data. And indeed, this would allow someone else to do exactly what I have done, and to get the same result once they run a statistical test. An alternative method is to select parameters based on how they affect data quality (e.g. signal-to-noise ratio) on a part of the data where we will not be running the statistical tests. But to my mind both these approaches miss the point, which is that I am assuming my results are driven not by baselines and filtering, but by the cognitive phenomenon that I am manipulating. If small, reasonable changes in preprocessing decisions affect the cognitive answer I can give, then, to my mind, my result fails to be reproducible in a very basic sense. The two approaches outlined here simply prevent me from finding out whether this is the case.
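One way to picture such a robustness check is to re-run the same test over a small grid of equally defensible preprocessing settings and look at how much the statistic moves. The sketch below is only an illustration under strong assumptions: the data are simulated, the two preprocessing steps (baseline subtraction and a moving-average smoother) are simplified stand-ins, and all names and parameter values are hypothetical rather than taken from my own analyses.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_times = 20, 200

# Simulated per-subject difference waves (condition A minus condition B):
# unit-variance noise plus an effect of size 1.0 at samples 100-149.
data = rng.normal(0.0, 1.0, size=(n_subjects, n_times))
data[:, 100:150] += 1.0

def preprocess(x, baseline, smooth_width):
    """Subtract the mean of the baseline window, then apply a moving average."""
    start, stop = baseline
    x = x - x[:, start:stop].mean(axis=1, keepdims=True)
    kernel = np.ones(smooth_width) / smooth_width
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, x
    )

def t_statistic(x, window=(100, 150)):
    """One-sample t statistic of the window-averaged signal against zero."""
    means = x[:, window[0]:window[1]].mean(axis=1)
    return means.mean() / (means.std(ddof=1) / np.sqrt(len(means)))

# Hypothetical, equally defensible settings for two preprocessing decisions.
baselines = [(0, 50), (25, 75), (50, 100)]
smooth_widths = [1, 5, 11]

results = {
    (baseline, width): t_statistic(preprocess(data, baseline, width))
    for baseline, width in itertools.product(baselines, smooth_widths)
}

t_values = np.array(list(results.values()))
critical_t = 2.09  # approximate two-tailed 5% threshold for 19 degrees of freedom
print(f"t ranges from {t_values.min():.2f} to {t_values.max():.2f}")
print(f"share of pipelines above threshold: {(t_values > critical_t).mean():.2f}")
```

With a large simulated effect every pipeline agrees; the interesting case is real data, where the range of the statistic across the grid shows how much the conclusion leans on arbitrary choices.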
But how does one tinker with parameters? There is a large space to explore, and a variety of possible outcomes. We don’t have the habit of reporting summary measures for such series of highly related statistics, and besides, we might not want to turn all our papers into methodological inquiries. Moreover, while data exploration is often described as a wild search where anything is blindly correlated with anything else, the reality of exploratory research is that tests are performed in steps, and in between these steps we stop and think and constrain the space of further tests based on what we’ve already observed. Still, once I started thinking about how to approach the relationship between preprocessing and study outcomes, I realized I was sorely missing guidelines for exploratory research.
Then again, our large datasets come with some advantages. We can re-run our entire analyses in parts of the brain where we don’t expect to see an effect (e.g. in the ventricles) or at moments when nothing experimentally relevant is happening. If these results come up as significant, we know we have a problem. Null results can, therefore, serve as a useful sanity check. This again is not something immediately obvious in the way the reproducibility crisis is tackled in the behavioural literature.
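As a rough illustration of this kind of sanity check, the sketch below runs the identical test on a window where an effect was simulated and on a pre-stimulus window where nothing was added. The data, window boundaries, and function name are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_times = 20, 300

# Simulated per-subject difference waves: noise everywhere,
# plus an effect of size 1.0 at samples 150-199 only.
data = rng.normal(0.0, 1.0, size=(n_subjects, n_times))
data[:, 150:200] += 1.0

def window_t(x, start, stop):
    """One-sample t statistic of the window-averaged signal against zero."""
    means = x[:, start:stop].mean(axis=1)
    return means.mean() / (means.std(ddof=1) / np.sqrt(len(means)))

effect_t = window_t(data, 150, 200)  # window where the effect is expected
control_t = window_t(data, 0, 50)    # pre-stimulus window, no effect expected

print(f"effect window:  t = {effect_t:.2f}")
print(f"control window: t = {control_t:.2f}")
```

If the control window came out as strongly as the effect window, that would point to a problem in the pipeline rather than to a cognitive effect.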
If you would like to hear how I found my own way out of these dilemmas, click on my talk below:
(This blog post is a preview of the themes in my talk at the Oxford Reproducibility School, held on September 27, 2017.)