At the beginning of my postdoc, I told my supervisor I’d like to learn some fMRI on top of my usual electrophysiology. “Wonderful”, she said, “I have just the dataset for you!” It was a straightforward research question with a fairly simple manipulation, and it also happened to be along the lines of my interests. The data never got fully analysed after collection, and the hard drive sat there for almost 10 years, until I came along. It sounded too good to be true.
And it was. What ensued was an elaborate piece of detective work in which my supervisor, the researcher who had collected the data, and one other colleague rolled up their sleeves with me to figure out what exactly had been done. We weren’t sure which folder contained the final experiment and which files belonged to the pilot. Or was it pilots? The behavioural log files probably matched up with the neuroimaging files, but it wasn’t obvious to us how. It even seemed that the instructions on the screen didn’t match the way people were actually responding. Then there was another set of files, with the timings of the experimental events. We had to make sure these were accurate and linked to the correct people. It took months.
In the end, we were certain of exactly what we were looking at in only half of the datasets. Although we could play with the data and learn from it, we couldn’t conclude much from so few people. The disappointment of being on the cusp of answering an interesting question and then not being able to, because I couldn’t trust half the data – files that were sitting right in front of me – still stings.
In the meantime, things have changed in the world of fMRI. In his Oxford Reproducibility Lecture, “Practical tools for open and reproducible neuroimaging”, Tom Nichols talked about ongoing efforts to coordinate data collection and analysis so that, anywhere in the world, a researcher can take a look at someone else’s data and know exactly what they’ve done.
The story starts in 1999, when a group of researchers from Dartmouth College won a grant to set up a process for data sharing. They went on to ask journals to require data sharing, which led to quite a backlash, with senior scientists writing to Nature and Science to prevent it from happening. It looked like nobody wanted to share their data. About 100 studies were eventually released (connected to publications in the Journal of Cognitive Neuroscience), but the project ended in 2006.
Still, enough momentum gathered, and now, within the world of resting-state data (functional scans collected while people are not performing any task in the scanner, aimed at revealing which brain areas tend to light up together), it has become unusual to collect data but not share it. This is largely thanks to the 1000 Functional Connectomes Project (FCP) and the International Neuroimaging Data-Sharing Initiative (INDI). It started small, with a couple of researchers releasing their data and convincing their collaborators to do the same.
Data sharing is intrinsically linked to improving reproducibility. How the data are organised has to be thought about prior to sharing. The files could go into various folders with arbitrary labels (and their number will steeply increase once data analysis begins and new files are saved), but once sharing becomes part of the picture, norms begin to arise. Nowadays there are standards for how collected data should be organised, which can be found on the Brain Imaging Data Structure (BIDS) website, so researchers begin with the same set of folders and label their files according to the same nomenclature.
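To make that nomenclature concrete, here is a minimal sketch of my own (not taken from the talk) of what a BIDS-style layout might look like, written as a short Python script that simply creates placeholder folders and files; the study name, subject label and task label are made up, and the BIDS website remains the authoritative reference.

```python
from pathlib import Path

# A minimal, illustrative BIDS-style skeleton. Subject and task labels are
# placeholders; see the Brain Imaging Data Structure specification for details.
root = Path("my_study")

files = [
    "dataset_description.json",                  # description of the dataset as a whole
    "participants.tsv",                          # one row per participant
    "sub-01/anat/sub-01_T1w.nii.gz",             # anatomical scan for subject 01
    "sub-01/func/sub-01_task-rest_bold.nii.gz",  # a functional (resting-state) run
    "sub-01/func/sub-01_task-rest_bold.json",    # acquisition parameters for that run
]

for name in files:
    path = root / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()  # create empty placeholder files, just to show the layout

for path in sorted(root.rglob("*")):
    print(path)
```

Because the folder structure and file names themselves carry the meaning, anyone who knows the convention can tell at a glance which subject, task and scan type each file belongs to.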
Likewise, scripts have to be reasonably intuitive and accurately described when shared. Though preprocessing can vary, the website also suggests tools for these types of analyses. In this way, once researchers who keep to these standards become familiar with how their own dataset works, they are also familiar with how numerous other datasets work. Unlike me, they can just download and analyse away.
However, there are still numerous options available when preprocessing and analysing data. This is where the NeuroImaging Data Model (NIDM) comes in. It is an ambitious project to record all aspects of neuroimaging acquisition, analysis and results in a common language. Once the project is complete, a zip file with all the details of each data analysis can be shared, and another person will be able to understand what was done.
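To give a rough idea of what reading such an archive could look like in practice, here is a speculative sketch of my own (not from the talk). It assumes the zip contains a Turtle (.ttl) file describing the analysis as linked data, and uses Python’s standard zipfile module together with the third-party rdflib package; the archive name and file layout are placeholders, not a guaranteed format.

```python
import zipfile

from rdflib import Graph  # third-party: pip install rdflib

# Hypothetical example: open a NIDM-style results archive and list what it
# records about the analysis. "analysis.nidm.zip" is a placeholder name.
archive = "analysis.nidm.zip"

with zipfile.ZipFile(archive) as z:
    ttl_names = [n for n in z.namelist() if n.endswith(".ttl")]
    if not ttl_names:
        raise SystemExit("No Turtle (.ttl) metadata file found in the archive.")
    graph = Graph()
    graph.parse(data=z.read(ttl_names[0]).decode("utf-8"), format="turtle")

# Print every statement (subject, predicate, object) the archive makes about
# the analysis: software used, model settings, statistic maps, and so on.
for subject, predicate, obj in graph:
    print(subject, predicate, obj)
```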
Listen to Tom talk about these initiatives, and much more, by clicking on the video below.
(This blog post is a preview of the themes in Tom Nichols’ talk at the Autumn School of Cognitive Neuroscience, held on September 28, 2017. I occasionally use my own words to describe the contents, but the presented ideas do not deviate from the talk.)