It was when Kate Button started her PhD that she began to appreciate the disconnect between the methods teaching she had received beforehand and the knowledge she needed for actual research. Her early statistics lessons were more about determining which line of the SPSS output to read than about the inferential process. Little was said about the connection between study design and sample size, and even less about which factors determine whether a statistically significant finding is likely to hold up in a new sample. What she describes is a common theme in scientific circles: initial statistics training has to be unlearned and then re-learned before it works well. Kate decided to tackle the problem at its roots: she now teaches undergraduates about robust research before bad practices get the chance to become ingrained.
Even senior researchers don’t necessarily adopt the most rigorous stance when it comes to statistical power, data exploration, transparency, bias, or giving weight to negative results. In science, we are rewarded for novelty, creativity, and insight, not for research robustness. In undergraduate research, these problems become magnified. Undergrads have limited time to run their projects. They rarely have access to interesting participant pools. No funds come attached to their studies. They are still learning the basics of research, and their knowledge of the literature is often not particularly solid either. And they know they will be assessed, so they want to dazzle us with the positive results that we like so much.
This means that numerous undergrad studies will be conducted each year, but they will tend to be small, underpowered, and poorly designed. They will test novel hypotheses rather than aiming at replication, with scope for p-hacking, multiple testing, and a variety of other questionable research practices. Some of these studies will nevertheless come out statistically significant. Those students will have hit the jackpot: they will get their name on a paper, a significant career advantage.
But how reliable are these results? Suppose, realistically, that 90% of these student projects test a true null hypothesis, and that the significance level is set at 5%. The average power of such undergrad studies is about 20%. If one hundred student projects contain ten true associations, we will detect only two of them (20% power). Of the remaining 90 null effects, we will falsely ‘discover’ four or five as significant (the 5% false-positive rate). In other words, eight of the ten true associations go undetected, while only two of the six or seven significant findings are actually true. Do we want to put our students in a position where their name is attached to a finding that is more likely than not to be false?
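For readers who want to see the arithmetic spelled out, here is a minimal Python sketch of the calculation above. The numbers are the same illustrative assumptions used in this paragraph (100 projects, 10% true effects, 20% power, 5% significance level), not data from real student projects:

```python
# Expected outcomes for 100 hypothetical undergrad projects,
# using the illustrative assumptions from the paragraph above.
n_studies = 100
prior_true = 0.10   # proportion of projects testing a real effect
power = 0.20        # average statistical power
alpha = 0.05        # significance threshold

true_effects = n_studies * prior_true                    # 10 real associations
true_positives = true_effects * power                    # ~2 detected
false_positives = n_studies * (1 - prior_true) * alpha   # ~4.5 false 'discoveries'
ppv = true_positives / (true_positives + false_positives)

print(f"True positives:  {true_positives:.1f}")
print(f"False positives: {false_positives:.1f}")
print(f"Chance that a significant finding is real: {ppv:.0%}")  # roughly 31%
```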
Kate now applies three principles to undergraduate projects at her university.
- Run an a priori power calculation to determine how many participants are needed to draw meaningful conclusions (a minimal example follows this list).
- Pre-register the study protocol and analysis plan, so that all variables of interest and planned statistical tests are specified in advance.
- Work collaboratively across institutions.
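As an illustration of the first principle, here is a minimal a priori power calculation in Python using statsmodels. The two-group design and the medium effect size (Cohen's d = 0.5) are assumptions chosen for the example, not details taken from Kate's projects:

```python
# A priori power calculation for a two-sample t-test (illustrative assumptions).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,          # assumed medium effect (Cohen's d)
    alpha=0.05,               # significance level
    power=0.80,               # desired statistical power
    alternative='two-sided',
)
print(f"Participants needed per group: {n_per_group:.0f}")  # about 64
```

With an assumed medium effect, an adequately powered two-group comparison already needs roughly 128 participants, far more than a single undergraduate project typically recruits, which is exactly why pooling data across students and institutions helps.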
This turns small projects into big, adequately powered ones. Measuring at multiple sites also increases transparency and quality of reporting, because everybody needs to follow the same protocols and operating procedures. Each student can choose a variable of interest in advance, with a hypothesis of their own to test and results to present. And while students have less creativity and scope for individuality in this sort of work, they learn more. After all, the preference for novelty over reliability is one of the causes of the replication crisis, so why not train them to do things carefully from the start?
Listen to Kate’s talk to hear more about her motivation to change how she works with undergrads, and about the many careful details of how these projects are now conducted at her university:
(This blog post is a preview of the themes in Kate Button’s talk at the Oxford Reproducibility School, held on September 27, 2017. I occasionally use my own words to describe the contents, but the presented ideas do not deviate from the talk.)