In response to charges that their field is churning out unreliable science, psychologists this month issued a defense that may be tough to dispute. At issue was a claim, published in the journal Science, that only 39 of 100 experiments published in psychology papers could be replicated. The counterpoint, also published in Science, questioned the assumption that the other 61 results must have been wrong.
If two experimental results are in conflict, who’s to say the original one was wrong and not the second one? Or maybe both are wrong if, as some argue, there’s a flaw in the way social scientists analyze data.
This is an important puzzle, given the current interest in drawing conclusions from huge sets of data. And it’s not just a problem for psychologists. Researchers have also had trouble replicating experimental results in medicine and economics, creating what’s been dubbed “the replication crisis.”
Some insights come from a new paper in the Journal of the American Medical Association. While previous discussions of the replication crisis have focused on the way scientists misuse statistical techniques, this latest paper points to a human fallibility component – a marketing problem, which boils down to a universal human tendency, shared by scientists, to try to put their best foot forward.
At the center of both the math and the marketing problem is the notion of statistical significance – roughly, a measure of the odds of getting a given result due to chance. Computing statistical significance is a way to protect scientists from being fooled by randomness. People’s behavior, performance on tests, cholesterol measures, weight and the like vary in a random way. Statistical significance tests can prevent scientists from mistaking such fluctuation for the workings of a drug or the miracle properties of artichokes.
Statistical significance in medicine and social science is expressed as a p value, which represents the odds that a result at least that striking would occur by chance if there’s no effect from the diet pills or artichokes being tested. Popular press accounts make much of the potential for trouble when p values are in the hands of scientists. A headline at the website “Retraction Watch” claimed, “We’re Using a Common Statistical Test All Wrong,” and Vox ran with “An Unhealthy Obsession with P-values is Ruining Science.”
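One way to make that definition concrete is a shuffle test, sketched here in Python. It is an illustration rather than the method any of the studies mentioned here used: the function name, the made-up weight changes and the number of shuffles are all assumptions. The idea is to scramble the group labels many times, which is what "no effect" would look like, and count how often chance alone produces a gap at least as big as the one observed.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Two-sided permutation p value for a difference in group means.

    Hypothetical helper for illustration: the share of random relabelings
    whose gap in means is at least as large as the observed gap.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def gap(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = gap(pooled)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # relabel at random: the "no effect" world
        if gap(pooled) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


if __name__ == "__main__":
    # Invented weight changes (pounds) for a pill group and a placebo group.
    pill = [2.1, 1.8, 2.5, 3.0, 2.2, 1.9, 2.7, 2.4]
    placebo = [0.1, -0.3, 0.4, 0.0, 0.2, -0.1, 0.3, 0.5]
    print("p value:", permutation_p_value(pill, placebo))
```

A small result means chance rarely produces a gap that large. It says nothing about how big or how important the gap is.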
It’ll take more than that to ruin science, though, since many fields don’t use p values the way clinical researchers or social scientists do. The problem, as JAMA author John Ioannidis sees it, is partly in the way medical researchers use p values as a marketing tool.
Statistical significance is a continuum – a measure of probability – but in medical research it’s been turned into something black or white. Journals have informally decided that results should be considered statistically significant only if the p value is 5 percent or lower. (Since most scientists hope their results are not due to chance, lower is better.)
Ioannidis worries that researchers are making too much of this arbitrary cutoff. He sifted through millions of papers and found that most advertised their statistical significance up high, in the abstract, while burying important but perhaps less flattering aspects of the study. A statistically “successful” drug may only reduce the risk of a disease from 1 percent to 0.9 percent, for example, or raise life expectancy by 10 seconds.
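The arithmetic behind that drug example can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a calculation from the JAMA paper: it uses a standard two-proportion z-test with a normal approximation, treats the 1 percent and 0.9 percent figures as the observed rates in two equal-sized groups, and the group sizes are invented.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(risk_a, risk_b, n_per_group):
    """Two-sided p value from a pooled two-proportion z-test.

    Assumes equal group sizes and a normal approximation (illustrative only).
    """
    pooled = (risk_a + risk_b) / 2
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(risk_a - risk_b) / std_err
    return 2 * (1 - NormalDist().cdf(z))


if __name__ == "__main__":
    # Same tiny benefit (1.0% risk down to 0.9%), two hypothetical trial sizes.
    for n in (10_000, 1_000_000):
        p = two_proportion_p_value(0.010, 0.009, n)
        print(f"{n:>9,} per group: p = {p:.3g}")
```

The benefit is identical in both runs, a tenth of a percentage point. Only the number of patients changes, and with enough of them even a sliver of an effect clears the 5-percent bar.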
Just as food manufacturers have advertised all manner of products as low-cholesterol, all natural, fat- or sugar-free, hoping to give the impression of health benefits, so scientific papers have advertised themselves as statistically significant to give the impression of truth.
The same 5-percent cutoff is used in a lot of social science and has been a source of trouble there too.
In 2011, the psychologist Uri Simonsohn showed that it’s all too easy to produce bogus results even in experiments that clear the 5-percent p-value bar.
He set up an experiment to show that he could use accepted techniques to obtain a result that was not just ridiculous but impossible. He divided students into two groups, one hearing the song “Kalimba” and the other “When I’m 64.” Then he collected data on both groups, looking for differences between them.
He found something that varied by chance – the ages of people in the groups. Then, using math tricks that had been common in his field (but are considered cheating by statisticians), he showed that it was possible to come up with a statistically significant claim that listening to “When I’m 64” will make people get 1.5 years younger.
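Why such tricks work can be shown with a simplified sketch. It does not reproduce the procedure in the 2011 study. It assumes a researcher takes several independent looks at data in which nothing is going on and reports whichever look clears the bar. The function names and the independence assumption are illustrative, and real analyses usually involve looks that overlap.

```python
import random


def chance_of_false_positive(n_looks, alpha=0.05):
    """Probability that at least one of n independent no-effect tests has p < alpha."""
    return 1 - (1 - alpha) ** n_looks


def simulate_false_positive_rate(n_looks, alpha=0.05, n_studies=20_000, seed=0):
    """Share of simulated no-effect studies with at least one 'significant' look.

    With no real effect, a p value from a continuous test is equally likely
    to land anywhere between 0 and 1, so each look is drawn uniformly.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        best_p = min(rng.random() for _ in range(n_looks))
        if best_p < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for looks in (1, 5, 10, 20):
        print(looks, "looks:",
              round(chance_of_false_positive(looks), 3),
              "by formula,",
              round(simulate_false_positive_rate(looks), 3),
              "by simulation")
```

Under these assumptions, one look gives a 5 percent chance of a spurious finding, and ten looks push it to roughly 40 percent.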
Statistician Ron Wasserstein agrees that there is a right way and a wrong way to use statistical tools. And that means those trying to replicate studies can also get it wrong, which was the concern of those psychologists defending their field.
Getting a different p value in a replication effort is not enough to discredit an existing study. Imagine, he said, you are trying to replicate a study that showed that cats gained weight eating Brand X cat food. The original result shows the cats got fatter, with a p value of 2 percent – a gain that large would turn up only 2 percent of the time if the food made no difference. A new study also shows they got fatter, but with a p value of 6 percent.
Is it fair to call the original experiment a failure because the second result missed the 5-percent p-value cutoff? Should we assume that Brand X is not fattening? There’s not enough information to draw a conclusion either way, Wasserstein said. To get an answer you’d also want to know the size of the effect. Did the cats gain pounds or ounces? Did the cats eat more of the food because it tasted good, or was it more fattening per bowl? Statistical significance has to be weighed alongside other factors.
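A back-of-the-envelope check, sketched in Python, shows how little separates the two cat studies. It rests on assumptions the example does not supply: both p values are two-sided, both studies point the same way and are equally precise, and the estimates are roughly normal. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_from_two_sided_p(p_value):
    """Size of an effect, in standard errors, implied by a two-sided p value."""
    return STD_NORMAL.inv_cdf(1 - p_value / 2)


def p_for_difference(p_original, p_replication):
    """Two-sided p value for the gap between two studies' estimates.

    Assumes independent studies with the same sign and the same standard
    error, so the gap has a standard error sqrt(2) times larger.
    """
    z_gap = abs(z_from_two_sided_p(p_original)
                - z_from_two_sided_p(p_replication)) / sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(z_gap))


if __name__ == "__main__":
    print("original study, effect in standard errors:",
          round(z_from_two_sided_p(0.02), 2))
    print("replication, effect in standard errors:",
          round(z_from_two_sided_p(0.06), 2))
    print("p value for the gap between them:",
          round(p_for_difference(0.02, 0.06), 2))
```

Under those assumptions the two studies differ by far less than chance alone would explain, even though one lands on each side of the 5-percent line.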
Science is a way of seeing the world more as it is, and less as we’d like it to be. Statistical techniques were invented by people who dreamed that the power of physics and chemistry might extend to a world of previously unpredictable phenomena, including human behavior. There may yet be something to it, once people work out the kinks.