If there is one lesson to be learned from this November’s Council for Responsible Nutrition (CRN; Washington, DC) annual conference, it is this: the nutrition science paradigm as we know it desperately needs a rethink. The scientists speaking at the event discussed why today’s nutrition research models and tools need changing, touching on everything from the pitfalls of meta-analyses to the shortfalls of the randomized controlled trial (RCT) design for nutrients. Another problem? The way scientists employ statistical analysis to draw conclusions from study results. One statistician explained to conference attendees how researchers are misapplying statistical tools in nutrition studies.
Researchers rely heavily on the p-value (a value between 0 and 1) as a measure of the strength of study results. In very simplified terms, the p-value is the probability of seeing results at least as extreme as those observed if the null hypothesis were true, that is, if there were no relationship between a nutrient and a proposed physiological effect. The lower the p-value, the harder the data are to reconcile with the null hypothesis, and by convention a p-value below 0.05 is called statistically significant. It is not, however, the probability that the null hypothesis itself is true or false. By contrast, if a p-value equals or surpasses 0.05, the strength of the study’s results is debated. Research journals, for instance, usually don’t publish studies whose p-values are above 0.05.
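To make the 0.05 threshold concrete, consider a minimal sketch in Python. It assumes a hypothetical study (not one discussed at the conference) in which 16 of 20 participants improve on a nutrient, and a null hypothesis under which each participant is equally likely to improve or not. The function name `sign_test_p_value` and the numbers are illustrative assumptions. The p-value here is simply the probability of a result at least this lopsided if the nutrient did nothing.

```python
from math import comb


def sign_test_p_value(n_participants: int, n_improved: int) -> float:
    """One-sided exact p-value for a simple sign test.

    Returns the probability that at least `n_improved` of `n_participants`
    improve when, under the null hypothesis, each one improves with
    probability 0.5 (an assumption made for this illustration).
    """
    # Count the equally likely outcomes that are at least this extreme.
    tail = sum(comb(n_participants, k)
               for k in range(n_improved, n_participants + 1))
    # Divide by the total number of equally likely outcomes.
    return tail / 2 ** n_participants


# Hypothetical study: 16 of 20 participants improve.
p = sign_test_p_value(20, 16)
print(f"p = {p:.4f}")  # prints "p = 0.0059"
```

Under these assumptions the result falls well below 0.05, but the number by itself says nothing about how large or how clinically meaningful the improvement is.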
But, according to Regina Nuzzo, PhD, a statistics professor at Gallaudet University (Washington, DC), this is where we go wrong.
Today, we assume that the p-value is a “be all, end all” indicator of clinical significance. And if a p-value is above 0.05, we reject the findings entirely. But, Nuzzo argued, the p-value was never designed to be a tool by which to “make decisions”—decisions such as whether a study’s hypothesis is definitely true or false. Instead, she said, the p-value simply tells us whether data may be “worthy of a second look” and merit further investigation.
“Research has shown that people have this sort of feeling that rather than a p-value being a continuous measure [of] the strength of evidence, people are making it into a black-and-white issue, and the real danger is that [every study]…that’s above 0.05 ends up in the filing drawer, never to be published,” she said.
There are many reasons why a p-value should not be considered the sole determinant of clinical significance. One reason is that in a “tough research environment,” the kind most nutrition researchers in fact find themselves in, p-values can easily become unreliable. Nutrient studies are often “underpowered,” making for “noisy data”: sample sizes are likely to be small, effect sizes are likely to be small, and study populations are often highly diverse. And there is also that larger, serious problem, which we’ve discussed numerous times in this publication: the fact that the typical RCT study model is a poor fit for nutrition research in the first place.
“What it really means is that when we have that 0.05 threshold for publication, essentially it’s too high,” Nuzzo said. “We’ve set the bar too high, and by design, no study is actually strong enough to make a definitive judgment. It just won’t work.”
The misuse of the p-value today tricks many of us into thinking that associations are stronger than they really are. “When we do see…a finding…that makes it over the threshold, we say, ‘Ah, yes, that’s wonderful.’ We seize upon it and publish it, when really it wasn’t that strong,” Nuzzo said. “It appears stronger than it really was because it just worked out in that sample: you had the perfect sample, the perfect conditions to send [the p-value] over the threshold…It just had the random good luck to be there.”
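The effect Nuzzo describes can be illustrated with a small simulation. The sketch below is not drawn from her talk. It assumes a modest true effect (0.2 standard deviations), 20 participants per group, outcomes with a known standard deviation of 1, and a simple two-sided z-test. The function name and every parameter are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def simulate_small_studies(true_effect=0.2, n_per_group=20,
                           n_studies=5000, seed=1):
    """Simulate many small two-arm studies of the same true effect.

    Assumes unit-variance normal outcomes and a known-variance z-test.
    Returns (all effect estimates, estimates from studies with p < 0.05).
    """
    rng = random.Random(seed)
    # Standard error of a difference between two group means (sigma = 1).
    se = math.sqrt(2 / n_per_group)
    all_estimates, significant = [], []
    for _ in range(n_studies):
        treated = statistics.fmean(
            rng.gauss(true_effect, 1) for _ in range(n_per_group))
        control = statistics.fmean(
            rng.gauss(0, 1) for _ in range(n_per_group))
        estimate = treated - control
        z = estimate / se
        # Two-sided p-value from the standard normal distribution.
        p = math.erfc(abs(z) / math.sqrt(2))
        all_estimates.append(estimate)
        if p < 0.05:
            significant.append(estimate)
    return all_estimates, significant


all_estimates, significant = simulate_small_studies()
print("share of studies with p < 0.05:",
      round(len(significant) / len(all_estimates), 3))
print("mean estimate, all studies:",
      round(statistics.fmean(all_estimates), 3))
print("mean estimate, significant studies only:",
      round(statistics.fmean(significant), 3))
```

Under these assumptions only a small minority of the simulated studies clear the threshold, and those that do report an average effect several times larger than the true one. The full set of studies, by contrast, averages out close to the true value.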
We reject a lot of research today simply because it fails to meet a target p-value, she said. And one can imagine the implications. Many good nutrient studies are being ignored, both as candidates for follow-up investigation and by policymakers, because we put so much stock in the p-value.
As a result, “Over the past few decades, there’s been increasing concern from researchers and statisticians about the quality of published scientific findings,” Nuzzo said. Also, “think about all the wasted time, money, and energy [that we spent] following up on things that were just not true.” What we need to do, she said, is to face the “uncomfortable fact that…the statistical tools that we’re using today were never designed to be used the way that we’re using them today.”
Unfortunately, most policymakers, journalists, consumers, and even scientists may not dig deeper and instead blindly assume that statistical tools are always objective and accurate. We take p-value conclusions at face value and, as a result, may be throwing out the baby (good nutrient research) along with the bathwater.
So what’s to be done about this problem? “It’s not a complete fix, but some journals are moving away from p-values and encouraging researchers to use effect size, confidence intervals, meta-analyses—things that get away from a poor measuring stick to things that are partly better,” Nuzzo said.
And, she added, “I’m also seeing a new culture, and this is very exciting. In a research environment where things are underpowered, it’s encouraging [to see] more sharing, more transparency, more replication—replication both after the fact, encouraging more people to try to replicate studies instead of just chasing after the novel findings—but also pre-publication replication, where before you even submit to publication, you try to replicate and prove that [the effect] is real—and not just replicate once, but replicate in numerous, different ways with enough power.”
In conclusion, she said, “[T]his pseudo-objectivity, this idea that you can boil everything down to a number and use statistical inference as a substitute for scientific thinking, is just not true.” She also encouraged a little bit of old-fashioned subjectivity. In other words, use your brain. “Smart human judgment: there’s no substitute for that,” she said.
Sidebar: The Problem of “P-Hacking”
“P-hacking,” or the practice of manipulating data in a way that alters the p-value, is another problem that persists. Researchers have many choices to make during the course of a research study—looking at different subgroups and comparison endpoints, for instance. “Really, [there is] an infinite amount of choices and decisions along the way,” Nuzzo said, and all of these choices can end up affecting the p-value and raising the chance of a false positive. In this way, researchers are “exploiting, perhaps unconsciously, research [data] until the p-value is less than 0.05.”
“Essentially, if you torture the data, it will confess,” she said.
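A rough calculation shows why all those choices matter. The sketch below makes a simplifying assumption that the source does not: each subgroup or endpoint a researcher checks is treated as an independent test of an effect that does not exist. Real analyses are usually correlated, so the figures are illustrative only, and the function name is hypothetical.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `n_tests` independent tests of a
    nonexistent effect produces p < alpha purely by luck."""
    # Each test avoids a false positive with probability (1 - alpha).
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 20):
    print(k, round(chance_of_false_positive(k), 3))
# prints:
# 1 0.05
# 5 0.226
# 20 0.642
```

Under that assumption, a researcher who tries 20 different cuts of the data has better-than-even odds of finding at least one “significant” result even when nothing is there.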