
phantom power

(25,966 posts)
Mon Aug 20, 2012, 05:02 PM

Live by statistics, die by statistics

Pretty interesting result -- scientists doing some science on themselves

There is a magic and arbitrary line in ordinary statistical testing: the p level of 0.05. What that basically means is that if the p level of a comparison between two distributions is less than 0.05, there is a less than 5% chance that your results can be accounted for by accident. We’ll often say that having p<0.05 means your result is statistically significant. Note that there’s nothing really special about 0.05; it’s just a commonly chosen dividing line.

Now a paper has come out that ought to make some psychologists, who use that p value criterion a lot in their work, feel a little concerned. The researchers analyzed the distribution of reported p values in 3 well-regarded journals in experimental psychology, and described the pattern.

The circles represent the actual distribution of p values in the published papers. Remember, 0.05 is the arbitrarily determined standard for significance; you don’t get accepted for publication if your observations don’t rise to that level.

[Figure from the linked post: observed counts of reported p values (circles) compared with the predicted distribution (solid line).]

Notice that unusual and gigantic hump in the distribution just below 0.05? Uh-oh.

I repeat, uh-oh. That looks like about half the papers that report p values just under 0.05 may have benefited from a little ‘adjustment’.

What that implies is that investigators whose work reaches only marginal statistical significance are scrambling to nudge their numbers below the 0.05 level. It’s not necessarily likely that they’re actually making up data, but there could be a sneakier bias: oh, we almost meet the criterion, let’s add a few more subjects and see if we can get it there. Oh, those data points are weird outliers, let’s throw them out. Oh, our initial parameter of interest didn’t meet the criterion, but this other incidental observation did, so let’s report one and not bother with the other.

http://scienceblogs.com/pharyngula/2012/08/13/live-by-statistics-die-by-statistics/


Notice that if you are reading a bunch of science papers reporting p-values right around 0.05, the implication is that roughly 1 in 20 of those results could be spurious, a false positive that turned up by chance! The only question is, which ones...

Another fun fact about the famous "0.05 pval threshold": suppose you are running an algorithm (like decision-tree model training) that performs *many* p-value tests. For example, training a single interior tree-node split can easily involve thousands of such tests on a large data set, and each test where there is no real effect still has a 5% chance of looking "significant" by chance. A p-value threshold of 0.05 is a common parameter in such training algorithms, and if you run the numbers, you find that the odds of your chosen split being "truly significant" can be a lot less than 95%. An adjustment such as the Bonferroni correction is often employed to compensate for this problem.
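
To make that concrete, here is a quick simulation of my own (not anything from the paper, and it assumes numpy and scipy are available): run a thousand t-tests where there is genuinely no effect, and count how many sneak under 0.05 with and without a Bonferroni adjustment. You should see roughly 1000 × 0.05 = 50 spurious hits at the raw threshold and essentially none after the correction.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 1000   # e.g. candidate splits evaluated while training one tree node
alpha = 0.05

# Both samples in every test come from the same distribution,
# so any "significant" result is spurious by construction.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(n_tests)
])

print("spurious hits at alpha = 0.05:          ", np.sum(p_values < alpha))
print("spurious hits with Bonferroni (alpha/n):", np.sum(p_values < alpha / n_tests))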



tblue

(16,350 posts)
1. So sorry. You lost me at "p level of 0.05."
Mon Aug 20, 2012, 05:07 PM

I got no idea what this is. I might even totally agree with it, but I'd have to have it explained to me first.

You must have a lot of brilliant friends. Good for you, really! I have a BA and an MBA, but this is a few grade levels above me.

phantom power

(25,966 posts)
2. I think I can explain:
Mon Aug 20, 2012, 05:18 PM

A lot of science experiments are designed this way: "I'm going to collect two samples of data (under two different conditions). If my theory is correct, those two samples will have a different average. Or, a different standard deviation, etc."

So, you can imagine: if you collect two samples like that, there is *some* probability that by bad luck, you'll get two different averages by random chance, and you'll be reporting that your theory is correct when it actually wasn't.

There's an entire (enormous) sub-field of statistics that does nothing except provide mathematical ways for us to measure that probability we were just unlucky. That's what is often called the p-value. So if you measure a p-value of 0.05, then it's saying the probability is 5% (1/20) that you just got unlucky.

As you can see, you'd like your experiments to give you p-values as small as you can get them, because that means your probability of reporting a spurious result is as small as possible. The particle physics guys typically won't say they've "confirmed" a new particle until their statistics are reporting p-values of 0.000001, or something. Other branches of science make do with larger thresholds like 0.05, partly because smaller sample sizes produce larger p-values, and collecting very large samples in fields like psychology, biology, or car-crash safety testing is sometimes impossible.
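
If you want to see that sample-size effect for yourself, here is a little simulation I'd sketch it with (my own toy example, not from the article; it assumes numpy and scipy): the same true difference between two groups clears the p < 0.05 bar far more often as the samples grow.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_shift = 0.5   # the second group's mean really is higher
n_repeats = 1000   # repeat each experiment many times to estimate how often it "works"

for n in (10, 30, 100, 300):
    hits = sum(
        stats.ttest_ind(rng.normal(size=n),
                        rng.normal(loc=true_shift, size=n)).pvalue < 0.05
        for _ in range(n_repeats)
    )
    print(f"n = {n:4d} per group: reached p < 0.05 in {hits / n_repeats:.0%} of experiments")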

drm604

(16,230 posts)
3. Could this at least partly be an artifact of choosing whether or not to publish?
Mon Aug 20, 2012, 05:26 PM

Maybe researchers are less likely to publish "statistically insignificant" (i.e., p >= 0.05) results. That could account for the sudden drop at higher values. Of course, the sudden drop to the left of the "hump" isn't explained by that.
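
One way to sanity-check that idea (a toy simulation of my own, not from the paper, assuming numpy and scipy): mix studies with and without real effects, throw away everything with p >= 0.05, and look at what survives. The filter chops off the right-hand side of the distribution, but it leaves the shape below 0.05 untouched, so on its own it shouldn't manufacture a bump just under 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_studies = 20000

p_values = []
for _ in range(n_studies):
    effect = 0.5 if rng.random() < 0.3 else 0.0   # 30% of studies have a real effect
    a = rng.normal(size=25)
    b = rng.normal(loc=effect, size=25)
    p_values.append(stats.ttest_ind(a, b).pvalue)
p_values = np.array(p_values)

published = p_values[p_values < 0.05]   # crude "only significant results get published" filter
bins = np.arange(0.0, 0.051, 0.01)
counts, _ = np.histogram(published, bins=bins)
for lo, c in zip(bins[:-1], counts):
    print(f"p in [{lo:.2f}, {lo + 0.01:.2f}): {c} papers")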

phantom power

(25,966 posts)
4. tentatively, I don't think so, because...
Mon Aug 20, 2012, 05:38 PM

if it were just a matter of "we aren't going to publish unless p < 0.05," then I would expect all the dots to the left of 0.05 to sit measurably higher than the black line of prediction, not just the ones bunched right below 0.05.

As Myers points out, there are a lot of non-deliberate ways this might be happening that are still, technically, forms of cooking the data. One example: you collect some data and find you didn't quite make 0.05, so you collect some more, retest, and keep doing that. It feels harmless, but it injects a kind of experimental bias, because every time you re-run the test you increase your odds of eventually getting the p-value you want. If you're *really* being a stickler, you would apply something like the Bonferroni adjustment to account for how many times you ran the test before reporting your final p-value.

To give a better answer, I'd have to read more about how the authors collected *their* data
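
For what it's worth, here is a rough simulation of that "collect a bit more and retest" bias (my own sketch, assuming numpy and scipy): even when there is no real effect at all, peeking at the p-value after every new batch of subjects and stopping as soon as it dips below 0.05 pushes the false-positive rate well past the nominal 5%.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_experiments = 2000
batch_size, max_batches = 10, 10   # start with 10 subjects per group, add 10 at a time

false_positives = 0
for _ in range(n_experiments):
    a, b = [], []
    for _ in range(max_batches):
        a.extend(rng.normal(size=batch_size))   # both groups come from the same distribution,
        b.extend(rng.normal(size=batch_size))   # so there is nothing real to find
        if stats.ttest_ind(a, b).pvalue < alpha:
            false_positives += 1                # declared "significant" and stopped collecting
            break

print(f"false-positive rate with optional stopping: {false_positives / n_experiments:.3f}")
print(f"nominal rate if you had tested just once:   {alpha}")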

bananas

(27,509 posts)
6. The mystery may be solved: "APA guidelines instruct ..."
Mon Aug 20, 2012, 08:29 PM
http://bps-research-digest.blogspot.com/2012/08/phew-made-it-how-uncanny-proportion-of.html

Tarmo Toikkanen 3:25 PM

Umm, APA guidelines instruct using the prechosen significance value, most often 0.05, in the text rather than the exact p-value. So this result is obvious and misleading.

Jim__

(14,075 posts)
8. There is a reply to that comment stating that the study only included papers that reported the exact p value.
Mon Aug 20, 2012, 09:28 PM

Christian Jarrett 3:51 PM

Thanks for this. Only papers that quoted the precise p value were included in the analysis. I'll add a clarification to make that clear.
 

Thor_MN

(11,843 posts)
9. With a bunch of statisticians P-ing all over the place
Wed Aug 22, 2012, 09:02 AM

there's bound to be some splashes near the toilet.
