p-values

“p-values are dangerous, especially large, small, and in-between ones.”

– Frank E. Harrell Jr., Prof. of Biostatistics and Department Chair, Vanderbilt University

 

No topic in Engineering Statistics is as poorly understood as “p-values.”  They are misused even more widely in other scientific endeavors, such as psychology and even medicine.  And the practitioners (we engineers, psychologists and physicians) are not to blame – the fault lies entirely with statistical educators who insist on teaching the comparatively useless and outmoded concept of Hypothesis Testing, rather than the utilitarian, and much more easily understood, concept of parameter estimation with confidence intervals.

The sad thing is that many scientific journals (even some well-respected ones) won’t accept an article unless it has the obligatory p-value to demonstrate that the result is “significant.”  Often what the result actually is (the size of the improvement in pharmacological efficacy, for example) gets buried under all the folderol over its associated p-value.  And when the value of interest is reported, it is seldom accompanied by a confidence interval that would tell the reader how strongly to believe it.

We are letting the tail wag the dog here, and it is time to re-order our intellectual priorities from being obsessed with meaningless hypothesis tests to concentrating on the magnitude and believability of the thing we are actually interested in.

Every engineer has suffered through Engineering Statistics 101, and learned several things:

  • statistics is inhumanly boring,
  • statistics is memorizing stuff you could easily look up since it’s all cook-book anyway,
  • a smaller p-value is better because a p-value is the probability that the null hypothesis is true.

I hope to convince you that all of these are untrue.  Two observations of W. Edwards Deming come to mind:

  • “Under the usual teaching, the trusting student, to pass the course, must forsake all the scientific sense that he has accumulated so far, and learn the book, mistakes and all.”
  • “Small wonder that students have trouble [with statistical hypothesis testing].  They may be trying to think.”

An interesting analog to p-value misunderstanding is R.P. Carver’s 1978 parable:

“What is the probability of obtaining a dead person (D) given that the person was hanged (H); that is, in symbol form, what is p(D|H)?  Obviously, it will be very high, perhaps .97 or higher.

“Now, let us reverse the question: What is the probability that a person has been hanged (H) given that the person is dead (D); that is, what is p(H|D)?  This time the probability will undoubtedly be very low, perhaps .01 or lower.

“No one would be likely to make the mistake of substituting the first estimate (.97) for the second (.01); that is, to accept .97 as the probability that a person has been hanged given that the person is dead.

“Even though this seems to be an unlikely mistake, it is exactly the kind of mistake that is made with the interpretation of statistical significance testing – by analogy, calculated estimates of p(H|D) are interpreted as if they were estimates of p(D|H), when they are clearly not the same.” – (Carver 1978)

Said differently: It is wrong to interpret a p-value as the probability that the null hypothesis is true.
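
To make the asymmetry concrete, here is a minimal sketch (in Python) of Bayes’ rule applied to the parable.  The 0.97 comes from Carver; the prior probability of hanging and the overall probability of death are purely illustrative assumptions:

    # Bayes' rule: p(H|D) = p(D|H) * p(H) / p(D)
    # The numbers below are illustrative assumptions, not data.
    p_D_given_H = 0.97    # probability of death given hanging (Carver's figure)
    p_H = 0.0001          # assumed prior probability that a person was hanged
    p_D = 0.01            # assumed overall probability that a person is dead

    p_H_given_D = p_D_given_H * p_H / p_D
    print(f"p(D|H) = {p_D_given_H:.2f}")    # near certainty
    print(f"p(H|D) = {p_H_given_D:.4f}")    # roughly 0.01, a very different number

The two conditional probabilities differ by two orders of magnitude precisely because they answer two different questions.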

Well, if it doesn’t mean that, what does it mean?  Answer: Not much.  In classical hypothesis testing you must declare beforehand what “significance level” you require.  By convention (and not by celestial edict) that is 0.05, which is to say that, IF the null hypothesis were true, only 5 in 100 experiments would produce a result extreme enough to cross that threshold.  (The null hypothesis, H0, is what you don’t want to be true. – Is it any wonder engineers hate statistics?)  So you observe a p-value of 0.001.  Since that is smaller than 0.05, you reject H0 at the 5% level.

You cannot conclude that the probability that H0 is true is 0.001, which is why you must pre-declare your desired significance level.  WAIT!  How can it possibly matter when I decide on significance?!  Ah!  But it does.  If you want to determine the probability that your alternative hypothesis, Ha, is true, then you must become a Bayesian.  Frequentist hypothesis testing only tells you how surprising your data would be under a hypothesis you never believed in the first place.
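
Here is a minimal sketch (in Python, with made-up measurements and an assumed null mean of 10.0) of what the classical recipe actually does, and of the only conclusion it licenses:

    # A minimal sketch of classical hypothesis testing on made-up measurements.
    import numpy as np
    from scipy import stats

    alpha = 0.05                     # significance level, declared in advance
    data = np.array([10.2, 9.8, 10.5, 10.9, 10.4, 10.7, 10.1, 10.6])

    # H0: the true mean is 10.0 (an assumed null value, for illustration only)
    t_stat, p_value = stats.ttest_1samp(data, popmean=10.0)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

    if p_value < alpha:
        print("Reject H0 at the 5% level.")          # NOT the probability H0 is true
    else:
        print("Fail to reject H0 at the 5% level.")

All the procedure delivers is a reject / fail-to-reject decision; the printed p-value is not the probability that H0 is true, and it says nothing about the size of the effect.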

Now, why fool around with that kind of statistical double-talk?  Forget about p-values and hypothesis testing.  Instead, estimate the most likely value of what you are interested in (a parameter, or a difference, say) and then compute its confidence interval.

If the interval includes zero, you would infer that the results could reasonably have happened by chance; but in any event you will have a value for the thing you are interested in, and a handle on how seriously you should believe it.
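
As an illustration, here is a minimal sketch (in Python, again with made-up samples) that reports the thing of interest, the difference between two group means, together with its 95% confidence interval via Welch’s approximation:

    # A minimal sketch: report the estimated difference and its 95% confidence
    # interval instead of only a p-value.  Both samples are made up.
    import numpy as np
    from scipy import stats

    a = np.array([10.2, 9.8, 10.5, 10.9, 10.4, 10.7, 10.1, 10.6])
    b = np.array([10.0, 9.6, 10.1, 10.3,  9.9, 10.2,  9.7, 10.4])

    diff = a.mean() - b.mean()              # the quantity we actually care about
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    se = np.sqrt(va + vb)                   # Welch standard error of the difference
    df = (va + vb)**2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))  # Welch-Satterthwaite
    t_crit = stats.t.ppf(0.975, df)         # two-sided 95% critical value
    lo, hi = diff - t_crit * se, diff + t_crit * se

    print(f"estimated difference = {diff:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")

Welch’s approximation is used here only because it does not require equal variances; the point is simply that the headline numbers are the estimate and its interval, not a p-value.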

Notes:

  • William Edwards Deming (1900 – 1993) was an American statistician, professor, author, lecturer, and consultant, perhaps best known for his work helping rebuild Japanese industry after WWII through the application of statistical methods, and thus changing the image of “made in Japan” from meaning “cheap” to meaning “high quality.”
  • Deming, W.E. (1975) “On probability as a basis for action.”  American Statistician 29: 146-152.
  • Carver, R.P. (1978) “The case against statistical testing.” Harvard Educational Review 48: 378-399.
  • Ziliak, S.T. and McCloskey, D.N. (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives.  University of Michigan Press.  While I do not agree with the authors in their vilification of R.A. Fisher (whom I believe they misunderstand) I do concur with their fundamental thesis that hypothesis testing has become a cult.