Choosing the Right Distribution

Cataloging Statistical Distributions

There is an unfortunate belief among those new to the subject that probabilistic analysis is really quite simple: choose the appropriate statistical distributions for all your random variables, and you’re pretty much home free.

Because the survey course in engineering statistics presents a dataset, and then presents a standard probability density to describe it, most engineers are left with the mistaken idea that every variable quantity follows some catalogued probability distribution, and that the entire purpose of statistics is to find which one.

Many random events do follow standard probability models, many of them normal. The mistake comes in believing they all do. This world-view is very narrow while seeming to be broad: if your catalogue has several dozen (univariate [1]) probability densities, then surely you can select the “best” one and get on with it.

A Real Counterexample:

The data are times between eruptions of the Old Faithful geyser.

While these data are not multivariate (there’s only one variable – time) they are multimodal, meaning they have more than one peak. Try to find that in your standard catalogue.

It is indeed possible to “fit” several standard probability distributions to these data. None, however, would provide a useful picture of reality. You could, for example, compute a sample mean and standard deviation for the eruption times, but that clearly would not make them normal. Sadly, many engineers don’t even plot their data against their model to see how well it performs.
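A quick numerical check makes the same point as a plot. The sketch below does not use the actual Old Faithful measurements; it generates synthetic two-peaked data from a made-up mixture of two normal distributions, "fits" a single normal by sample mean and standard deviation, and compares the observed fraction of data in each bin with the fraction the fitted normal predicts. The function names, mixture parameters, and bin edges are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def synthetic_bimodal(n, seed=0):
    """Made-up two-peaked data: 35% near 55 and 65% near 80 (not real eruption data)."""
    rng = random.Random(seed)
    return [rng.gauss(55, 5) if rng.random() < 0.35 else rng.gauss(80, 5)
            for _ in range(n)]


def binned_comparison(data, edges):
    """For each bin, return (lo, hi, observed fraction, fraction under the fitted normal)."""
    # "Fitting" here is nothing more than matching the sample mean and standard deviation.
    model = NormalDist(mean(data), stdev(data))
    n = len(data)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        observed = sum(lo <= x < hi for x in data) / n
        expected = model.cdf(hi) - model.cdf(lo)
        rows.append((lo, hi, observed, expected))
    return rows


data = synthetic_bimodal(2000)
for lo, hi, observed, expected in binned_comparison(data, list(range(40, 101, 10))):
    print(f"{lo:>3}-{hi:<3}  observed {observed:.3f}   normal model {expected:.3f}")
```

With these assumed parameters, the fitted normal places much of its probability in the 60-70 bin, which is the trough between the two peaks, where the data are sparse. A table like this is a crude substitute for plotting the data against the model, but even it is enough to expose the mismatch.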

So What?

The philosopher George Santayana observed that “Fanaticism consists in redoubling your efforts when you have forgotten your aim.” [2] Many engineers have become fanatics about “probabilistics,” redoubling their efforts to find the “right” distribution while forgetting their aim: describing reality in a useful way.

All is not lost, of course. It’s just not as simple as choosing a probability density from some master catalogue, ignoring the quality of the fit, and rushing to calculate tail probabilities based on the erroneous model.
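To see how a poorly fitting model can mislead in the tails, here is a minimal sketch under stated assumptions. It uses hypothetical two-peaked data (again a made-up mixture of two normals, not real measurements), fits a single normal by sample mean and standard deviation, and compares the model's upper-tail probability with the fraction of data actually observed beyond the threshold. The names, the threshold of 90, and the mixture parameters are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def two_peak_sample(n, seed=1):
    """Hypothetical two-peaked data: a 35/65 mixture of two normals."""
    rng = random.Random(seed)
    return [rng.gauss(55, 5) if rng.random() < 0.35 else rng.gauss(80, 5)
            for _ in range(n)]


def upper_tail_estimates(data, threshold):
    """Return (empirical, normal-model) estimates of P(X > threshold)."""
    fitted = NormalDist(mean(data), stdev(data))
    empirical = sum(x > threshold for x in data) / len(data)
    modelled = 1.0 - fitted.cdf(threshold)
    return empirical, modelled


data = two_peak_sample(5000)
empirical, modelled = upper_tail_estimates(data, threshold=90)
print(f"P(X > 90): empirical {empirical:.4f}, fitted normal {modelled:.4f}")
```

Because the single normal must stretch its standard deviation to cover both peaks, it assigns several times more probability to the upper tail than this synthetic data set shows. The direction and size of the error depend entirely on the data and the threshold, which is exactly why the fit has to be checked before any tail probability is trusted.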

Probabilistic simulation isn’t quite as easy as it may seem either. For example

Notes:

  1. Many things in nature are collectively (jointly) random and thus cannot be accurately described one-at-a-time.
  2. George Santayana, The Life of Reason, Vol. 1: Reason in Common Sense, Scribner’s, 1905.