Other Goodness-of-Fit Measures

R² isn’t the only way to judge a model’s performance, although it is one of the most common. In standard situations, where the residual errors are uncorrelated and normally distributed, it can provide insight into the model’s utility.

But what of the many other real-world situations where the residual errors (what’s left after the model has done its job) are not normal, or not independent, or both?

First consider the choice of mean squared error as the fitting criterion. By its nature it will compromise by making the many smaller deviations larger so that the few larger ones (squared, so they’re really big now) can be made smaller. The resulting fit will be worse for many observations, but still “better” overall. Is that what’s best for your situation? (It may be.)
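
The compromise is easiest to see with the simplest possible model, a single constant. The sketch below uses made-up numbers (nine ordinary observations and one extreme one, purely an assumption for illustration) and compares the mean, which minimizes squared error, with the median, which minimizes absolute error.

```python
import numpy as np

# Hypothetical data: nine well-behaved observations and one extreme value.
y = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3, 10.0, 9.9, 25.0])

mean_fit = y.mean()        # the constant that minimizes squared error
median_fit = np.median(y)  # the constant that minimizes absolute error

abs_err_mean = np.abs(y - mean_fit)
abs_err_median = np.abs(y - median_fit)

# Sum of squared errors under each choice of constant.
sse_mean = np.sum(abs_err_mean ** 2)
sse_median = np.sum(abs_err_median ** 2)

# Number of observations that sit closer to the median than to the mean.
closer_to_median = int(np.sum(abs_err_median < abs_err_mean))

print(f"mean fit   = {mean_fit:.2f}, SSE = {sse_mean:.1f}")
print(f"median fit = {median_fit:.2f}, SSE = {sse_median:.1f}")
print(f"{closer_to_median} of {y.size} observations are closer to the median fit")
```

The mean wins on total squared error, yet it sits farther from every ordinary observation than the median does. Only the extreme value is better served by it.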

You may, however, prefer a “poorer” fit that makes a few larger errors even larger, but produces a better fit for the great majority of cases. In this situation the range of the model errors may be about the same, but most are clustered nearer to zero, a very desirable situation. But since R² is the fraction of the total squared variation explained by the model, such a model will have a lower R² despite its superior performance. What’s “better” or “worse” depends on your situation.
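
Here is a small sketch of that trade-off. The data are synthetic (a straight line with small wiggles and two grossly shifted points, all assumptions chosen for illustration), and the helper `r_squared` is a hypothetical name. Model A is the least-squares line; model B is the line that generated the bulk of the data, which we know only because we made the data up.

```python
import numpy as np

def r_squared(y, y_hat):
    """Fraction of the total sum of squares explained by the predictions."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: a line with small deterministic wiggles,
# plus two grossly shifted observations.
x = np.arange(20.0)
y = 1.0 + 2.0 * x + 0.5 * np.sin(x)
y[[17, 19]] += 30.0

# Model A: the least-squares line (np.polyfit returns slope, then intercept).
slope_a, intercept_a = np.polyfit(x, y, 1)
pred_a = intercept_a + slope_a * x

# Model B: the line behind the bulk of the data (known here by construction).
pred_b = 1.0 + 2.0 * x

for name, pred in [("A (least squares)", pred_a), ("B (bulk line)", pred_b)]:
    abs_err = np.abs(y - pred)
    print(f"{name}: R^2 = {r_squared(y, pred):.3f}, "
          f"median |error| = {np.median(abs_err):.2f}, "
          f"max |error| = {abs_err.max():.1f}")
```

Model A cannot lose on R², because least squares maximizes it among straight lines. Model B has the larger worst-case error but the smaller typical error.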

Then there is the problem of “outliers,” which are discussed elsewhere. These extreme observations can often ruin an otherwise good model because the least-squares criterion for determining the model parameters may be inappropriate. (The easy remedy – discard them – should be studiously avoided because they often are the most informative members of the dataset.)

There are several alternatives to ordinary least squares (OLS) that choose model parameters based on the central location of the data (the median and similar measures) rather than on the mean. These “MLE-like” estimators are not as efficient as MLEs, but are far less sensitive to extreme observations. (Statisticians, who have to have a word for everything, say they are more “robust.” You can almost see the rosy cheeks, can’t you?) While in many cases these would not be worth the additional difficulty of using them (they require special software, and greater care in model building), in cases where OLS is inadequate, these more sophisticated techniques are preferable.
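
To give a flavor of how such an estimator works, here is a minimal sketch of one of them: a straight-line fit with Huber weights, computed by iteratively reweighted least squares. This is one choice among many, not a recommendation. The function name `huber_line_fit`, the synthetic data, and the fixed iteration count are assumptions made for illustration; the tuning constant 1.345 is the conventional default for Huber weights.

```python
import numpy as np

def huber_line_fit(x, y, delta=1.345, n_iter=50):
    """Fit y ~ intercept + slope * x, downweighting large residuals.

    Returns (intercept, slope). Each pass rescales the residuals by a
    robust spread estimate and gives weight 1 to small residuals and
    weight delta / |scaled residual| to large ones.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        # Median absolute deviation, rescaled to match a normal std. dev.
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = np.abs(r) / scale
        w = delta / np.maximum(u, delta)  # equals 1 wherever u <= delta
        sw = np.sqrt(w)
        # Weighted least squares via row scaling by sqrt(weight).
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta[0], beta[1]

# Synthetic data: true slope 2, with two grossly shifted observations.
x = np.arange(20.0)
y = 1.0 + 2.0 * x + 0.5 * np.sin(x)
y[[17, 19]] += 30.0

slope_ols, intercept_ols = np.polyfit(x, y, 1)
intercept_rob, slope_rob = huber_line_fit(x, y)

print(f"OLS slope   = {slope_ols:.3f}")
print(f"Huber slope = {slope_rob:.3f}")
```

Note that the two extreme points are never discarded. They are simply given less say in where the line goes.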

Also to be considered are those situations where the errors are autocorrelated; that is, measurements closer together in time, or in physical distance, tend to be more similar than those further removed. OLS won’t work here either, but for different reasons. While the requirement for normal behavior of the residuals (“errors”) may still hold, the other necessary condition, that they be independent, does not. What then? There are two entire disciplines of applied statistics devoted to these topics: time series analysis, and spatial statistics or random fields.
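
A quick way to check whether this is your situation is to look at the residuals in their natural order. The sketch below simulates errors in which each value carries over 80% of the previous one (an assumed setup, chosen only for illustration), fits a line by OLS, and computes two common diagnostics: the lag-1 autocorrelation of the residuals and the Durbin-Watson statistic. A Durbin-Watson value near 2 suggests no first-order autocorrelation; values well below 2 suggest positive autocorrelation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = np.arange(n, dtype=float)

# Hypothetical autocorrelated errors: e[i] = phi * e[i-1] + fresh noise.
phi = 0.8
shocks = rng.normal(size=n)
e = np.zeros(n)
for i in range(1, n):
    e[i] = phi * e[i - 1] + shocks[i]

y = 3.0 + 0.5 * t + e

# Ordinary least-squares line and its residuals, kept in time order.
slope, intercept = np.polyfit(t, y, 1)
resid = y - (intercept + slope * t)

# Lag-1 autocorrelation: how strongly each residual resembles the previous one.
r1 = np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2)

# Durbin-Watson statistic: roughly 2 * (1 - r1).
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

print(f"lag-1 autocorrelation = {r1:.2f}")
print(f"Durbin-Watson         = {dw:.2f}")
```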

In all of these situations R² is an inappropriate metric, and models selected using it will necessarily suffer.


NOTE: The Anderson-Darling statistic is often used to assess how well a probability density model fits the data.
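
For the curious, here is a minimal sketch of the Anderson-Darling A² statistic for a sample compared against a fully specified distribution. The function name `anderson_darling` and the example data are assumptions for illustration. The sample is built from evenly spaced quantiles of an exponential distribution with rate 1, then compared against two candidate models: the right one (rate 1) and a wrong one (rate 2). Smaller values indicate a closer fit.

```python
import numpy as np

def anderson_darling(sample, cdf):
    """A^2 statistic for a sample against a fully specified CDF.

    A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) *
          [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))],
    where x_(1) <= ... <= x_(n) are the sorted observations.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    F = cdf(x)
    i = np.arange(1, n + 1)
    # F[::-1] pairs the i-th smallest value with the i-th largest.
    s = np.sum((2 * i - 1) * (np.log(F) + np.log(1.0 - F[::-1])))
    return -n - s / n

# Hypothetical sample: 50 evenly spaced quantiles of an exponential, rate 1.
n = 50
p = (np.arange(1, n + 1) - 0.5) / n
sample = -np.log(1.0 - p)

a2_rate1 = anderson_darling(sample, lambda x: 1.0 - np.exp(-x))
a2_rate2 = anderson_darling(sample, lambda x: 1.0 - np.exp(-2.0 * x))

print(f"A^2 against exponential, rate 1: {a2_rate1:.3f}")
print(f"A^2 against exponential, rate 2: {a2_rate2:.3f}")
```

Because the statistic weights the tails of the distribution heavily, it is sensitive to exactly the kind of misfit that the middle of the data tends to hide.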