“Fake Data”

Many decades ago, when I was a young engineer, I was called to explain why my analysis of some metal fatigue data was at odds with an analysis done by another engineer. It was quite clear that his presentation told a much prettier story than mine did.

In the meeting with my boss and his, it was immediately evident that his dataset contained many more observations than mine. A closer look showed that the new data shifted the fitted regression curve in a way that presented a rather tidy story; mine suggested another interpretation. The other fellow continued his presentation:

“The old data are the real data,” he explained, and the new material was “the Fake Data.” (Those were his words: “Fake Data.”)\(^1\) It wasn’t actual data, he went on; it was how he thought the data would behave had he actually run the experiments, which he hadn’t, since he felt their outcome would be obvious. That ended the meeting. Fake data is fake. Fake data doesn’t count.

Sadly, such “engineering data augmentation” isn’t a once-in-a-decade occurrence.

When engineers generate “Fake Data” they usually aren’t trying to mislead. They believe, mistakenly, that they can predict, with certainty,\(^2\) how a series of physical experiments would turn out, and thus see no need to spend the time and expense of actually performing them.

A recent example is adding “fake data” to nondestructive evaluation (NDE) tests to “anchor” the Probability of Detection (POD) curve. It happens like this:

The NDE experiment finds everything, which means it was a poorly designed experiment, incapable of determining the region where the probability of finding a crack depends on crack size. To rescue the experiment, the well-meaning, if dishonest, engineer adds misses at small crack sizes, since he “knows with certainty” that those would have been missed. (I’ve seen the same kind of thing with too few large cracks: “fake data” long cracks and phantom detections were added since they “certainly” would have been found.)

What’s wrong with that?

The purpose of NDE experiments is to determine the relationship of POD with crack size. If the data cannot describe that relationship and its associated confidence bounds, then more real data are required, because what is anticipated sometimes simply isn’t what is observed when the experiment is actually performed. There may be a POD “floor” and/or a POD “ceiling” that cannot be detected without real observations in those size ranges.
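The distortion is easy to demonstrate numerically. The sketch below (not from the original account; all crack sizes and hit/miss outcomes are invented) fits the common log-logistic POD model, \(POD(a) = 1/(1 + e^{-(b_0 + b_1 \ln a)})\), by maximum likelihood, then refits after appending fabricated “certain misses” at tiny sizes that were never actually inspected:

```python
# Sketch: maximum-likelihood fit of a log-logistic POD curve to invented
# hit/miss data, showing how fabricated "anchor" misses move the fitted curve.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_pod(sizes, hits, steps=50000, lr=0.1):
    """Fit POD(a) = sigmoid(b0 + b1*ln a) by gradient ascent on the
    Bernoulli log-likelihood (concave, so a small step size converges)."""
    xs = [math.log(a) for a in sizes]
    n = len(xs)
    b0, b1 = 0.0, 1.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, hits):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def a50(b0, b1):
    """Crack size at which the fitted POD is 50%."""
    return math.exp(-b0 / b1)

# Invented hit/miss data: 1 = crack found, 0 = missed (sizes in inches).
sizes = [0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.50, 0.80]
hits  = [0,    0,    0,    1,    0,    1,    1,    1]
b0, b1 = fit_pod(sizes, hits)

# "Anchoring": append fabricated misses at tiny sizes nobody inspected.
fake_sizes = sizes + [0.01, 0.015, 0.02]
fake_hits  = hits  + [0,    0,     0]
fb0, fb1 = fit_pod(fake_sizes, fake_hits)

print(f"real data:  a50 = {a50(b0, b1):.3f}")
print(f"with fakes: a50 = {a50(fb0, fb1):.3f}")
```

The fabricated points carry no information about detectability; they simply echo the analyst’s assumption back as a seemingly better-determined curve, and they tend to shrink the apparent confidence bounds, which is exactly the objection in footnote 2: the printed “significance” reflects invented observations, not physics.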

  1. “Fake Data” should not be confused with the statistical method of expectation-maximization (EM), which can sometimes compensate for observations that are missing at random. The technique is used to find maximum-likelihood estimates in parametric models from incomplete data.
  2. Valid statistical analysis assesses the influence of randomness on the conclusions being drawn. “Fake data” used to “anchor the curve” destroys that validity, reducing it to guesswork. But because the result comes with an impressive computer printout, including “significance” levels, it looks official, when it is meaningless statistical rubbish.