False Positives
The relationship between POD (Probability of Detection) and False Positives depends on more than the inspection itself. It also depends on the frequency of defectives in the population being inspected.
Your nondestructive evaluation (NDE) system signals a “hit.” Is it really a crack, or is it a “false positive”? Such a simple question – such a complicated answer!
Consider these two distinct inspection situations:
- You are performing an inspection on a test piece with known provenance:
  - Sensitivity, P(+|defect): “I know the part has a defect. What is the probability that the test will be positive?” This is the conditional probability of a positive response, given that a defect exists. (Conditional probabilities are written with a vertical bar separating the result from what it is conditioned on.)
  - Specificity, P(-|no defect): “I know the part does NOT have a defect. What is the probability that the test will be negative?”
- You have performed an inspection on a part with uncertain history:
  - Positive Predictive Value (PPV), P(defect|+): “I just got a positive test result from my NDE equipment. What is the probability that the part actually has a defect (of the size being inspected for)?”
  - Negative Predictive Value (NPV), P(no defect|-): “I just got a negative test result from my NDE equipment. What is the probability that the part is defect-free?”
The first thing to understand is that sensitivity and PPV are NOT the same, nor are specificity and NPV. Consider all possible outcomes of a generic inspection, summarized in Table 1:
|  | defect present (+) | defect absent (-) | Totals |
| --- | --- | --- | --- |
| Test positive (+) | a | b | a + b |
| Test negative (-) | c | d | c + d |
| Totals | a + c | b + d | a + b + c + d |

Table 1: Contingency Table of Inspection Outcomes
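Reading directly from the cells of Table 1, the four quantities are row- and column-conditional fractions:

$$
\begin{aligned}
\text{sensitivity} &= P(+\mid\text{defect}) = \frac{a}{a+c} \\
\text{specificity} &= P(-\mid\text{no defect}) = \frac{d}{b+d} \\
\text{PPV} &= P(\text{defect}\mid +) = \frac{a}{a+b} \\
\text{NPV} &= P(\text{no defect}\mid -) = \frac{d}{c+d}
\end{aligned}
$$

Sensitivity and specificity condition on the columns (the true state of the part); PPV and NPV condition on the rows (the test result).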
We will consider two numerical examples. The first is a “good” inspection, with sensitivity = 90% and specificity = 90%. The second is a coin toss, representing a random “inspection” where both are 50%. In these examples (Tables 2 and 3) the frequency of defects in the population being inspected is 0.3%, the same as the prevalence of AIDS in the US. (See Note 2, below.)
“Good” Inspection
|  | defect present (+) | defect absent (-) | Totals |
| --- | --- | --- | --- |
| Test positive (+) | 27 (0.9) | 997 (0.1) | 1024 |
| Test negative (-) | 3 (0.1) | 8973 (0.9) | 8976 |
| Totals | 30 | 9970 | 10000 |

Table 2: Contingency Table of Inspection Outcomes, “Good” Inspection (column fractions in parentheses)
| Measure | Value | Interpretation |
| --- | --- | --- |
| sensitivity, P(+\|defect) | 0.9 | true positive rate |
| specificity, P(-\|no defect) | 0.9 | true negative rate |
| PPV, P(defect\|+) | 0.02637 | fraction of positives with a defect |
| NPV, P(no defect\|-) | 0.99967 | fraction of negatives with no defect |
This is unexpected! The conditional probability of a defect, given a “hit,” is less than three percent! How could that happen?
Here’s why: the population has a very small prevalence of defects, P(defect) = 0.003 (again, the prevalence of AIDS in the US), so even with a false-call probability of only P(+|no defect) = 0.1, the false positives (997 in Table 2) vastly outnumber the true positives (27). Thus the fraction of positives that actually have the defect is small. (This is why “screening” physicians for AIDS is a bad idea: about 97% of those testing positive would not have AIDS, assuming a screening test with sensitivity = specificity = 90%. And re-testing wouldn’t improve the situation either, since repeated inspections would not be independent.)
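Bayes’ theorem makes the arithmetic explicit: the PPV weighs true positives against all positives, and the low prevalence shrinks the numerator:

$$
\text{PPV} = \frac{P(+\mid\text{defect})\,P(\text{defect})}{P(+\mid\text{defect})\,P(\text{defect}) + P(+\mid\text{no defect})\,P(\text{no defect})}
= \frac{0.9 \times 0.003}{0.9 \times 0.003 + 0.1 \times 0.997} \approx 0.0264
$$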
Why bother to inspect, then? Look closely at the NPV, the Negative Predictive Value, the fraction of parts correctly passed by the inspection: NPV = 0.99967. The test is doing what it is supposed to do (albeit helped considerably by the low defect rate). This inspection is roughly ten times more effective than a coin toss, as illustrated in Table 3.
Coin Toss (Random) Result
|  | defect present (+) | defect absent (-) | Totals |
| --- | --- | --- | --- |
| Test positive (+) | 15 (0.5) | 4985 (0.5) | 5000 |
| Test negative (-) | 15 (0.5) | 4985 (0.5) | 5000 |
| Totals | 30 | 9970 | 10000 |

Table 3: Contingency Table of Inspection Outcomes, Coin-Toss “Inspection” (column fractions in parentheses)
| Measure | Value | Interpretation |
| --- | --- | --- |
| sensitivity, P(+\|defect) | 0.5 | true positive rate |
| specificity, P(-\|no defect) | 0.5 | true negative rate |
| PPV, P(defect\|+) | 0.003 | fraction of positives with a defect |
| NPV, P(no defect\|-) | 0.997 | fraction of negatives with no defect |
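The arithmetic behind Tables 2 and 3 is easy to script. The following minimal Python sketch (the function name and layout are illustrative, not from the original) reproduces both tables’ counts and predictive values from prevalence, sensitivity, and specificity:

```python
def predictive_values(prevalence, sensitivity, specificity, n=10_000):
    """Contingency-table counts and PPV/NPV for an inspection of n parts."""
    defects = prevalence * n          # column total: parts with a defect (a + c)
    clean = n - defects               # column total: defect-free parts (b + d)
    tp = sensitivity * defects        # true positives, a
    fn = defects - tp                 # false negatives, c
    tn = specificity * clean          # true negatives, d
    fp = clean - tn                   # false positives, b
    ppv = tp / (tp + fp)              # P(defect | +)
    npv = tn / (tn + fn)              # P(no defect | -)
    return {"a": tp, "b": fp, "c": fn, "d": tn, "PPV": ppv, "NPV": npv}

# "Good" inspection (Table 2): PPV ~= 0.0264, NPV ~= 0.99967
print(predictive_values(0.003, 0.90, 0.90))

# Coin toss (Table 3): PPV = 0.003, NPV = 0.997
print(predictive_values(0.003, 0.50, 0.50))
```

Only the prevalence argument ties the result to the population; sensitivity and specificity are properties of the test alone.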
Result to Remember:
The sensitivity, POD|a (the Probability of Detection, given a target of size a), and the probability of a false call (= 1 − specificity) depend only on the test, while the PPV (positive predictive value) and the NPV (negative predictive value) depend on both the test and the population being tested.
Receiver Operating Characteristic (ROC) Curve:
Changing the decision criterion (threshold) can improve the POD (sensitivity), but at the expense of increased false calls (diminished specificity). A plot of sensitivity vs. 1 − specificity, called a Receiver Operating Characteristic curve, was popularized during World War II and still has advocates today, in spite of the fact that it cannot account for the frequency of defectives in the population, and thus ignores PPV and NPV.
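To see the threshold trade-off concretely, here is a minimal Python sketch. It assumes, purely for illustration (the distributions and numbers are not from this article), Gaussian NDE signal amplitudes with unit variance for defect-free and defective parts; sweeping the decision threshold traces out the ROC curve:

```python
from statistics import NormalDist

# Illustrative assumption: Gaussian signal amplitudes for each class.
noise = NormalDist(mu=0.0, sigma=1.0)    # defect-free parts
signal = NormalDist(mu=2.0, sigma=1.0)   # parts with a defect

# Sweep the decision threshold; any signal above it is called a "hit".
for threshold in (0.0, 0.5, 1.0, 1.5, 2.0):
    pod = 1.0 - signal.cdf(threshold)         # sensitivity, P(+ | defect)
    false_call = 1.0 - noise.cdf(threshold)   # 1 - specificity, P(+ | no defect)
    print(f"threshold={threshold:.1f}  POD={pod:.3f}  false-call rate={false_call:.3f}")

# The ROC curve is the locus of (false-call rate, POD) pairs as the
# threshold varies. Prevalence never enters the calculation -- which is
# exactly why the ROC can say nothing about PPV or NPV.
```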
Why was the ROC effective in WWII, yet hopelessly ineffective for contemporary inspections? (See Note 1.) In WWII the prevalence of targets in the population being observed was very high, say > 50%. (If you detected airplanes in bomber formation flying toward your coast, they were unlikely to be friendly.) In contemporary inspections the prevalence of defects is very, very low: 3 per 1000 for AIDS (Note 2), for example, and much lower for intrinsic material defects. Thus the PPV (positive predictive value) in WWII was high, but in contemporary inspections it is unacceptably low.
Notes:
1. In spite of its deficiencies, the ROC still has many advocates, largely because the literature has provided few alternatives, and because the underlying assumption of large prevalence is ignored. You ignore Mother Nature at your own peril.
2. Centers for Disease Control and Prevention, “Healthy People 2000, Final Review,” 2001. The 0.3% prevalence of AIDS is an estimate: 800,000–900,000 persons infected with HIV (p. 254); the US population is about 295 million. 900,000/(295 × 10⁶) ≈ 0.003.
Definitions:
- prevalence – The total number of cases of a disease in a given population at a specific time.
- incidence – The number of new cases of a disease in a population over a period of time. NDE engineers use the terms interchangeably to mean prevalence; medical doctors pay attention to the distinction.