
Simpson's Paradox
Simpson's Paradox is a situation in statistics where a trend appears in different groups of data but reverses or disappears when the groups are combined.
Imagine two tutoring centers, Center A and Center B, that help students pass an exam. Overall, Center B has a higher pass rate than Center A:
| Center | # Students | # Passed | Pass Rate |
|---|---|---|---|
| A | 100 | 43 | 43% |
| B | 100 | 62 | 62% |
At first glance, Center B seems to be more successful. But let's divide the students into two groups: those taking the exam for the first time and those retaking it. Center A has the higher pass rate among first-time takers, and it also has the higher pass rate among repeat takers.
| Group | Center | # Students | # Passed | Pass Rate |
|---|---|---|---|---|
| First-time takers | A | 80 | 28 | 35% |
| First-time takers | B | 20 | 6 | 30% |
| Repeat takers | A | 20 | 15 | 75% |
| Repeat takers | B | 80 | 56 | 70% |
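The arithmetic behind both tables can be checked with a short script. This is only a minimal sketch: the counts are copied from the tables above, while the names `counts`, `pass_rate`, `subgroup_rates`, and `overall_rates` are made up for illustration.

```python
# Counts copied from the tables above: (group, center) -> (students, passed)
counts = {
    ("First-time takers", "A"): (80, 28),
    ("First-time takers", "B"): (20, 6),
    ("Repeat takers", "A"): (20, 15),
    ("Repeat takers", "B"): (80, 56),
}


def pass_rate(students, passed):
    """Fraction of students who passed."""
    return passed / students


# Pass rate inside each subgroup
subgroup_rates = {key: pass_rate(*value) for key, value in counts.items()}

# Overall pass rate per center: add up the raw counts first, then divide
overall_rates = {}
for center in ("A", "B"):
    students = sum(n for (group, c), (n, p) in counts.items() if c == center)
    passed = sum(p for (group, c), (n, p) in counts.items() if c == center)
    overall_rates[center] = pass_rate(students, passed)

for (group, center), rate in subgroup_rates.items():
    print(f"{group:<18} Center {center}: {rate:.0%}")
for center, rate in overall_rates.items():
    print(f"{'Overall':<18} Center {center}: {rate:.0%}")
```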
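The reason for this apparent contradiction is that the two centers serve very different mixes of students: 80% of Center A's students are first-time takers, compared with only 20% of Center B's. Since first-time takers are much less likely to pass than repeat takers at either center, Center A ends up with a lower pass rate overall even though it does better within each group.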
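One way to see the role of the student mix is to write each center's overall pass rate as a weighted average of its subgroup rates, weighted by each subgroup's share of students. The sketch below does that and then tries a purely hypothetical scenario in which both centers have the same 50/50 mix. That scenario is an assumption added for illustration, not something from the data, and the names `rates`, `mix`, `overall_rate`, and `equal_mix` are likewise made up. With the actual mix, the weighted averages reproduce 43% and 62%; with the assumed equal mix, Center A comes out ahead at 55% versus 50%.

```python
# Subgroup pass rates and student shares, taken from the tables above
rates = {
    "A": {"first_time": 0.35, "repeat": 0.75},
    "B": {"first_time": 0.30, "repeat": 0.70},
}
mix = {
    "A": {"first_time": 0.80, "repeat": 0.20},
    "B": {"first_time": 0.20, "repeat": 0.80},
}


def overall_rate(center_rates, center_mix):
    """Weighted average of subgroup rates, weighted by share of students."""
    return sum(center_rates[g] * center_mix[g] for g in center_rates)


# Actual mix: reproduces the overall table
actual = {c: overall_rate(rates[c], mix[c]) for c in rates}

# Hypothetical equal mix (an assumption for illustration only)
equal_mix = {"first_time": 0.5, "repeat": 0.5}
standardized = {c: overall_rate(rates[c], equal_mix) for c in rates}

for c in rates:
    print(f"Center {c}: actual {actual[c]:.0%}, equal mix {standardized[c]:.0%}")
```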
Simpson's Paradox teaches us that an aggregate comparison can mislead when the groups being compared differ in composition, so it's important to analyze subgroups within data for possible hidden variables.
To see a real-life example of Simpson's Paradox, read my blog post about The Kidney Conundrum.