

P-Value Pitfalls
Megan Ayers
Math 141 | Spring 2026
Friday, Week 8
A hearty p-values discussion
Zoom out on statistical inference so far
Motivate theory-based inference
The p-value was originally intended as an informal measure for judging whether a result deserved a second look.
But in the effort to create simple statistical manuals for practitioners, this informal measure quickly hardened into a rule: “p-value < 0.05” = “statistically significant”.
What were/are the consequences of the “p-value < 0.05” = “statistically significant” rule?
A consequence: Researchers often put too much weight on the p-value and not enough on domain knowledge and the plausibility of their conjecture.
Read and discuss this xkcd comic with your neighbors

A consequence: P-hacking: Cherry-picking promising findings that are beyond this arbitrary threshold.
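A small simulation makes the danger concrete (in the spirit of the jelly-bean comic; all data here are made up). We test 20 effects that are all truly null, so any “significant” result is pure noise — yet at a 0.05 threshold we expect about one to slip through by chance:

```python
import numpy as np

rng = np.random.default_rng(141)

# Simulate a study with NO real effects: 20 "jelly bean colors,"
# each compared across two groups drawn from the SAME population.
n_tests, n = 20, 30
p_values = []
for _ in range(n_tests):
    group_a = rng.normal(0, 1, n)  # H0 is true for every test
    group_b = rng.normal(0, 1, n)
    obs = group_a.mean() - group_b.mean()

    # Permutation p-value: how often does shuffling group labels
    # produce a difference at least as extreme as the observed one?
    pooled = np.concatenate([group_a, group_b])
    perm_diffs = []
    for _ in range(1000):
        perm = rng.permutation(pooled)
        perm_diffs.append(perm[:n].mean() - perm[n:].mean())
    p_values.append(np.mean(np.abs(perm_diffs) >= abs(obs)))

# Cherry-picking only the "significant" results reports pure noise
n_sig = sum(p < 0.05 for p in p_values)
print(f"{n_sig} of {n_tests} tests fell below 0.05 by chance alone")
```

Reporting only the tests that cleared 0.05, while hiding the other ones, is exactly the cherry-picking described above.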
Distinguish the purpose of your analysis: confirmatory vs exploratory
Always present statistical context behind your conclusions: confidence intervals, p-values, analysis decisions, and assumptions
Multiple testing corrections
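One simple correction is the Bonferroni method: with \(m\) simultaneous tests, compare each p-value to \(\alpha / m\) rather than \(\alpha\). A minimal sketch with hypothetical p-values:

```python
# Hypothetical p-values from five simultaneous hypothesis tests
p_values = [0.004, 0.04, 0.03, 0.20, 0.65]
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value to alpha / m instead of alpha,
# which keeps the chance of ANY false positive at or below alpha
significant = [p <= alpha / m for p in p_values]
print(significant)  # [True, False, False, False, False]
```

Note that 0.04 and 0.03 would have passed the uncorrected 0.05 threshold but do not survive the correction.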
Example: A Nature study of 19,000+ recently married people found that those who meet their spouses online…
Are less likely to divorce (p-value < 0.002)
Are more likely to have high marital satisfaction (p-value < 0.001)
BUT the estimated effect sizes were tiny.
Q: Do these results provide compelling evidence that one should change their dating behavior?
The American Statistical Association created a set of principles to address misconceptions and misuse of p-values:
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true.
Scientific conclusions and business or policy decisions should not be based only on whether or not a p-value passes a specific threshold (e.g., 0.05).
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Despite these issues, p-values are still quite popular and can still be a useful tool when used properly.
In 2014, George Cobb, a professor at Mount Holyoke College, posed the following questions (and answers):

Understanding p-values and being able to interpret a p-value in context is a learning objective of Math 141.
Understanding that a small p-value means we have some evidence for \(H_a\).
Understanding that a small p-value alone does not imply practical significance.
Understanding that what counts as “small” should depend on your field and on whether a Type I Error or a Type II Error is worse for your particular research question.
Your ability to tell whether a number is less than 0.05 is not a learning objective for Math 141.
Statistical inference is the process of drawing conclusions about a population based on sample data.
Q: How do point estimates and confidence intervals help us make inferences?
Q: How does the hypothesis testing framework help us make inferences?
Working with samples provides a window into the population, but we can’t see it all
Overarching goal: distinguish between results that arise due to chance vs results that reflect a real, systematic pattern in the population
This led us to focus on sampling distributions, which we approximate using bootstrap distributions and resampling methods
We used these to accomplish two tasks:
Point + interval estimates: Our best guess + range of plausible values for a parameter
Hypothesis testing: A framework for deciding whether the data provide enough evidence to reject a null hypothesis in favor of an alternative hypothesis
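The first task can be sketched with a bootstrap percentile interval. This is a minimal example with made-up data (assuming only `numpy` is available):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample: 25 observations from some population of interest
sample = rng.normal(loc=7, scale=1.2, size=25)

# Bootstrap: resample WITH replacement many times to approximate
# the sampling distribution of the sample mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])

point_estimate = sample.mean()  # best guess for the population mean
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% CI
print(f"mean = {point_estimate:.2f}, "
      f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The middle 95% of the bootstrap distribution gives the range of plausible values for the parameter.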
These two approaches are two sides of the same coin!
Suppose we have \(H_0: \mu = 0\) and observe a test statistic from our sample.
Performing a two-sided hypothesis test with \(\alpha = 0.05\) is equivalent to computing a 95% confidence interval (CI) from our data and checking whether 0 falls inside it.
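A quick numerical check of this equivalence, using a normal-theory z test for simplicity (a bootstrap version would behave the same way) and simulated data:

```python
import numpy as np

rng = np.random.default_rng(141)
sample = rng.normal(loc=0.5, scale=1.0, size=40)  # simulated data

n = sample.size
xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

# Two-sided z test of H0: mu = 0 at alpha = 0.05
z = xbar / se
reject_h0 = abs(z) > 1.96

# 95% confidence interval for mu
ci_low, ci_high = xbar - 1.96 * se, xbar + 1.96 * se
zero_outside_ci = not (ci_low <= 0 <= ci_high)

# The two decisions agree for any sample:
# |xbar / se| > 1.96  <=>  0 lies outside xbar +/- 1.96 * se
print(reject_h0, zero_outside_ci)
```

Rejecting \(H_0\) at \(\alpha = 0.05\) happens exactly when 0 falls outside the 95% CI, because both conditions reduce to \(|\bar{x}/\text{SE}| > 1.96\).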


Question: How did folks do inference before computers?
Motivating question: How can we use theoretical probability models to approximate our (sampling) distributions?

Before we can answer that question and apply the models, we need to learn about the theoretical probability models themselves. We’ll turn to this after spring break!