

Study Design
Megan Ayers
Math 141 | Spring 2026
Friday, Week 3
The Mathematics and Statistics Department is currently in the process of drafting a schedule for next academic year. In order to have a better sense of how many sections of each class to offer, we would like to know your plans for next year.


Key questions:

Nonresponse bias: The respondents are systematically different from the non-respondents for the variables of interest.

In the 1936 Literary Digest poll, about 10 million people were surveyed and more than 2.4 million responded, with 57% indicating that they would vote for Republican Alf Landon in the upcoming presidential election rather than the incumbent, President Franklin Delano Roosevelt.
Non-response bias? Sample creation issues?

Use multiple modes (mail, phone, in-person) and multiple attempts for reaching sampled cases.
Explore key demographic variables to see how respondents and non-respondents vary.
In survey statistics, we can create survey weights to adjust for potential nonresponse bias.
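As a rough illustration of the survey-weight idea, here is a minimal sketch of a weighting-class adjustment applied to a course-planning survey like the one above. The class years, counts, and function names are all hypothetical, and the method assumes that respondents within a group resemble the non-respondents in that same group.

```python
from collections import Counter

def nonresponse_weights(sampled_groups, respondent_groups):
    """Weight for each group = (number sampled) / (number who responded)."""
    sampled = Counter(sampled_groups)
    responded = Counter(respondent_groups)
    return {g: sampled[g] / responded[g] for g in responded}

def weighted_mean(values, groups, weights):
    """Average of values, with each respondent counted by their group's weight."""
    total_weight = sum(weights[g] for g in groups)
    return sum(weights[g] * v for v, g in zip(values, groups)) / total_weight

# Hypothetical sample: 100 first-years and 100 seniors were sent the survey.
sampled_groups = ["first-year"] * 100 + ["senior"] * 100

# Hypothetical responses: 80 first-years replied (60 plan to take a stats class),
# but only 20 seniors replied (5 plan to take a stats class). 1 = yes, 0 = no.
respondent_groups = ["first-year"] * 80 + ["senior"] * 20
plans_stats = [1] * 60 + [0] * 20 + [1] * 5 + [0] * 15

weights = nonresponse_weights(sampled_groups, respondent_groups)
unweighted = sum(plans_stats) / len(plans_stats)
weighted = weighted_mean(plans_stats, respondent_groups, weights)

print(f"Unweighted estimate: {unweighted:.2f}")
print(f"Weighted estimate:   {weighted:.2f}")
```

The unweighted estimate leans toward the first-years because they responded at a higher rate. The weights count each senior respondent more heavily so that both class years contribute in proportion to the original sample.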
For our Literary Digest Example, Gallup predicted Roosevelt would win based on a survey of 50,000 people (instead of 2.4 million).
Quality over quantity!
Random sampling is important to ensure the sample is representative of the population.
Representativeness isn’t about size.
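To see why size alone does not help, here is a small simulation sketch. Everything in it is made up for illustration: a population with two groups that differ in their support for a candidate, a small simple random sample, and a much larger sample that over-represents one group.

```python
import random

random.seed(141)

# Hypothetical population of 200,000 voters:
# about 30% are in group A (30% support), the rest in group B (70% support).
population = []
for _ in range(200_000):
    in_group_a = random.random() < 0.30
    p_support = 0.30 if in_group_a else 0.70
    population.append((in_group_a, random.random() < p_support))

def support_rate(people):
    """Proportion of people in the list who support the candidate."""
    return sum(supports for _, supports in people) / len(people)

true_support = support_rate(population)

# Small simple random sample: every voter is equally likely to be chosen.
small_random = random.sample(population, 1_000)

# Large biased sample: 80% of it comes from group A,
# even though group A is only about 30% of the population.
group_a = [p for p in population if p[0]]
group_b = [p for p in population if not p[0]]
big_biased = random.sample(group_a, 40_000) + random.sample(group_b, 10_000)

est_random = support_rate(small_random)
est_biased = support_rate(big_biased)

print(f"True support:                 {true_support:.3f}")
print(f"Random sample (n = 1,000):    {est_random:.3f}")
print(f"Biased sample (n = 50,000):   {est_biased:.3f}")
```

In this setup the biased sample is 50 times larger yet lands much farther from the truth, because collecting more data from the same skewed source does not remove the skew.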
However, I bet most samples you will encounter won’t have arisen from a random mechanism.
How do we draw conclusions about the population from non-random samples?
Descriptive: Want to estimate quantities related to the population.
→ How many trees are in the Amazon?
Predictive: Want to predict the value of a variable.
→ Can I use remotely sensed data to predict forest types in the Amazon?
Causal: Want to determine if changes in a variable cause changes in another variable.
→ Do financial contracts prevent people from deforesting their land in the Amazon?
For these goals, we will differentiate between the roles of the variables:
Response variable: Variable I want to better understand
Explanatory/predictor variables: Variables I think might explain/predict the response variable
Q: What is the role of each variable for each goal?
→ How many trees are in the Amazon?
→ Can I use remotely sensed data to predict forest types in the Amazon?
→ Do financial contracts prevent people from deforesting their land in the Amazon?
Random assignment: Cases are randomly assigned to levels of the explanatory variable
Example: COVID Vaccine Trials
To study the effectiveness of the Moderna vaccine (mRNA-1273), researchers carried out a study on over 30,000 adult volunteers with no known previous COVID-19 infection. Volunteers were randomly assigned to receive either two doses of the vaccine or two shots of saline. The incidence of symptomatic COVID-19 was 94% lower in those who received the vaccine than in those who received saline.
Question: Why does random assignment allow us to conclude that this vaccine was effective at preventing (early strains of) COVID-19?
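One way to think about the question above is with a simulation sketch. The ages and the self-selection rule below are hypothetical, not data from the trial. The sketch compares how similar the two groups are in age when a coin flip decides who gets the vaccine versus when volunteers choose for themselves.

```python
import random
from statistics import mean

random.seed(2026)

# Hypothetical volunteers: 30,000 adults with ages between 18 and 90.
ages = [random.randint(18, 90) for _ in range(30_000)]

# Random assignment: shuffle, then split in half.
shuffled = ages[:]
random.shuffle(shuffled)
vaccine_ages = shuffled[:15_000]
saline_ages = shuffled[15_000:]
gap_random = abs(mean(vaccine_ages) - mean(saline_ages))

# Self-selection (assumed rule): older volunteers are more likely to choose the vaccine.
chose, declined = [], []
for age in ages:
    p_choose = (age - 18) / 72
    if random.random() < p_choose:
        chose.append(age)
    else:
        declined.append(age)
gap_self_selected = abs(mean(chose) - mean(declined))

print(f"Mean age gap with random assignment: {gap_random:.2f} years")
print(f"Mean age gap with self-selection:    {gap_self_selected:.2f} years")
```

With random assignment the groups end up with nearly the same average age, and the same holds (on average) for every other characteristic, measured or not. That leaves the treatment itself as the only systematic difference between the groups.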
We have data on the number of Methodist ministers in New England and the number of barrels of rum imported into Boston each year. The data range from 1860 to 1940.

Confounding variable: A third variable that is associated with both the explanatory variable and the response variable.
Unclear if the explanatory variable or the confounder (or some other variable) is causing changes in the response.
→ “Correlation does not imply causation.”
→ “Correlation does not imply not causation.”
Correlation does not imply causation … but sometimes there’s causation!
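The ministers-and-rum pattern can be mimicked with a short simulation sketch. The numbers are invented, and it assumes, purely for illustration, that population growth is the confounder driving both variables. Neither variable affects the other in this code.

```python
import random
from math import sqrt

random.seed(1860)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

years = list(range(1860, 1941))

# Hypothetical confounder: population (in thousands) grows steadily over time.
population = [1000 + 50 * (year - 1860) for year in years]

# Both variables depend on population plus independent noise, not on each other.
ministers = [0.02 * pop + random.gauss(0, 5) for pop in population]
rum_barrels = [3 * pop + random.gauss(0, 750) for pop in population]

r_raw = pearson(ministers, rum_barrels)

# Accounting for the confounder: compare per-capita values instead.
ministers_pc = [m / pop for m, pop in zip(ministers, population)]
rum_pc = [r / pop for r, pop in zip(rum_barrels, population)]
r_per_capita = pearson(ministers_pc, rum_pc)

print(f"Correlation of raw counts:       {r_raw:.2f}")
print(f"Correlation of per-capita rates: {r_per_capita:.2f}")
```

The raw counts are strongly correlated even though there is no causal link between them in the simulation. Once the shared driver is accounted for, most of that association goes away.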
Observational study: A study in which the researchers don’t actively control the value of any variable, but simply observe the values as they naturally exist.
Example: Hand washing study
Experiment: A study in which the researcher actively controls one or more of the explanatory variables through random assignment.
Example: COVID Trial
Common features:
Random assignment allows you to explore causal relationships between your explanatory variables and the response variable by removing the possibility of a confounding variable.
How do we draw causal conclusions from studies without random assignment?
But also consider the goals of your analysis. Often the research question isn’t causal.
Bottom Line: We often have to use imperfect data to make decisions. But whenever possible, use random assignment or find pseudo-random contexts for the strongest causal claims.
John Snow was a 19th-century physician who is considered a founder of modern epidemiology. In 1854, he was investigating the drivers of a cholera epidemic in London. He suspected that contaminated water was the key cause. He realized that there seemed to be no rhyme or reason to which water source a home was connected to, and that one source was near sewage collection points while the other was further upstream. By surveying residents of homes across London, Snow created a data set that allowed him to compare cholera outcomes between homes served by each source.