Data Collection and Sampling

Megan Ayers

Math 141 | Spring 2026
Wednesday, Week 3

Announcements/Reminders

Goals for Today

Discuss principles of data collection/acquisition
Investigate 3 methods of drawing random samples

When to Get Coding Help

😢 “I have no idea how to do this problem.”

→ Ask someone to point you to a similar example from the lecture or readings.

→ Talk it through with a course assistant, a fellow Math 141 student, or Megan so together we can verbalize the process of going from Q to A.

😡 “I am getting a weird error but really think my code is correct/on the right track/matches the examples from class.”

→ It is time for a second pair of eyes. Don’t stare at the error for over 10 minutes.

🤩 And lots of other times too! 😬

When to Get Help

Remember:

→ Struggling is part of learning.

→ But let us help you ensure it is a productive struggle.

→ Struggling does NOT mean you are bad at stats; it actually means you are doing the work to learn the material!

Now for Data Collection

Who are the data supposed to represent?

Every statistical investigation should clearly identify and compare:

The population to be studied
The sample from which measurements (data) will be taken

Key questions:

What evidence is there that the sample is representative of the population?
Who is present? Who is absent?
Who is overrepresented? Who is underrepresented?

In a census, we have data on the entire population!

But usually, we don’t have the money, time, or ability to do this.

Instead, we use a sample of the population, and use the sample to draw conclusions about the population.

Sampling Bias

Sampling bias: When certain individuals are more likely to be sampled than others

Q: Consider a telephone poll for an election - where might we get sampling bias?

Non-response: individual can’t or won’t contribute
Undercoverage: some groups are less likely to be called
Inaccurate response
Self-selection: membership in the sample is voluntary
Convenience: selecting a convenient but non-representative block to sample
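
To make undercoverage concrete, here is a minimal Python sketch. Every number in it is invented for illustration (the share of voters with a listed phone, the support rates, the poll size); none of it comes from a real poll. It assumes that voters without a listed phone support a candidate at a different rate than voters with one, and shows what a poll that can only reach listed phones would report.

```python
import random

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical population (all numbers invented for illustration):
# 70% of voters have a listed phone, and support differs between the groups.
population = []
for _ in range(100_000):
    has_phone = random.random() < 0.70
    p_support = 0.35 if has_phone else 0.70
    population.append({"has_phone": has_phone,
                       "supports": random.random() < p_support})

def support_rate(people):
    """Proportion of people in the list who support the candidate."""
    return sum(p["supports"] for p in people) / len(people)

# A phone poll can only reach voters with a listed phone (undercoverage).
reachable = [p for p in population if p["has_phone"]]
phone_poll = random.sample(reachable, 1_000)

true_rate = support_rate(population)
poll_rate = support_rate(phone_poll)
print(f"True support in the population: {true_rate:.3f}")
print(f"Phone poll estimate:            {poll_rate:.3f}")
```

Under these assumptions the poll is drawn at random, but only from the reachable group, so its estimate tracks that group's support rate rather than the population's.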

Sampling Bias Example

The Literary Digest was a political magazine that correctly predicted the outcome of every presidential election from 1916 to 1932. In 1936, it conducted what was then the most extensive public opinion poll to date, mailing questionnaires to over 10 million people (about 1/3 of US households) whose names and addresses were obtained from telephone books and vehicle registration lists.

Population of Interest:

Sample:

Sampling bias:

Sampling Bias Example

We want to know how Portlanders feel about a new coffee shop in Woodstock.

The coffee shop has a Yelp rating of 3.5/5 stars with 10 reviews.

Q1: Can we conclude that a typical Portlander would rate this coffee shop at 3.5 stars?

Q2: What sources of bias are present in this sample?

Q3: A year later, the coffee shop still has 3.5 stars, but 1000 reviews. Does the verdict change?

Q4: A second coffee shop opens up nearby with a Yelp rating of 4 stars and 1000 reviews. Can we conclude Portlanders prefer the second coffee shop to the first?

Random Sampling

Use random sampling (a random mechanism for selecting cases from the population) to remove sampling bias.

Types of random sampling

We’ll explore 3 types of random sampling:

Simple random sampling
Stratified random sampling
Cluster sampling

Simple Random Sampling

Motivating question: what is the average amount of student loan debt in Oregon?

Simple Random Sampling: Imagine that a unique ID for each student in the population is written on a slip of paper…

Shuffle the slips of paper in a bowl
Draw \(n\) IDs/slips one-by-one to create a sample
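
The slips-of-paper procedure can be sketched in Python with the standard library's random.sample, which draws distinct items without replacement. The population size, the ID range, and the sample size below are hypothetical placeholders, not figures from a real student database.

```python
import random

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical population: one unique ID per student (the "slips of paper").
student_ids = list(range(1, 50_001))

n = 100
# random.sample draws n distinct IDs without replacement, with every
# subset of size n equally likely, like drawing slips from a shuffled bowl.
srs = random.sample(student_ids, n)

print(srs[:5])  # the first few sampled IDs
```

In practice the hard part is usually not the draw itself but obtaining the complete list of IDs to draw from.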

Simple Random Sampling

Consequences:

Every member of the population has an equal chance of being selected for the sample
There is no inherent correlation between any two members of the sample

Q: Can a simple random sample be non-representative?

A: Yes, even if all goes as planned!

For large sample sizes, it’s unlikely
The sample will be representative on average
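
A small simulation can illustrate both points. This is a sketch under made-up assumptions: the "debt" values are simulated from an exponential distribution with a mean of about $20,000, chosen only to give a skewed population, and the sample sizes of 10 and 500 are arbitrary.

```python
import random
import statistics

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical, right-skewed debt values in dollars (simulated, not real data).
population = [random.expovariate(1 / 20_000) for _ in range(50_000)]
pop_mean = statistics.fmean(population)

def srs_mean(n):
    """Mean debt in one simple random sample of size n."""
    return statistics.fmean(random.sample(population, n))

# Repeat the sampling many times at a small and a large sample size.
small = [srs_mean(10) for _ in range(2_000)]
large = [srs_mean(500) for _ in range(2_000)]

print(f"Population mean: {pop_mean:,.0f}")
for label, means in [("n = 10", small), ("n = 500", large)]:
    print(f"{label}: average of sample means = {statistics.fmean(means):,.0f}, "
          f"spread (SD) = {statistics.stdev(means):,.0f}")
```

Both sets of sample means center near the population mean, but individual samples of size 10 land much farther from it than samples of size 500 do.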

Q: Why aren’t all samples generated using simple random sampling?

Simple Random Sampling

Advantages:

Relatively simple to interpret and analyze
Unbiased (in theory)

Disadvantages:

May not be as “precise” as other sampling techniques
Can be difficult to perform in practice

Stratified Random Sampling

Motivating question: what is the average amount of student loan debt in Oregon?

Stratified Random Sampling: The population is divided into “strata” made up of similar individuals, and a simple random sample is then taken from each stratum.

e.g. define strata based on public vs private college and family income ranges
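
A minimal Python sketch of the idea, using a single stratifying variable (public vs private college) for brevity. The sampling frame is simulated, and the 2% sampling fraction with proportional allocation is one possible choice assumed here, not the only way to allocate a stratified sample.

```python
import random
from collections import defaultdict

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame of (student_id, college_type) pairs.
frame = [(i, "public" if random.random() < 0.75 else "private")
         for i in range(10_000)]

# Group students into strata by college type.
strata = defaultdict(list)
for student_id, college_type in frame:
    strata[college_type].append(student_id)

# Proportional allocation: take the same fraction from every stratum,
# using a simple random sample within each one.
fraction = 0.02
stratified_sample = {
    name: random.sample(members, round(fraction * len(members)))
    for name, members in strata.items()
}

for name, chosen in stratified_sample.items():
    print(f"{name}: stratum size {len(strata[name])}, sampled {len(chosen)}")
```

Because each stratum is sampled separately, both college types are guaranteed to appear in the sample in roughly their population proportions.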

Advantages:

Can be more “precise” than simple random sampling, so a smaller sample size may suffice
Hedges against non-representative samples

Disadvantages:

Statistical analysis is more complex
Strata creation isn’t always straightforward (need additional data, and to have compelling reasons for strata definitions)
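
The precision claim can be checked with a small simulation. This is a sketch under invented assumptions: two strata (public and private) whose average debt differs substantially, normally distributed debt within each stratum, a total sample size of 40, and proportional allocation. The gain shown depends on how different the strata really are.

```python
import random
import statistics

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical population: two strata with very different debt levels.
public = [random.gauss(15_000, 3_000) for _ in range(7_500)]
private = [random.gauss(35_000, 3_000) for _ in range(2_500)]
population = public + private

def srs_estimate(n):
    """Mean of one simple random sample from the whole population."""
    return statistics.fmean(random.sample(population, n))

def stratified_estimate(n):
    """Mean of one proportionally allocated stratified sample (75% / 25%)."""
    n_public = round(0.75 * n)
    n_private = n - n_public
    sample = random.sample(public, n_public) + random.sample(private, n_private)
    return statistics.fmean(sample)

# Repeat each design many times and compare how much the estimates vary.
srs_sd = statistics.stdev(srs_estimate(40) for _ in range(2_000))
strat_sd = statistics.stdev(stratified_estimate(40) for _ in range(2_000))

print(f"SD of SRS estimates:        {srs_sd:,.0f}")
print(f"SD of stratified estimates: {strat_sd:,.0f}")
```

With proportional allocation the plain mean of the combined sample already weights each stratum correctly. Other allocations would need explicit stratum weights, which is part of why the analysis is more complex.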

Cluster Random Sampling

Motivating question: what is the average amount of student loan debt in Oregon?

Cluster Random Sampling: The population is divided into “clusters,” each of which is internally heterogeneous (ideally resembling the population as a whole). We take a simple random sample of the clusters and use all observations in those clusters as the sample.

e.g. we take a simple random sample of schools and include all students in those schools in the sample
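
The schools example can be sketched in Python as follows. The 40 schools, their enrollments, and the choice of 5 sampled schools are all hypothetical values picked for illustration.

```python
import random

random.seed(141)  # fixed seed so the sketch is reproducible

# Hypothetical clusters: 40 schools, each with 50 to 300 students.
schools = {
    f"school_{k}": [f"student_{k}_{j}" for j in range(random.randint(50, 300))]
    for k in range(40)
}

# Step 1: take a simple random sample of clusters (schools), not students.
sampled_schools = random.sample(sorted(schools), 5)

# Step 2: include every student in each sampled school.
cluster_sample = [student
                  for school in sampled_schools
                  for student in schools[school]]

print(sampled_schools)
print(len(cluster_sample))  # total sample size depends on which schools are drawn
```

Only a list of schools is needed up front, not a list of every student, which is what makes this design practical when the full population is hard to enumerate.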

Advantages:

Useful when it’s difficult/impossible to exhaustively list the population
Often more time/cost effective per \(n\)
Useful when population is naturally concentrated in heterogeneous groups

Disadvantages:

Often less precise than simple or stratified sampling for the same sample size
Statistical analysis is more complicated
Natural clusters may not always exist

National Health and Nutrition Examination Survey

Mission: “Assess the health and nutritional status of adults and children in the United States.”

How are these data collected?

NHANES Sampling Design

Stage 1: The US is stratified by geography and distribution of minority populations. Counties are randomly selected within each stratum.
Stage 2: From the sampled counties, city blocks are randomly selected. (City blocks are clusters.)
Stage 3: From sampled city blocks, households are randomly selected. (Households are clusters.)
Stage 4: From sampled households, people are randomly selected. For the sampled households, a mobile health vehicle goes to the house and medical professionals take the necessary measurements.
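
The nesting of the four stages can be sketched in Python. This is only a toy illustration of multistage sampling: the frame is simulated, and the counts at every stage (3 strata, 2 counties per stratum, 2 blocks per county, 3 households per block, 1 person per household) are invented here and are not NHANES's actual design parameters.

```python
import random

random.seed(141)  # fixed seed so the sketch is reproducible

# Build a small hypothetical frame (all sizes invented):
# 3 strata x 4 counties x 6 blocks x 10 households, 1-5 people per household.
frame = {}
for s in range(3):
    counties = {}
    for c in range(4):
        blocks = {}
        for b in range(6):
            blocks[f"block_{s}_{c}_{b}"] = {
                f"house_{s}_{c}_{b}_{h}": [
                    f"person_{s}_{c}_{b}_{h}_{p}"
                    for p in range(random.randint(1, 5))
                ]
                for h in range(10)
            }
        counties[f"county_{s}_{c}"] = blocks
    frame[f"stratum_{s}"] = counties

people_sampled = []
for stratum, counties in frame.items():
    # Stage 1: sample counties within every stratum.
    for county in random.sample(sorted(counties), 2):
        blocks = counties[county]
        # Stage 2: sample blocks within each sampled county.
        for block in random.sample(sorted(blocks), 2):
            households = blocks[block]
            # Stage 3: sample households within each sampled block.
            for house in random.sample(sorted(households), 3):
                # Stage 4: sample one person within each sampled household.
                people_sampled.append(random.choice(households[house]))

print(len(people_sampled))  # 3 strata * 2 counties * 2 blocks * 3 households = 36
```

Note that at no stage is a list of every person in the country required; each stage only needs a list of units inside the units already selected.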

Why don’t they use simple random sampling?

Careful Using Non-Simple Random Sample Data

Detour: Data Ethics

Data Ethics

“Good statistical practice is fundamentally based on transparent assumptions, reproducible results, and valid interpretations.” – Committee on Professional Ethics of the American Statistical Association (ASA)

The ASA has created “Ethical Guidelines for Statistical Practice”

→ These guidelines are for EVERYONE doing statistical work.

→ There are ethical decisions at all steps of the Data Analysis Process.

→ We will periodically refer to specific guidelines throughout this class.

“Above all, professionalism in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics in pursuit of unethical ends is inherently unethical.”

Responsibilities to Research Subjects

“The ethical statistician protects and respects the rights and interests of human and animal subjects at all stages of their involvement in a project. This includes respondents to the census or to surveys, those whose data are contained in administrative records, and subjects of physically or psychologically invasive research.”

Responsibilities to Research Subjects

Why do you think the Age variable in the NHANES data maxes out at 80?

“Protects the privacy and confidentiality of research subjects and data concerning them, whether obtained from the subjects directly, other persons, or existing records.”

Next time:

Study design