Sampling and Bootstrap Distributions



Megan Ayers

Math 141 | Spring 2026
Friday, Week 6

ASA DataFest 2026

  • Are you interested in data science and looking for a challenge? DataFest is an exciting opportunity to work with real-world data, collaborate with peers, and gain valuable experience!

  • Friday April 17 - Sunday April 19 at Willamette

  • To learn more and sign up, visit https://my.willamette.edu/site/computer-science/data-fest

Goals for Today

  • Complete a worksheet comparing sampling and bootstrap distributions

  • Review our answers together

Activity

Instructions

  • In small groups, carefully complete the worksheet


  • When you are finished, submit on Gradescope for a small completion grade
    • You may work on HW 5 or midterm review if you have extra time!


  • We’ll come together to discuss with 10-15 minutes remaining

Question 1

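# Code on these slides assumes the tidyverse and infer packages are loaded (for %>%, ggplot, rep_sample_n)
deck <- data.frame(cards = rep(1:13, each = 4))  # 52 cards: values 1-13, four of each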

Q1: In Figure 1, why is the height of each bar 4? Describe this distribution.

  • There are four cards of each value in a deck.

  • The distribution has a “flat” shape, centered at 7 (mean = 7 and median = 7).

  • The distribution has a “large” spread. Standard Deviation: 3.78
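These numbers can be checked with a few lines of base R. This is only a sketch that rebuilds the deck from the code above. Note that sd() divides by n - 1, which is what gives 3.78 here; dividing by n = 52 instead gives about 3.74.

```r
# Rebuild the deck from the slide: values 1-13, four cards of each value
deck <- data.frame(cards = rep(1:13, each = 4))

# Each value appears 4 times, which is why every bar in Figure 1 has height 4
counts <- table(deck$cards)
print(counts)

# Center and spread of the population of card values
deck_mean   <- mean(deck$cards)
deck_median <- median(deck$cards)
deck_sd     <- sd(deck$cards)   # sd() divides by n - 1
print(c(mean = deck_mean, median = deck_median, sd = round(deck_sd, 2)))
```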

Question 2

set.seed(1) # ensures we all get the same "random" sample
single_sample <- deck %>% rep_sample_n(size = 10, replace = FALSE, reps = 1)
single_sample$cards
 [1]  1 10  1  9  6 11  4  5  9  6

Q2: Based on the code, which cards are in our sample? Use the cards to calculate our sample statistic (the sample mean) based on this sample.

  • We have two Aces (1s), a 4, a 5, two 6s, two 9s, a 10, and a Jack (11).

  • The sample mean is 6.2.
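The arithmetic behind the sample mean can be written out directly. The sketch below types in the ten card values printed above rather than re-running the sampling code, so it does not depend on any package.

```r
# The ten cards printed by single_sample$cards on the slide
cards <- c(1, 10, 1, 9, 6, 11, 4, 5, 9, 6)

# Sorting makes it easier to read off which cards we drew
print(sort(cards))

# Sample mean = sum of the values / number of cards = 62 / 10
sample_mean <- sum(cards) / length(cards)
print(sample_mean)
```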

Question 3

deck %>% 
  rep_sample_n(size = 10, 
               replace = FALSE, reps = 50000) %>%
  group_by(replicate) %>% 
  summarize(x_bar = mean(cards)) %>%
  ggplot(aes(x = x_bar)) + 
  geom_histogram(binwidth = 0.2) + theme_bw() +
  labs(x = "Mean Card Value from Many Samples",
       title = "Fig. 2: Sampling Distribution")

Q3: Based on the code above, how many cards are in each sample? How many different samples did we take to create the sampling distribution in Figure 2?

  • There are 10 cards in each sample, since size = 10.

  • We took 50,000 different samples, since reps = 50000.
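To see how size and reps map onto the data, here is a rough base R analogue of the pipeline above. It uses replicate() and sample() in place of rep_sample_n(). The seed and the smaller number of reps (1000 instead of 50000) are arbitrary choices for this sketch.

```r
set.seed(141)  # arbitrary seed, only so the sketch is reproducible
deck <- data.frame(cards = rep(1:13, each = 4))

n_cards <- 10    # plays the role of size = 10
n_reps  <- 1000  # plays the role of reps (kept small so this runs instantly)

# Each column is one sample of 10 cards drawn without replacement
samples <- replicate(n_reps, sample(deck$cards, size = n_cards, replace = FALSE))
print(dim(samples))  # rows = cards per sample, columns = number of samples

# One sample mean per column, like group_by(replicate) + summarize()
x_bar <- colMeans(samples)
print(length(x_bar))
```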

Question 4

Q4: Figure 3 (below) displays sampling distributions for samples of size n=10, n=20, and n=40. How are they similar and how are they different? Why do larger samples have sampling distributions with less variability?

  • All three are bell-shaped and centered at the true mean, 7.

  • They differ in their spread: the \(n=10\) distribution has the largest standard error; the \(n=40\) distribution has the smallest standard error.

  • Why? If our sample size (\(n\)) is larger, we have more data and our sample should be “more representative” of the population.

  • i.e., more data gives a better glimpse of the true population, so sample means land closer to the population mean and vary less from sample to sample.
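The shrinking spread can also be checked numerically. The sketch below simulates the standard error for each sample size in base R, then compares it with the textbook formula for sampling without replacement from a finite population. That formula, with its finite population correction, is an addition here and not part of the worksheet. The seed and number of reps are arbitrary.

```r
set.seed(141)  # arbitrary seed for reproducibility
deck <- data.frame(cards = rep(1:13, each = 4))
sizes  <- c(10, 20, 40)
n_reps <- 5000

# Simulated standard error: SD of many sample means, for each sample size
se_sim <- sapply(sizes, function(n) {
  sd(replicate(n_reps, mean(sample(deck$cards, size = n, replace = FALSE))))
})

# Theoretical SE when sampling without replacement from N = 52 cards:
# sigma / sqrt(n) times the finite population correction sqrt((N - n) / (N - 1))
N <- nrow(deck)
sigma <- sqrt(mean((deck$cards - mean(deck$cards))^2))  # population SD (divide by N)
se_theory <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

print(data.frame(n = sizes,
                 se_sim = round(se_sim, 3),
                 se_theory = round(se_theory, 3)))
```

Both columns should decrease as n grows, matching the narrowing histograms in Figure 3.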

Questions 5 and 6

bootstrap_sample <- single_sample %>% rep_sample_n(size = 10, replace = TRUE, reps = 1)
bootstrap_sample$cards
 [1] 11  6  9  1  5  6  1  1  9  9

Q5: In the code above, what are we sampling from (the population, or the single sample)? Are we sampling with replacement? What’s our sample size?

  • We’re sampling from the single sample (single_sample).

  • We’re sampling with replacement (replace = TRUE).

  • Our sample size is still 10 (size = 10).

Q6: How would your answers to Q5 be different if we were talking about sampling for a sampling distribution?

  • We sample from the population.

  • We sample without replacement.

  • Our sample size is still 10!
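The contrast between Q5 and Q6 comes down to two arguments: what we draw from, and whether we replace. A minimal base R sketch follows, using sample() instead of rep_sample_n(). The seed is arbitrary and the single sample is typed in from the slide above.

```r
set.seed(141)  # arbitrary seed for reproducibility
deck <- data.frame(cards = rep(1:13, each = 4))
single_cards <- c(1, 10, 1, 9, 6, 11, 4, 5, 9, 6)  # the single sample from Q2

# Sampling distribution step: draw from the population, without replacement
from_population <- sample(deck$cards, size = 10, replace = FALSE)

# Bootstrap step: draw from the single sample, with replacement
from_sample <- sample(single_cards, size = 10, replace = TRUE)

print(from_population)
print(from_sample)

# A bootstrap sample can only contain values that were in the original sample
print(all(from_sample %in% single_cards))
```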

Question 7

single_sample %>% ungroup() %>% select(cards) %>%
  rep_sample_n(size = 10, replace = TRUE, reps = 20000) %>%
  group_by(replicate) %>% 
  summarize(x_bar = mean(cards)) %>%
  ggplot(aes(x = x_bar)) + geom_histogram(binwidth = 0.1) +
  labs(x = "Mean Card Value Based on Samples of Size 10",
       title = "Fig. 4: Bootstrap Distribution") + 
  theme_bw() +
  scale_x_continuous(breaks = 1:13, limits = c(1, 13))

Q7: Based on the code above, how many bootstrap samples are we taking to create the bootstrap distribution?

  • We take 20,000 bootstrap samples, since reps = 20000.

Question 8

Q8: Which sampling distribution looks most like the bootstrap distribution in Figure 4? How specifically is it similar or different? Consider the shape, center, and spread of each distribution.

  • The \(n=10\) sampling distribution!

  • Shape: All distributions look bell-shaped, so this doesn’t distinguish them at all.

  • Center: Bootstrap is centered at the sample mean (6.2); all sampling distributions are centered at the population mean (7).

  • Spread: The spread of the bootstrap distribution looks similar to the spread of the sampling distribution with \(n=10\).
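The center and spread comparison can be put into numbers. The sketch below builds both distributions in base R, with fewer reps than the slides and an arbitrary seed, and summarizes each one. The single sample is typed in from Q2.

```r
set.seed(141)  # arbitrary seed for reproducibility
deck <- data.frame(cards = rep(1:13, each = 4))
single_cards <- c(1, 10, 1, 9, 6, 11, 4, 5, 9, 6)  # the single sample from Q2
n_reps <- 5000

# Sampling distribution: many samples of 10 from the deck, without replacement
samp_means <- replicate(n_reps, mean(sample(deck$cards, size = 10, replace = FALSE)))

# Bootstrap distribution: many resamples of 10 from the single sample, with replacement
boot_means <- replicate(n_reps, mean(sample(single_cards, size = 10, replace = TRUE)))

comparison <- data.frame(
  distribution = c("sampling (n = 10)", "bootstrap"),
  center = c(mean(samp_means), mean(boot_means)),
  spread = c(sd(samp_means), sd(boot_means))
)
print(comparison)
```

The centers should sit near 7 and 6.2 respectively, while the two spreads should be close to each other.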

Question 9

Q9: Thinking more generally, compare and contrast sampling distributions and bootstrap distributions.

  • Sampling distributions and bootstrap distributions both help us conceptualize the distribution of a statistic, for the purpose of understanding plausible values for a parameter.

  • Sampling distributions and bootstrap distributions should have similar spread (e.g., a similar standard deviation).

  • Their centers are usually close but not identical: a sampling distribution is centered at the true parameter, while a bootstrap distribution is centered at the observed sample statistic (here, the sample mean).

  • Sampling distributions are usually not possible to obtain (we can only take one sample from the population). Bootstrap distributions approximate a sampling distribution, and are super easy to obtain (especially with code).