Linear Models III: Categorical Predictors

Megan Ayers

Math 141 | Spring 2026
Friday, Week 4

Reminders/Announcements

Please fill out the Week 4 feedback survey (link in Slack)
Wednesday: we will have a short and completion-based learning check about coefficient interpretation. Please arrive on time!

Goals for Today

Recap: Simple linear regression model
Broadening our idea of linear regression

Regression with a single, binary categorical explanatory variable
Regression with a single categorical explanatory variable with more than 2 levels

Simple Linear Regression

So far we’ve considered this model when:

Response variable \((y)\): quantitative
Explanatory variable \((x)\): quantitative
Have only ONE explanatory variable.
AND, \(f()\) can be approximated by a line:

\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]

Linear Regression

Linear regression is a flexible class of models that allow for:

Both quantitative and categorical explanatory variables.
Multiple explanatory variables.
In all cases, the response variable is quantitative.

Activity: Brunch Wait Times

You’re planning when to head to brunch, and want to understand how long you should expect to wait for a table. Info from your previous experiences is below:

Day	Wait (min)	Arrival
1	10	Early
2	0	Early
3	5	Early
4	10	Late
5	20	Late
6	15	Late

Q1: How long did you typically wait for a table?
Q2: You think wait time varies by arrival time. Calculate average wait by arrival time (early or late).
Q3: How much longer did you wait when you arrived late rather than early, on average?
Q4: Q3 can be re-framed as a simple linear regression! What are the explanatory and response variables? How is this different from regressions we’ve seen so far?

Activity: Brunch Wait Times

The simplest model would predict wait time using a constant value:

\[ y = \beta_0 + \epsilon \]

Q: What \(\widehat{\beta}_0\) would minimize the sum of the squared residuals?
A: \(\widehat{\beta}_0 = \bar{y}\), the sample mean!
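One way to see why, sketched with calculus (not required for this course): write the sum of squared residuals as a function of the constant \(\beta_0\), set its derivative to zero, and solve.

\[ \begin{aligned} \text{SSR}(\beta_0) &= \sum_{i=1}^{n} (y_i - \beta_0)^2 \\ \frac{d\,\text{SSR}}{d\beta_0} &= -2 \sum_{i=1}^{n} (y_i - \beta_0) = 0 \\ \widehat{\beta}_0 &= \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} \end{aligned} \]

The second derivative is \(2n > 0\), so this is a minimum. For the brunch data, \(\bar{y} = (10 + 0 + 5 + 10 + 20 + 15)/6 = 10\) minutes.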

We can make a slightly more complicated model:

\[ y = \beta_0 + \beta_1 x_\text{(arrived late)} + \epsilon \] where \(x_\text{(arrived late)}\) is 1 if we arrived late and 0 if we arrived early.

\[ \widehat{\text{Wait time}} = 5 + 10 \cdot x_\text{(arrived late)} \]
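These estimates can be checked in R. The following is a minimal sketch: the object name `brunch` and the column names `wait`, `arrival`, and `late` are made up here for illustration and simply hold the six visits from the table.

```r
# Six brunch visits from the activity table (object and column names are illustrative)
brunch <- data.frame(
  wait    = c(10, 0, 5, 10, 20, 15),
  arrival = c("Early", "Early", "Early", "Late", "Late", "Late")
)

# Indicator variable: 1 if we arrived late, 0 if we arrived early
brunch$late <- as.numeric(brunch$arrival == "Late")

# Constant-only model: the estimate is the overall sample mean
mod_const <- lm(wait ~ 1, data = brunch)
coef(mod_const)

# Model with the indicator: intercept = early mean, slope = late mean - early mean
mod_late <- lm(wait ~ late, data = brunch)
coef(mod_late)

# Group means for comparison
tapply(brunch$wait, brunch$arrival, mean)
```

The intercept should come out as 5 (the early mean) and the slope as 10 (the late mean minus the early mean), matching the fitted equation.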

Linear Models with a Categorical Explanatory Variable with 2 Levels

Response variable \((y)\): quantitative
Have 1 categorical explanatory variable \((w)\) with two categories
A linear model needs numeric inputs, so we convert \(w\) into a numeric variable \(x\) that takes the value 0 or 1.

y	w	x
10	level A	0
15	level A	0
25	level B	1
7	level A	0
20	level B	1

Model form:

\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]

We often refer to categorical variables in linear models as factors
Think of \(x\) like a switch: our prediction changes depending on whether it’s turned on (\(x = 1\) when \(w = \text{level B}\)) or turned off (\(x = 0\) when \(w = \text{level A}\))
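The conversion from \(w\) to \(x\) can be sketched in R using the five rows of the table above. The object names `fit_x` and `fit_w` are illustrative, and `ifelse()` is just one of several ways to build the indicator.

```r
# Toy data from the table above
y <- c(10, 15, 25, 7, 20)
w <- c("level A", "level A", "level B", "level A", "level B")

# The "switch": 1 when w is level B, 0 when w is level A
x <- ifelse(w == "level B", 1, 0)
data.frame(y, w, x)

# Fitting with the hand-made indicator or with w itself gives the same estimates,
# because lm() builds a 0/1 indicator internally for a categorical variable
fit_x <- lm(y ~ x)
fit_w <- lm(y ~ w)
coef(fit_x)
coef(fit_w)
```

Only the coefficient names differ between the two fits; the estimates are the level A mean and the difference between the level B and level A means.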

Example: Halloween Candy

library(tidyverse)  # provides read_csv(), glimpse(), dplyr, and ggplot2
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv")
glimpse(candy)

Rows: 85
Columns: 13
$ competitorname   <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate        <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ fruity           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
$ caramel          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ peanutyalmondy   <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nougat           <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hard             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
$ bar              <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ pluribus         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

What might be a good categorical explanatory variable for winpercent?

Exploratory Data Analysis

Before building the model, let’s explore and visualize the data!

Q: What dplyr functions should we use to find the mean and sd of winpercent by the categories of chocolate?
Q: What graph should we use to visualize the winpercent scores by chocolate?

Exploratory Data Analysis

# Summarize
candy %>%
  group_by(chocolate) %>%
  summarize(count = n(),
            mean_win = mean(winpercent), 
            sd_win = sd(winpercent))

# A tibble: 2 × 4
  chocolate count mean_win sd_win
      <dbl> <int>    <dbl>  <dbl>
1         0    48     42.1   10.2
2         1    37     60.9   12.8

Exploratory Data Analysis

ggplot(candy, aes(x = factor(chocolate), 
                   y = winpercent, 
                  fill = factor(chocolate))) +
  geom_boxplot() +
  stat_summary(fun = mean,
               geom = "point",
               color = "yellow",
               size = 4) +
  guides(fill = "none") +
  scale_fill_manual(values =
                      c("0" = "deeppink",
                        "1" = "chocolate4")) +
  scale_x_discrete(labels = c("No", "Yes"),
                   name =
          "Does the candy contain chocolate?")

Fit the Linear Regression Model

Model Form:

\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]

When \(x = 0\): \(y = \beta_0 + \epsilon\)

When \(x = 1\): \(y = \beta_0 + \beta_1 + \epsilon\)

mod <- lm(winpercent ~ chocolate, data = candy)
library(moderndive)
get_regression_table(mod)

# A tibble: 2 × 7
  term      estimate std_error statistic p_value lower_ci upper_ci
  <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept     42.1      1.65     25.6        0     38.9     45.4
2 chocolate     18.8      2.50      7.52       0     13.8     23.7

Notes

When the explanatory variable is categorical, \(\beta_0\) and \(\beta_1\) no longer represent the intercept and slope.
Now \(\beta_0\) represents the (population) mean of the response variable when \(x = 0\).
And, \(\beta_1\) represents the change in the (population) mean response going from \(x = 0\) to \(x = 1\).
Can also do prediction:

new_candy <- data.frame(chocolate = c(0, 1))
predict(mod, newdata = new_candy)

       1        2 
42.14226 60.92153
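As a check on the interpretation, plugging both values of \(x\) into the fitted model reproduces the group means from the exploratory summary (up to rounding):

\[ \begin{aligned} x = 0 \text{ (no chocolate)}: \quad \widehat{y} &= 42.1 + 18.8 \cdot 0 = 42.1 \\ x = 1 \text{ (chocolate)}: \quad \widehat{y} &= 42.1 + 18.8 \cdot 1 = 60.9 \end{aligned} \]

So \(\widehat{\beta}_1 = 18.8\) is the difference between the two sample means, \(60.9 - 42.1\).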

New example: Palmer Penguins

library(palmerpenguins)

Take a look at the data

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

We’d like to predict a penguin’s bill length based on their species.

Response variable?

Explanatory variable?

Exploratory data analysis

penguins %>%
  group_by(species) %>%
  summarize(
    avg_bill_length = mean(bill_length_mm,
                           na.rm = TRUE)
    )

# A tibble: 3 × 2
  species   avg_bill_length
  <fct>               <dbl>
1 Adelie               38.8
2 Chinstrap            48.8
3 Gentoo               47.5

ggplot(penguins, 
       aes(x = species,
           y = bill_length_mm, 
           fill = species)) +
  geom_boxplot() +
  scale_fill_manual(values = c("steelblue",
                               "goldenrod", 
                               "plum3")) +
  guides(fill = "none") +
  theme_bw()

How do we handle more than 2 groups???

Boardwork

Fit the model in R

\[ y = \beta_0 + \beta_1 x_{species:Chinstrap} + \beta_2 x_{species:Gentoo} + \epsilon \]

R automatically makes species indicators for us and chooses a reference level.

penguin_mod <- lm(bill_length_mm ~ species, data = penguins)
get_regression_table(penguin_mod)

# A tibble: 3 × 7
  term               estimate std_error statistic p_value lower_ci upper_ci
  <chr>                 <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept             38.8      0.241     161.        0    38.3     39.3 
2 species: Chinstrap    10.0      0.432      23.2       0     9.19    10.9 
3 species: Gentoo        8.71     0.36       24.2       0     8.01     9.42
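To see the indicators R builds, base R's `model.matrix()` can be applied to a small example. This sketch uses only the six penguins listed in the Q2 answer of the reference-level activity later in these slides. `six_penguins` is an illustrative name, and six rows will not reproduce the estimates in the table above.

```r
# Six example rows copied from the slides (not the full penguins data)
six_penguins <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))
)

# factor() orders levels alphabetically, so Adelie comes first and is the reference level
levels(six_penguins$species)

# Design matrix: an intercept column plus one 0/1 indicator per non-reference level
X <- model.matrix(~ species, data = six_penguins)
X
```

There is no separate column for Adelie: an Adelie penguin is the row where both indicators are 0.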

Q: What is the equation for the fitted model?

\[\begin{aligned} \widehat{y} &= \widehat{\beta}_0 + \widehat{\beta}_1 \cdot x_{species:Chinstrap} + \widehat{\beta}_2 \cdot x_{species:Gentoo} \\ &= 38.8 + 10.0 \cdot x_{species:Chinstrap} + 8.71 \cdot x_{species:Gentoo} \end{aligned}\]
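Switching the indicators on and off gives one prediction per species, and each matches the group mean from the exploratory summary (up to rounding):

\[ \begin{aligned} \text{Adelie } (x_{species:Chinstrap} = 0,\ x_{species:Gentoo} = 0): \quad \widehat{y} &= 38.8 \\ \text{Chinstrap } (x_{species:Chinstrap} = 1,\ x_{species:Gentoo} = 0): \quad \widehat{y} &= 38.8 + 10.0 = 48.8 \\ \text{Gentoo } (x_{species:Chinstrap} = 0,\ x_{species:Gentoo} = 1): \quad \widehat{y} &= 38.8 + 8.71 = 47.51 \approx 47.5 \end{aligned} \]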

Remember to diagnose your models!

library(moderndive)
res <- get_regression_points(penguin_mod)
ggplot(res, aes(x = bill_length_mm_hat, y = residual)) +
  geom_point()

Remember to diagnose your models!

ggplot(res, aes(x = residual)) +
  geom_histogram()

ggplot(res, aes(sample = residual)) +
  geom_qq() +
  geom_qq_line()
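When reading these plots for a model with one categorical predictor, it may help to remember that the fitted values are just the group means, so the residual plot shows one vertical strip per group. The sketch below illustrates this with base R on the six example penguins from the Q2 answer later in these slides. `six_penguins` and `fit` are illustrative names, and this is not the full-data model.

```r
# Six example rows copied from the slides (not the full penguins data)
six_penguins <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))
)

fit <- lm(bill_length_mm ~ species, data = six_penguins)

# Fitted values are the group means, so there are only three distinct values
unique(round(fitted(fit), 2))

# Residuals sum to (numerically) zero within each species
tapply(resid(fit), six_penguins$species, sum)
```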

Multiple Linear Regression: A peek into next week

Recall our penguin model

\[ y = \beta_0 + \beta_1 x_{species:Chinstrap} + \beta_2 x_{species:Gentoo} + \epsilon \]

Even though we are using one predictor (species), we now have \(\beta_0,~ \beta_1,\) and \(\beta_2\)!

We recoded the species predictor into two binary predictors
We are actually doing multiple linear regression now
Next week: We’ll formalize and extend multiple linear regression
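The recoding can be made explicit by building the two binary predictors by hand and fitting a model with two explanatory variables. This is a sketch on the six example penguins from the Q2 answer later in these slides. The names `six_penguins`, `x_chinstrap`, `x_gentoo`, `fit_factor`, and `fit_manual` are illustrative.

```r
# Six example rows copied from the slides (not the full penguins data)
six_penguins <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))
)

# Hand-made 0/1 indicators for the two non-reference species
six_penguins$x_chinstrap <- as.numeric(six_penguins$species == "Chinstrap")
six_penguins$x_gentoo    <- as.numeric(six_penguins$species == "Gentoo")

# One categorical predictor ...
fit_factor <- lm(bill_length_mm ~ species, data = six_penguins)
# ... is the same model as two binary predictors
fit_manual <- lm(bill_length_mm ~ x_chinstrap + x_gentoo, data = six_penguins)

coef(fit_factor)
coef(fit_manual)
```

The two sets of estimates agree. Only the coefficient names differ.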

Categorical explanatory variables are tricky! Revisit these examples and get more practice with HW 04 Exercise 3.

Activity: Changing Reference Level

We can change the reference level if we want.

penguins$species <- relevel(penguins$species, ref = "Chinstrap")
penguin_mod <- lm(bill_length_mm ~ species, data = penguins)
get_regression_table(penguin_mod)

# A tibble: 3 × 7
  term            estimate std_error statistic p_value lower_ci upper_ci
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept          48.8      0.359    136.     0        48.1    49.5  
2 species: Adelie   -10.0      0.432    -23.2    0       -10.9    -9.19 
3 species: Gentoo    -1.33     0.447     -2.97   0.003    -2.21   -0.449

Get in groups of ~5 and find a spot on one of the boards.

Q1: Write down the equation for this fitted model.
Q2: Draw a few rows of the data frame with the variables used by the model.
Q3: How is the interpretation of \(\beta_0\) and \(\widehat{\beta}_0\) different now?
Q4: How should we interpret the other two coefficients?

Activity: Changing Reference Level (Answers)

Q1: Write down the equation for this fitted model.

\[ \widehat{y} = 48.8 - 10.0 \cdot x_{species:Adelie} - 1.33 \cdot x_{species:Gentoo} \]
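Plugging in the indicators, using the estimates from the regression table above, gives the same predicted means as the model with Adelie as the reference level (up to rounding):

\[ \begin{aligned} \text{Chinstrap } (x_{species:Adelie} = 0,\ x_{species:Gentoo} = 0): \quad \widehat{y} &= 48.8 \\ \text{Adelie } (x_{species:Adelie} = 1,\ x_{species:Gentoo} = 0): \quad \widehat{y} &= 48.8 - 10.0 = 38.8 \\ \text{Gentoo } (x_{species:Adelie} = 0,\ x_{species:Gentoo} = 1): \quad \widehat{y} &= 48.8 - 1.33 = 47.47 \approx 47.5 \end{aligned} \]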

Q2: Draw a few rows of the data frame with the variables used by the model.

# A tibble: 6 × 4
# Groups:   species [3]
  bill_length_mm species   x_adelie x_gentoo
           <dbl> <fct>        <dbl>    <dbl>
1           46.4 Chinstrap        0        0
2           49.5 Chinstrap        0        0
3           45.8 Adelie           1        0
4           36.4 Adelie           1        0
5           42.6 Gentoo           0        1
6           43.3 Gentoo           0        1

Q3: How is the interpretation of \(\beta_0\) and \(\widehat{\beta}_0\) different now?
- \(\beta_0\) (\(\widehat{\beta}_0\)) now represents (estimates) the population mean bill length for Chinstrap penguins, not Adelie penguins

Q4: How should we interpret the other two coefficients?
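- Each one represents the difference in mean bill length between that species and Chinstrap penguins (the reference level): Adelie minus Chinstrap for \(\beta_1\), and Gentoo minus Chinstrap for \(\beta_2\). The estimates \(\widehat{\beta}_1 = -10.0\) and \(\widehat{\beta}_2 = -1.33\) say that, in this sample, Adelie and Gentoo bills are on average 10.0 mm and 1.33 mm shorter than Chinstrap bills.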
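Changing the reference level changes the coefficients but not the predictions. The sketch below checks this on the six example rows from Q2 rather than the full data, so its estimates differ from the table above. `six_penguins`, `species_c`, `fit_adelie`, and `fit_chinstrap` are illustrative names.

```r
# Six example rows from Q2 (not the full penguins data)
six_penguins <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))
)

# Default reference level: Adelie (first level alphabetically)
fit_adelie <- lm(bill_length_mm ~ species, data = six_penguins)

# Same data with Chinstrap as the reference level
six_penguins$species_c <- relevel(six_penguins$species, ref = "Chinstrap")
fit_chinstrap <- lm(bill_length_mm ~ species_c, data = six_penguins)

# The coefficients differ because they are measured against different baselines ...
coef(fit_adelie)
coef(fit_chinstrap)

# ... but the fitted values (the three group means) are identical
all.equal(fitted(fit_adelie), fitted(fit_chinstrap))
```

In both fits the intercept is the sample mean of the reference species, and the other coefficients are differences from that mean.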