
Linear Models III: Categorical Predictors
Megan Ayers
Math 141 | Spring 2026
Friday, Week 4
So far we’ve considered the model \(y = f(x) + \epsilon\) when:
Response variable \((y)\): quantitative
Explanatory variable \((x)\): quantitative
and \(f\) can be approximated by a line:
\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]
Linear regression is a flexible class of models that allow for:
Both quantitative and categorical explanatory variables.
Multiple explanatory variables.
Where the response variable is quantitative.
You’re planning when to head to brunch, and want to understand how long you should expect to wait for a table. Info from your previous experiences is below:
| Day | Wait (min) | Arrival |
|---|---|---|
| 1 | 10 | Early |
| 2 | 0 | Early |
| 3 | 5 | Early |
| 4 | 10 | Late |
| 5 | 20 | Late |
| 6 | 15 | Late |
The simplest model would predict wait time using a constant value:
\[ y = \beta_0 + \epsilon \]
We can make a slightly more complicated model:
\[ y = \beta_0 + \beta_1x_\text{(arrived late)} + \epsilon \] where \(x_\text{(arrived late)}\) is either 0 or 1.
\[ \widehat{\text{Wait time}} = 5 + 10 \cdot x_\text{(arrived late)} \]
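As a rough check on the two brunch models above, here is a minimal sketch in base R. The data frame name `brunch` and the column names `wait` and `arrival` are chosen here for illustration; the six values come from the table above.

```r
# Brunch data from the table above
brunch <- data.frame(
  wait    = c(10, 0, 5, 10, 20, 15),
  arrival = factor(c("Early", "Early", "Early", "Late", "Late", "Late"))
)

# Constant model: the single coefficient is the overall mean wait
fit_constant <- lm(wait ~ 1, data = brunch)
coef(fit_constant)

# Indicator model: R codes "Late" as 1 and "Early" as 0,
# because "Early" comes first alphabetically
fit_arrival <- lm(wait ~ arrival, data = brunch)
coef(fit_arrival)
```

The constant model estimates \(\widehat{\beta}_0 = 10\), the mean of all six waits. The indicator model estimates \(\widehat{\beta}_0 = 5\) (the mean early wait) and \(\widehat{\beta}_1 = 10\) (the late mean of 15 minus the early mean of 5), matching the fitted equation above.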
Response variable \((y)\): quantitative
Explanatory variable \((w)\): categorical, with two categories
The model equation needs numeric inputs, so we convert \(w\) into a numeric indicator variable. Call this \(x\); it takes the value 0 for one category and 1 for the other.
| y | w | x |
|---|---|---|
| 10 | level A | 0 |
| 15 | level A | 0 |
| 25 | level B | 1 |
| 7 | level A | 0 |
| 20 | level B | 1 |
\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]
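One way to build the indicator by hand is sketched below in base R, using the five rows of the small table above. The data frame name `toy` is hypothetical, and treating "level B" as the category coded 1 simply follows that table.

```r
# The five rows from the table above
toy <- data.frame(
  y = c(10, 15, 25, 7, 20),
  w = c("level A", "level A", "level B", "level A", "level B")
)

# Indicator variable: 1 when w is "level B", 0 otherwise
toy$x <- as.numeric(toy$w == "level B")
toy
```

In practice `lm()` builds this column internally when given a categorical predictor, so the manual step is mainly useful for seeing what the model works with.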
Rows: 85
Columns: 13
$ competitorname <chr> "100 Grand", "3 Musketeers", "One dime", "One quarter…
$ chocolate <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ fruity <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
$ caramel <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ peanutyalmondy <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nougat <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hard <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
$ bar <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ pluribus <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
$ sugarpercent <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…
What might be a good categorical explanatory variable for winpercent?
Before building the model, let’s explore and visualize the data!
Q: What dplyr functions should we use to find the mean and sd of winpercent by the categories of chocolate?
Q: What graph should we use to visualize the winpercent scores by chocolate?
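For the summary-statistics question, one possible answer is `group_by()` followed by `summarize()`. The sketch below assumes the dplyr package is installed and, to stay self-contained, uses only the first five candies shown in the glimpse output above (the name `candy_head` is hypothetical), so the numbers are not the full-data summaries.

```r
library(dplyr)

# First five candies from the glimpse output above (illustration only)
candy_head <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0),
  winpercent = c(66.97173, 67.60294, 32.26109, 46.11650, 52.34146)
)

# Mean, standard deviation, and count of winpercent within each chocolate group
candy_summary <- candy_head %>%
  group_by(chocolate) %>%
  summarize(mean_win = mean(winpercent),
            sd_win   = sd(winpercent),
            n        = n())
candy_summary
```

Replacing `candy_head` with the full `candy` data frame would give the group summaries for all 85 candies.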
ggplot(candy, aes(x = factor(chocolate),
                  y = winpercent,
                  fill = factor(chocolate))) +
  geom_boxplot() +
  # Yellow point marks the mean of each group
  stat_summary(fun = mean,
               geom = "point",
               color = "yellow",
               size = 4) +
  guides(fill = "none") +
  scale_fill_manual(values = c("0" = "deeppink",
                               "1" = "chocolate4")) +
  scale_x_discrete(labels = c("No", "Yes"),
                   name = "Does the candy contain chocolate?")
Model Form:
\[ \begin{align} y &= \beta_0 + \beta_1 x + \epsilon \end{align} \]
When \(x = 0\): \(y = \beta_0 + \epsilon\)
When \(x = 1\): \(y = \beta_0 + \beta_1 + \epsilon\)
When the explanatory variable is categorical, \(\beta_0\) and \(\beta_1\) no longer represent the intercept and slope.
Now \(\beta_0\) represents the (population) mean of the response variable when \(x = 0\).
And, \(\beta_1\) represents the change in the (population) mean response going from \(x = 0\) to \(x = 1\).
Can also do prediction:
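The candy model's estimates are not shown here, so the sketch below illustrates prediction with the five-row toy table from earlier instead (the names `toy`, `fit`, and `new_obs` are hypothetical). With a single 0/1 predictor, the predicted value for each group is that group's sample mean.

```r
# Toy data: y is the response, x is the 0/1 indicator
toy <- data.frame(
  y = c(10, 15, 25, 7, 20),
  x = c(0, 0, 1, 0, 1)
)

fit <- lm(y ~ x, data = toy)
coef(fit)  # intercept = mean of y when x = 0; slope = difference in group means

# Predicted response for a new observation in each group
new_obs <- data.frame(x = c(0, 1))
preds <- predict(fit, newdata = new_obs)
preds
```

The first prediction equals \(\widehat{\beta}_0\) and the second equals \(\widehat{\beta}_0 + \widehat{\beta}_1\), which are the sample means of \(y\) in the \(x = 0\) and \(x = 1\) groups.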

Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
We’d like to predict a penguin’s bill length based on its species.
Response variable?
Explanatory variable?
How do we handle more than 2 groups???
\[ y = \beta_0 + \beta_1 x_{species:Chinstrap} + \beta_2 x_{species:Gentoo} + \epsilon \]
R automatically makes species indicators for us and chooses a reference level.
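To see the indicators R builds, `model.matrix()` can be applied to a factor directly. This is a minimal sketch using a small made-up vector of species labels rather than the full penguins data; by default R sorts factor levels alphabetically and treats the first level as the reference.

```r
# A small, hypothetical set of species labels
species <- factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))

# Levels are sorted alphabetically; the first one is the reference level
levels(species)

# Design matrix: an intercept column plus one indicator per non-reference level
X <- model.matrix(~ species)
X
```

Adelie gets no column of its own: an Adelie penguin is the case where both indicators are 0.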
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 38.8 | 0.241 | 161. | 0 | 38.3 | 39.3 |
| species: Chinstrap | 10.0 | 0.432 | 23.2 | 0 | 9.19 | 10.9 |
| species: Gentoo | 8.71 | 0.36 | 24.2 | 0 | 8.01 | 9.42 |
\[\begin{aligned} \widehat{y} &= \widehat{\beta}_0 + \widehat{\beta}_1 \cdot x_{species:Chinstrap} + \widehat{\beta}_2 \cdot x_{species:Gentoo} \\ &= 38.8 + 10.0 \cdot x_{species:Chinstrap} + 8.71 \cdot x_{species:Gentoo} \end{aligned}\]
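Plugging each species' indicator values into the fitted equation gives one predicted bill length (in mm) per species. Adelie is the reference level here, so both indicators are 0 for an Adelie penguin:

\[\begin{aligned} \text{Adelie: } \widehat{y} &= 38.8 + 10.0 \cdot 0 + 8.71 \cdot 0 = 38.8 \\ \text{Chinstrap: } \widehat{y} &= 38.8 + 10.0 \cdot 1 + 8.71 \cdot 0 = 48.8 \\ \text{Gentoo: } \widehat{y} &= 38.8 + 10.0 \cdot 0 + 8.71 \cdot 1 = 47.51 \end{aligned}\]

So \(\widehat{\beta}_0\) is the estimated mean for the reference species, and each other coefficient is the estimated difference in mean between that species and the reference.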
Recall our penguin model
\[ y = \beta_0 + \beta_1 x_{species:Chinstrap} + \beta_2 x_{species:Gentoo} + \epsilon \]
Even though we are using one predictor (species), we now have \(\beta_0,~ \beta_1,\) and \(\beta_2\)!
We can change the reference level if we want.
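One way to change the reference level is base R's `relevel()`; this is a sketch, not necessarily the function used to produce the table in these slides. It uses the six example penguins listed later in these slides (the data frame name `penguins_small` is hypothetical), so the estimates will differ from the full-data table.

```r
# Six example penguins (illustration only, not the full data set)
penguins_small <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = factor(c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo"))
)

# Make Chinstrap the reference level instead of the alphabetical default (Adelie)
penguins_small$species <- relevel(penguins_small$species, ref = "Chinstrap")
levels(penguins_small$species)

# The intercept is now the Chinstrap mean; the other coefficients
# are differences from Chinstrap
fit <- lm(bill_length_mm ~ species, data = penguins_small)
coef(fit)
```

Changing the reference level changes which comparisons the coefficients report, but not the predicted mean for any species.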
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 48.8 | 0.359 | 136. | 0 | 48.1 | 49.5 |
| species: Adelie | -10.0 | 0.432 | -23.2 | 0 | -10.9 | -9.19 |
| species: Gentoo | -1.33 | 0.447 | -2.97 | 0.003 | -2.21 | -0.449 |
Get in groups of ~5 and find a spot on one of the boards.
\[ \widehat{y} = 48.8 - 10.0 \cdot x_{species:Adelie} - 1.3 \cdot x_{species:Gentoo} \]
| bill_length_mm | species | x_adelie | x_gentoo |
|---|---|---|---|
| 46.4 | Chinstrap | 0 | 0 |
| 49.5 | Chinstrap | 0 | 0 |
| 45.8 | Adelie | 1 | 0 |
| 36.4 | Adelie | 1 | 0 |
| 42.6 | Gentoo | 0 | 1 |
| 43.3 | Gentoo | 0 | 1 |
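As a sketch of how the indicator columns and the fitted equation above combine, the base R code below rebuilds `x_adelie` and `x_gentoo` for the six penguins and computes each predicted bill length. The data frame name `penguins_small` and the `predicted` and `residual` columns are added here for illustration; the coefficients are the rounded values from the equation above.

```r
# The six penguins from the table above
penguins_small <- data.frame(
  bill_length_mm = c(46.4, 49.5, 45.8, 36.4, 42.6, 43.3),
  species = c("Chinstrap", "Chinstrap", "Adelie", "Adelie", "Gentoo", "Gentoo")
)

# Indicator columns; Chinstrap is the reference, so it gets 0 in both
penguins_small$x_adelie <- as.numeric(penguins_small$species == "Adelie")
penguins_small$x_gentoo <- as.numeric(penguins_small$species == "Gentoo")

# Predicted bill length from the fitted equation above
penguins_small$predicted <- 48.8 -
  10.0 * penguins_small$x_adelie -
  1.3 * penguins_small$x_gentoo

# Residual = observed minus predicted
penguins_small$residual <- penguins_small$bill_length_mm - penguins_small$predicted
penguins_small
```

Every penguin of the same species receives the same prediction: 48.8 for Chinstrap, 38.8 for Adelie, and 47.5 for Gentoo.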