

Linear Models II: Accuracy and Assumptions
Megan Ayers
Math 141 | Spring 2026
Wednesday, Week 4

\[ y = f(x) + \epsilon \]
where \(\epsilon\) represents an error term.
Goal:
Determine a reasonable form for \(f()\). (Ex: Line, curve, …)
Estimate \(f()\) with \(\widehat{f}()\) using the data.
Generate predicted values: \(\widehat y = \widehat{f}(x)\).
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Consider this model when:
Response variable \((y)\): quantitative
Explanatory variable \((x)\): quantitative
AND, \(f()\) can be approximated by a line.
Need to determine the best estimates of \(\beta_0\) and \(\beta_1\).
\[ y = \beta_0 + \beta_1 x + \epsilon \]
\[ \widehat{y} = \widehat{ \beta}_0 + \widehat{\beta}_1 x \]
Linear regression is used for 2 main tasks: prediction and inference.
Recall our modeling goal: predict win percentage by using the sugar percentage variable.
library(tidyverse)

candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv") %>%
  mutate(sugarpercent = sugarpercent * 100)

ggplot(data = candy,
       mapping = aes(x = sugarpercent,
                     y = winpercent)) +
  geom_point(alpha = 0.6, size = 4,
             color = "chocolate4") +
  geom_smooth(method = "lm", se = FALSE,
              color = "deeppink2")
Want the residuals, \(e_i = y_i - \widehat{y}_i\), to be small.
Minimize a function of the residuals.
Minimize the sum of squared residuals:
\[ \sum_{i = 1}^n e^2_i \]

Minimizing the sum of squared residuals yields the following equations:
\[ \begin{align} \widehat{\beta}_1 &= \frac{ \sum_{i = 1}^n (x_i - \bar{x}) (y_i - \bar{y})}{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\ \widehat{\beta}_0 &= \bar{y} - \widehat{\beta}_1 \bar{x} \end{align} \] where
\[ \begin{align} \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i \quad \mbox{and} \quad \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i \end{align} \]
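As a quick sketch, these closed-form estimates can be checked against `lm()` on made-up data (the values below are purely illustrative, not from the candy data):

```r
set.seed(1)
x <- runif(30, 0, 100)                  # hypothetical explanatory values
y <- 40 + 0.1 * x + rnorm(30, sd = 5)   # hypothetical response with noise

# Least squares estimates from the formulas above
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# These agree with the coefficients lm() computes
c(b0, b1)
coef(lm(y ~ x))
```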
Then we can estimate the whole function with:
\[ \widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 x \]
Called the least squares line, regression line, or the line of best fit.
We can use the lm() function to construct the simple linear regression model in R and the get_regression_table() function from moderndive to interpret it.
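A sketch of that workflow (the model name `candy_mod` is chosen here for illustration; downloading the CSV requires internet access):

```r
library(tidyverse)
library(moderndive)

# Load the candy data and rescale sugar percentage to 0-100
candy <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/candy-power-ranking/candy-data.csv") %>%
  mutate(sugarpercent = sugarpercent * 100)

# Fit the simple linear regression and summarize it
candy_mod <- lm(winpercent ~ sugarpercent, data = candy)
get_regression_table(candy_mod)
```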
What is the fitted model form?
\[\begin{align*} \widehat{y} &= \widehat{\beta_0} + \widehat{\beta_1} \times x_{sugarpercent} \\ &= 44.6094 + 0.1192 \times x_{sugarpercent} \end{align*}\]
Q: How do we interpret the coefficients?
\[\begin{align*} \widehat{y} &= \widehat{\beta_0} + \widehat{\beta_1} \times x_{sugarpercent} \\ &= 44.6094 + 0.1192 \times x_{sugarpercent} \end{align*}\]
We need to be precise and careful when interpreting estimated coefficients!
Intercept: We expect/predict \(y\) to be \(\widehat{\beta}_0\) on average when \(x = 0\).
Slope: For a one-unit increase in \(x\), we expect/predict \(y\) to change by \(\widehat{\beta}_1\) units on average.
These generic interpretations ignore the context of our model; in practice, we always need to interpret the coefficients in context.
\[\begin{align*} \widehat{y} &= \widehat{\beta_0} + \widehat{\beta_1} \times x_{sugarpercent} \\ &= 44.6094 + 0.1192 \times x_{sugarpercent} \end{align*}\]
Intercept: We expect/predict a candy’s win percentage to be 44.6094 on average when its sugar percentage is 0.
Slope: For a one percentage point increase in sugar percentage, we expect/predict the win percentage of a candy to increase by 0.1192 on average.
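For example, plugging a hypothetical candy with 50% sugar into the fitted equation (the value 50 is made up for illustration):

```r
# Fitted coefficients from the candy model
b0 <- 44.6094
b1 <- 0.1192

# Predicted win percentage at sugarpercent = 50
b0 + b1 * 50   # 50.5694
```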
Example predicted win percentages:
       1        2        3 
47.23269 53.79082 62.49524
Be careful to only predict values within the range of \(x\) values in the sample (avoid extrapolation).
Make sure to investigate outliers: observations that fall far from the cloud of points.

# A tibble: 2 × 7
  term         estimate std_error statistic p_value lower_ci upper_ci
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept      44.609     3.086    14.455   0        38.471   50.748
2 sugarpercent    0.119     0.056     2.145   0.035     0.009    0.23
\[\begin{align*} \widehat{y} &= \widehat{\beta_0} + \widehat{\beta_1} \times x_{sugarpercent} \\ &= 44.6094 + 0.1192 \times x_{sugarpercent} \end{align*}\]
What assumptions have we made?
We can always find the line of best fit to explore data, but…
To make accurate predictions or inferences, certain conditions should be met.
To responsibly use linear regression tools for prediction or inference, we require:
Linearity: The relationship between the explanatory and response variables must be approximately linear.
Independence: The observations should be independent of one another.
Normality: The distribution of residuals should be approximately bell-shaped, unimodal, symmetric, and centered at 0 at every “slice” of the explanatory variable.
Equal Variability: The variance of the residuals should be roughly constant across the data set. This is also called “homoscedasticity”; models that violate this assumption are sometimes called “heteroscedastic”.
Linearity
Independence
Normality
Equal Variability
(Diagnostic plots: \(y\) vs. \(x\) scatterplot, residual plot, residual histogram, Q-Q plot)
Linearity: The relationship between explanatory and response variables must be approximately linear


library(ggplot2)
library(moderndive)

# Creating the model (my_df is an example data frame with columns y1 and x)
lm1 <- lm(data = my_df, y1 ~ x)

# Pulling out model inputs, residuals, and fitted values
res1 <- get_regression_points(lm1)

# Plotting residuals vs fitted values
(g1 <- ggplot(res1, aes(x = y1_hat, y = residual)) +
    geom_point() +
    theme_bw() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = "fitted value", y = "residual", title = "Linear"))
Independence: The observations should be independent of one another

Normality: The distribution of residuals should be bell-shaped, unimodal, symmetric, and centered at 0 at every “slice” of the explanatory variable

Normality: The distribution of residuals should be bell-shaped, unimodal, symmetric, and centered at 0 at every “slice” of the explanatory variable

If residuals are non-Normal…
Normality: The distribution of residuals should be “Normal”
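One common way to check Normality is a Q-Q plot of the residuals. A sketch, using R’s built-in `cars` dataset as a stand-in (not the candy data):

```r
# Stand-in example: regress stopping distance on speed
lm1 <- lm(dist ~ speed, data = cars)

# Q-Q plot of residuals: points close to the reference line
# suggest the residuals are roughly Normal
qqnorm(resid(lm1))
qqline(resid(lm1))
```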
Equal Variability: Variance of residuals should be approximately constant across the data

Remember the outlier example:

What do diagnostics look like when we fit the teal model?

candy simple linear regression model
Let’s check if this meets the LINE assumptions
Q: First, what does this graph tell us about Linearity?
A common measure of the strength of a linear model is the coefficient of determination \(R^2\), (aka “R-squared”).
\[ R^2 = \frac{ \overbrace{s_y^2 - s_{e}^2}^{\textrm{Variation in } y \textrm{ explained by } x}}{ \underbrace{s_y^2}_{\textrm{Variation in } y}} \ \ \ \ \ \ \ \ \ \ \ \ (\text{Reminder: } \text{Variance} = (\text{Standard Deviation})^2) \]
If \(R^2 \approx 1\): nearly all the variability in response is explained by variability in the explanatory variable.

If \(R^2 \approx 0\): almost none of the variability in response is explained by variability in the explanatory variable.
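A sketch of this computation, again using R’s built-in `cars` dataset as a stand-in, checking the variance formula against the \(R^2\) that `summary()` reports:

```r
lm1 <- lm(dist ~ speed, data = cars)

s2_y <- var(cars$dist)      # variation in y
s2_e <- var(resid(lm1))     # variation left over in the residuals

(s2_y - s2_e) / s2_y        # R^2 from the variance formula
summary(lm1)$r.squared      # same value
```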
