02 Linear model example

Stat 406

Daniel J. McDonald

Last modified – 10 December 2023


Economic mobility

data("mobility", package = "Stat406")
mobility
# A tibble: 741 × 43
      ID Name        Mobility State Population Urban Black Seg_racial Seg_income
   <dbl> <chr>          <dbl> <chr>      <dbl> <dbl> <dbl>      <dbl>      <dbl>
 1   100 Johnson Ci…   0.0622 TN        576081     1 0.021      0.09       0.035
 2   200 Morristown    0.0537 TN        227816     1 0.02       0.093      0.026
 3   301 Middlesbor…   0.0726 TN         66708     0 0.015      0.064      0.024
 4   302 Knoxville     0.0563 TN        727600     1 0.056      0.21       0.092
 5   401 Winston-Sa…   0.0448 NC        493180     1 0.174      0.262      0.072
 6   402 Martinsvil…   0.0518 VA         92753     0 0.224      0.137      0.024
 7   500 Greensboro    0.0474 NC       1055133     1 0.218      0.22       0.068
 8   601 North Wilk…   0.0517 NC         90016     0 0.032      0.114      0.012
 9   602 Galax         0.0796 VA         64676     0 0.029      0.131      0.005
10   700 Spartanburg   0.0431 SC        354533     1 0.207      0.139      0.045
# ℹ 731 more rows
# ℹ 34 more variables: Seg_poverty <dbl>, Seg_affluence <dbl>, Commute <dbl>,
#   Income <dbl>, Gini <dbl>, Share01 <dbl>, Gini_99 <dbl>, Middle_class <dbl>,
#   Local_tax_rate <dbl>, Local_gov_spending <dbl>, Progressivity <dbl>,
#   EITC <dbl>, School_spending <dbl>, Student_teacher_ratio <dbl>,
#   Test_scores <dbl>, HS_dropout <dbl>, Colleges <dbl>, Tuition <dbl>,
#   Graduation <dbl>, Labor_force_participation <dbl>, Manufacturing <dbl>, …

A linear model

\[\mbox{Mobility}_i = \beta_0 + \beta_1 \, \mbox{State}_i + \beta_2 \, \mbox{Urban}_i + \cdots + \epsilon_i\]

or equivalently

\[E \left[ \mbox{Mobility} \mid \mbox{State}, \mbox{Urban}, \ldots \right] = \beta_0 + \beta_1 \, \mbox{State} + \beta_2 \, \mbox{Urban} + \cdots\]
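The two displays say the same thing under the usual assumption that the errors have conditional mean zero,

\[E\left[ \epsilon \mid \mbox{State}, \mbox{Urban}, \ldots \right] = 0,\]

so taking conditional expectations of the first display makes the \(\epsilon_i\) term drop out.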

Analysis

  • Randomly split the data into a training set (say 3/4) and a test set (1/4)

  • Use training set to fit a model

  • Fit the “full” model

  • “Look” at the fit

set.seed(20220914)
mob <- mobility[complete.cases(mobility), ] # keep only rows with no missing values
n <- nrow(mob)
mob <- mob |> select(-Name, -ID, -State) # drop identifiers
set <- sample.int(n, floor(n * .75), FALSE) # indices of a 75% training sample
train <- mob[set, ]
test <- mob[setdiff(1:n, set), ]
full <- lm(Mobility ~ ., data = train) # regress Mobility on all remaining predictors

Results

summary(full)

Call:
lm(formula = Mobility ~ ., data = train)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.072092 -0.010256 -0.001452  0.009170  0.090428 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.849e-01  8.083e-02   2.288 0.022920 *  
Population                 3.378e-09  2.478e-09   1.363 0.173916    
Urban                      2.853e-03  3.892e-03   0.733 0.464202    
Black                      7.807e-02  2.859e-02   2.731 0.006735 ** 
Seg_racial                -5.626e-02  1.780e-02  -3.160 0.001754 ** 
Seg_income                 8.677e-01  9.355e-01   0.928 0.354453    
Seg_poverty               -7.416e-01  5.014e-01  -1.479 0.140316    
Seg_affluence             -2.224e-01  4.763e-01  -0.467 0.640874    
Commute                    6.313e-02  2.838e-02   2.225 0.026915 *  
Income                     4.207e-07  6.997e-07   0.601 0.548112    
Gini                       3.592e+00  3.357e+00   1.070 0.285578    
Share01                   -3.635e-02  3.357e-02  -1.083 0.279925    
Gini_99                   -3.657e+00  3.356e+00  -1.090 0.276704    
Middle_class               1.031e-01  4.835e-02   2.133 0.033828 *  
Local_tax_rate             2.268e-01  2.620e-01   0.866 0.387487    
Local_gov_spending         1.273e-07  3.016e-06   0.042 0.966374    
Progressivity              4.983e-03  1.324e-03   3.764 0.000205 ***
EITC                      -3.324e-04  4.528e-04  -0.734 0.463549    
School_spending           -9.019e-04  2.272e-03  -0.397 0.691658    
Student_teacher_ratio     -1.639e-03  1.123e-03  -1.459 0.145748    
Test_scores                2.487e-04  3.137e-04   0.793 0.428519    
HS_dropout                -1.698e-01  9.352e-02  -1.816 0.070529 .  
Colleges                  -2.811e-02  7.661e-02  -0.367 0.713942    
Tuition                    3.459e-07  4.362e-07   0.793 0.428417    
Graduation                -1.702e-02  1.425e-02  -1.194 0.233650    
Labor_force_participation -7.850e-02  5.405e-02  -1.452 0.147564    
Manufacturing             -1.605e-01  2.816e-02  -5.700  3.1e-08 ***
Chinese_imports           -5.165e-04  1.004e-03  -0.514 0.607378    
Teenage_labor             -1.019e+00  2.111e+00  -0.483 0.629639    
Migration_in               4.490e-02  3.480e-01   0.129 0.897436    
Migration_out             -4.475e-01  4.093e-01  -1.093 0.275224    
Foreign_born               9.137e-02  5.494e-02   1.663 0.097454 .  
Social_capital            -1.114e-03  2.728e-03  -0.408 0.683245    
Religious                  4.570e-02  1.298e-02   3.520 0.000506 ***
Violent_crime             -3.393e+00  1.622e+00  -2.092 0.037373 *  
Single_mothers            -3.590e-01  9.442e-02  -3.802 0.000177 ***
Divorced                   1.707e-02  1.603e-01   0.107 0.915250    
Married                   -5.894e-02  7.246e-02  -0.813 0.416720    
Longitude                 -4.239e-05  2.239e-04  -0.189 0.850001    
Latitude                   6.725e-04  5.687e-04   1.182 0.238037    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.02128 on 273 degrees of freedom
Multiple R-squared:  0.7808,    Adjusted R-squared:  0.7494 
F-statistic: 24.93 on 39 and 273 DF,  p-value: < 2.2e-16

Diagnostic plots

par(mar = c(5, 3, 0, 0))
plot(full, 1)

plot(full, 2)

(Those were plot methods for objects of class lm)
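The plot method for lm objects can produce other diagnostics as well, selected with its which argument; a minimal sketch of two more:

plot(full, which = 3) # scale-location: sqrt(|standardized residuals|) vs fitted values
plot(full, which = 5) # standardized residuals vs leverage, with Cook's distance contours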

Same thing in ggplot

stuff <- tibble(
  residuals = residuals(full),
  fitted = fitted(full),
  stdresiduals = rstandard(full)
)
ggplot(stuff, aes(fitted, residuals)) +
  geom_point(colour = "salmon") +
  geom_smooth(
    se = FALSE,
    colour = "steelblue",
    linewidth = 2
  ) +
  ggtitle("Residuals vs Fitted")

ggplot(stuff, aes(sample = stdresiduals)) +
  geom_qq(colour = "purple", size = 2) +
  geom_qq_line(colour = "peachpuff", linewidth = 2) +
  labs(
    x = "Theoretical quantiles",
    y = "Standardized residuals",
    title = "Normal Q-Q"
  )

Fit a reduced model

reduced <- lm(
  Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +
    Manufacturing + Migration_in + Religious + Single_mothers,
  data = train
)

summary(reduced)$coefficients |> as_tibble()
# A tibble: 9 × 4
   Estimate `Std. Error` `t value` `Pr(>|t|)`
      <dbl>        <dbl>     <dbl>      <dbl>
1  0.166        0.0178        9.36   1.83e-18
2  0.0637       0.0149        4.27   2.62e- 5
3 -0.109        0.0390       -2.79   5.64e- 3
4  0.000500     0.000256      1.95   5.19e- 2
5 -0.216        0.0820       -2.64   8.81e- 3
6 -0.159        0.0202       -7.89   5.65e-14
7 -0.389        0.172        -2.26   2.42e- 2
8  0.0436       0.0105        4.16   4.08e- 5
9 -0.286        0.0466       -6.15   2.44e- 9
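Note that as_tibble() drops the row names, so the coefficient names are missing above. One way to keep them, using the broom package (also used for glance() below), is broom::tidy(), which returns the names in a term column:

broom::tidy(reduced) # same coefficient table, with a `term` column of names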
reduced |>
  broom::glance() |>
  print(width = 120)
# A tibble: 1 × 12
  r.squared adj.r.squared  sigma statistic  p.value    df logLik    AIC    BIC
      <dbl>         <dbl>  <dbl>     <dbl>    <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1     0.718         0.711 0.0229      96.9 5.46e-79     8   743. -1466. -1429.
  deviance df.residual  nobs
     <dbl>       <int> <int>
1    0.159         304   313

Diagnostic plots for reduced model

plot(reduced, 1)

plot(reduced, 2)

How do we decide which model is better?

  • Goodness of fit versus prediction power

map( # smaller AIC is better
  list(full = full, reduced = reduced),
  ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq)
)
$full
          aic           rsq 
-1482.5981023     0.7807509 

$reduced
         aic          rsq 
-1466.088492     0.718245 

  • Use both models to predict Mobility

  • Compare both sets of predictions

mses <- function(preds, obs) round(mean((obs - preds)^2), 5)
c(
  full = mses(
    predict(full, newdata = test),
    test$Mobility
  ),
  reduced = mses(
    predict(reduced, newdata = test),
    test$Mobility
  )
)
   full reduced 
0.00072 0.00084 
test$full <- predict(full, newdata = test)
test$reduced <- predict(reduced, newdata = test)
test |>
  select(Mobility, full, reduced) |>
  pivot_longer(-Mobility) |>
  ggplot(aes(Mobility, value)) +
  geom_point(color = "orange") +
  facet_wrap(~name, 2) +
  xlab("observed mobility") +
  ylab("predicted mobility") +
  geom_abline(slope = 1, intercept = 0, colour = "darkblue")

Next time…

Module 1

Selecting models