Logistic Regression

Announcement
Acknowledgement
Heart disease example
Logistic regression
Fitting logistic regression
Interpretation
Inference (analysis of deviance)
Confidence intervals
Diagnostics
Goodness of fit
- Hosmer-Lemeshow statistic
- ROC curve
Model selection by AIC
Model selection by lasso

rm(list = ls()) # clean-up workspace
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(faraway)
sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] faraway_1.0.7   forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
##  [5] purrr_0.3.4     readr_2.0.2     tidyr_1.1.4     tibble_3.1.5   
##  [9] ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7       lubridate_1.8.0  lattice_0.20-45  assertthat_0.2.1
##  [5] digest_0.6.28    utf8_1.2.2       R6_2.5.1         cellranger_1.1.0
##  [9] backports_1.3.0  reprex_2.0.1     evaluate_0.14    httr_1.4.2      
## [13] pillar_1.6.4     rlang_0.4.12     readxl_1.3.1     minqa_1.2.4     
## [17] rstudioapi_0.13  nloptr_1.2.2.2   jquerylib_0.1.4  Matrix_1.3-4    
## [21] rmarkdown_2.11   splines_4.1.1    lme4_1.1-27.1    munsell_0.5.0   
## [25] broom_0.7.10     compiler_4.1.1   modelr_0.1.8     xfun_0.27       
## [29] pkgconfig_2.0.3  htmltools_0.5.2  tidyselect_1.1.1 fansi_0.5.0     
## [33] crayon_1.4.2     tzdb_0.2.0       dbplyr_2.1.1     withr_2.4.2     
## [37] MASS_7.3-54      grid_4.1.1       nlme_3.1-153     jsonlite_1.7.2  
## [41] gtable_0.3.0     lifecycle_1.0.1  DBI_1.1.1        magrittr_2.0.1  
## [45] scales_1.1.1     cli_3.1.0        stringi_1.7.5    fs_1.5.0        
## [49] xml2_1.3.2       bslib_0.3.1      ellipsis_0.3.2   generics_0.1.1  
## [53] vctrs_0.3.8      boot_1.3-28      tools_4.1.1      glue_1.4.2      
## [57] hms_1.1.1        fastmap_1.1.0    yaml_2.2.1       colorspace_2.0-2
## [61] rvest_1.0.2      knitr_1.36       haven_2.4.3      sass_0.4.0

Announcement

Mid-term report (2 - 4 pages) by the end of week 12, reference lecture note
Presentation order:

set.seed(7360)
students <- c("Daniela-Florez", "Cong-Zhao", "Viraj-Puri", "Nadim-Hashmeh", 
              "Danting-Li", "Zhipeng-Chen", "Abigail-Zion", "Haoqi-Chen",
              "Christian-Frederiksen", "Zehao-Wang", "Alan-Braeley", "Keyan-Chen",
              "Jose-Silvestre", "Rebekah-Albach", "Travis-McDonald", "Haiyu-Niu",
              "Yuwei-Bao", "Oliver-Orejola", "Paul-Pluscht") %>%
  sample() %>%
  matrix(4, 5, byrow = TRUE, dimnames = list(c("1st", "2nd", "3rd", "4th"),
                                             c("Dec 10", "Dec 13", "Dec 15", "Dec 17", "Dec 20"))) %>%
  as_tibble(rownames = NA)
students[4, 5] <- ""
print(students)

## # A tibble: 4 × 5
##   `Dec 10`     `Dec 13`              `Dec 15`       `Dec 17`        `Dec 20`    
##   <chr>        <chr>                 <chr>          <chr>           <chr>       
## 1 Yuwei-Bao    Viraj-Puri            Daniela-Florez Cong-Zhao       "Jose-Silve…
## 2 Haoqi-Chen   Zhipeng-Chen          Nadim-Hashmeh  Travis-McDonald "Keyan-Chen"
## 3 Paul-Pluscht Christian-Frederiksen Oliver-Orejola Alan-Braeley    "Rebekah-Al…
## 4 Haiyu-Niu    Abigail-Zion          Danting-Li     Zehao-Wang      ""

You could switch with your classmates for presentation date/order
12 min presentation (10 min talk + 2 min questions)
- all on zoom to save transition time
Please send me an email if I didn’t notice you are graduating this semester.
- Final report due for graduating students 12/22/2021 (grades due 12/23/2021)
- Final report due 01/03/2022 (grades due 01/05/2022)
HW2 solutions posted

Acknowledgement

Dr. Hua Zhou’s slides

Heart disease example

The dataframe wcgs in faraway package contains data from the Western Collaborative Group Study.

wcgs %>% head(10)

##      age height weight sdp dbp chol behave cigs dibep chd  typechd timechd
## 2001  49     73    150 110  76  225     A2   25     B  no     none    1664
## 2002  42     70    160 154  84  177     A2   20     B  no     none    3071
## 2003  42     69    160 110  78  181     B3    0     A  no     none    3071
## 2004  41     68    152 124  78  132     B4   20     A  no     none    3064
## 2005  59     70    150 144  86  255     B3   20     A yes infdeath    1885
## 2006  44     72    204 150  90  182     B4    0     A  no     none    3102
## 2007  44     72    164 130  84  155     B4    0     A  no     none    3074
## 2008  40     71    150 138  60  140     A2    0     B  no     none    3071
## 2009  43     72    190 146  76  149     B3   25     A  no     none    3064
## 2010  42     70    175 132  90  325     A2    0     B  no     none    1032
##        arcus
## 2001  absent
## 2002 present
## 2003  absent
## 2004  absent
## 2005 present
## 2006  absent
## 2007  absent
## 2008  absent
## 2009  absent
## 2010 present

We convert the dataframe into a tibble for compatibility with tidyverse.

wcgs <- wcgs %>% 
  as_tibble() %>%
  print(width = Inf)

## # A tibble: 3,154 × 13
##      age height weight   sdp   dbp  chol behave  cigs dibep chd   typechd 
##    <int>  <int>  <int> <int> <int> <int> <fct>  <int> <fct> <fct> <fct>   
##  1    49     73    150   110    76   225 A2        25 B     no    none    
##  2    42     70    160   154    84   177 A2        20 B     no    none    
##  3    42     69    160   110    78   181 B3         0 A     no    none    
##  4    41     68    152   124    78   132 B4        20 A     no    none    
##  5    59     70    150   144    86   255 B3        20 A     yes   infdeath
##  6    44     72    204   150    90   182 B4         0 A     no    none    
##  7    44     72    164   130    84   155 B4         0 A     no    none    
##  8    40     71    150   138    60   140 A2         0 B     no    none    
##  9    43     72    190   146    76   149 B3        25 A     no    none    
## 10    42     70    175   132    90   325 A2         0 B     no    none    
##    timechd arcus  
##      <int> <fct>  
##  1    1664 absent 
##  2    3071 present
##  3    3071 absent 
##  4    3064 absent 
##  5    1885 present
##  6    3102 absent 
##  7    3074 absent 
##  8    3071 absent 
##  9    3064 absent 
## 10    1032 present
## # … with 3,144 more rows

For now, we focus just on variables
- chd, whether the person develops coronary heard disease or not,
- height, height of the person in inches,
- cigs, number of cigarettes smoked per day.

wcgs %>%
  select(chd, height, cigs) %>%
  summary()

##   chd           height           cigs     
##  no :2897   Min.   :60.00   Min.   : 0.0  
##  yes: 257   1st Qu.:68.00   1st Qu.: 0.0  
##             Median :70.00   Median : 0.0  
##             Mean   :69.78   Mean   :11.6  
##             3rd Qu.:72.00   3rd Qu.:20.0  
##             Max.   :78.00   Max.   :99.0

We use side-by-side boxplots to summarize the qulitative variable chd and quantitative variables height and cigs

ggplot(data = wcgs) +
  geom_boxplot(mapping = aes(x = chd, y = height))

and number of cigarretes smoked per day

ggplot(data = wcgs) +
  geom_boxplot(mapping = aes(x = chd, y = cigs))

It seems more cigarettes is associated with heard disease, but not height. How can we formally analyze this? If we use linear regression (straight line) for the anlaysis, the line will eventually extends beyond the [0, 1] range, making interpretation hard.

ggplot(data = wcgs) +
  geom_point(mapping = aes(x = cigs, y = chd))

Logistic regression

Bernoulli model for a binary response \[ Y_i = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{with probability } 1 - p_i \end{cases} \]
The parameter $p_i = \mathbb{E}(Y_i)$ will be related to the predictors $X_1, \ldots, X_{q}$ via an inverse link function \[ p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}, \] where $\eta_i$ is the linear predictor or systematic component \[ \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{q} x_{iq} = \mathbf{x}_i^T \boldsymbol{\beta} \] with \[ \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_q \end{pmatrix}, \quad \mathbf{x}_i = \begin{pmatrix} 1 \\ x_{i1} \\ \vdots \\ x_{iq} \end{pmatrix}. \]
The function \[ \eta = g(p) = \log \left( \frac{p}{1-p} \right) \] that links $\mathbb{E}(Y)$ to the systematic component is called the link function. This particular link function is also called the logit function.
The function \[ p = g^{-1}(\eta) = \frac{e^\eta}{1 + e^\eta} \] is called the inverse link function. This particular function (inverse logit) is also called the logistic function. A graph of the logistic function:

ggplot(data = tibble(x = 0), mapping = aes(x = x)) + # null data
  stat_function(fun = ilogit) + # ilogit is from faraway
  xlim(-6, 6) + 
  labs(x = expression(eta), y = "p", title = "Logistic function (inverse link)")

# curve(ilogit(x), -6, 6, xlab = expression(eta), ylab = "p")

Therefore above model is called the logistic regression.

Fitting logistic regression

We use method of maximum likelihood (MLE) to estimate the parameters $\beta_0, \ldots, \beta_q$.
Given $n$ data points $(y_i, \mathbf{x}_i)$, $i=1,\ldots,n$, the log-likelihood is \[\begin{eqnarray*} \ell(\boldsymbol{\beta}) &=& \sum_i \log \left[p_i^{y_i} (1 - p_i)^{1 - y_i}\right] \\ &=& \sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] \\ &=& \sum_i \left[ y_i \log \frac{e^{\eta_i}}{1 + e^{\eta_i}} + (1 - y_i) \log \frac{1}{1 + e^{\eta_i}} \right] \\ &=& \sum_i \left[ y_i \eta_i - \log (1 + e^{\eta_i}) \right] \\ &=& \sum_i \left[ y_i \cdot \mathbf{x}_i^T \boldsymbol{\beta} - \log (1 + e^{\mathbf{x}_i^T \boldsymbol{\beta}}) \right]. \end{eqnarray*}\]
Maximization of this log-likehood function can be carried out by the Newton-Raphson (also known as Fisher scoring) algorithm.

(lmod <- glm(chd ~ height + cigs, family = binomial, wcgs))

## 
## Call:  glm(formula = chd ~ height + cigs, family = binomial, data = wcgs)
## 
## Coefficients:
## (Intercept)       height         cigs  
##    -4.50161      0.02521      0.02313  
## 
## Degrees of Freedom: 3153 Total (i.e. Null);  3151 Residual
## Null Deviance:       1781 
## Residual Deviance: 1749  AIC: 1755

Summary of the result:

(lmod_sm <- summary(lmod))

## 
## Call:
## glm(formula = chd ~ height + cigs, family = binomial, data = wcgs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0041  -0.4425  -0.3630  -0.3499   2.4357  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.50161    1.84186  -2.444   0.0145 *  
## height       0.02521    0.02633   0.957   0.3383    
## cigs         0.02313    0.00404   5.724 1.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1781.2  on 3153  degrees of freedom
## Residual deviance: 1749.0  on 3151  degrees of freedom
## AIC: 1755
## 
## Number of Fisher Scoring iterations: 5

Interpretation

Exercise: Before we attempt to interpret the results from logistic regression, we first need to understand how the data are transformed to $(y_i, \mathbf{x}_i)$.

# dataframe
wcgs %>%
  select(chd, height, cigs) %>%
  head(10)

## # A tibble: 10 × 3
##    chd   height  cigs
##    <fct>  <int> <int>
##  1 no        73    25
##  2 no        70    20
##  3 no        69     0
##  4 no        68    20
##  5 yes       70    20
##  6 no        72     0
##  7 no        72     0
##  8 no        71     0
##  9 no        72    25
## 10 no        70     0

# response
lmod$y %>% head(10)

##  1  2  3  4  5  6  7  8  9 10 
##  0  0  0  0  1  0  0  0  0  0

# predictors
model.matrix(lmod) %>% head(10)

##    (Intercept) height cigs
## 1            1     73   25
## 2            1     70   20
## 3            1     69    0
## 4            1     68   20
## 5            1     70   20
## 6            1     72    0
## 7            1     72    0
## 8            1     71    0
## 9            1     72   25
## 10           1     70    0

How to interpret the regression coefficients in logistic regression? Remember \[ \log \left( \frac{p}{1-p} \right) = \beta_0 + \beta_1 \cdot \text{height} + \beta_2 \cdot \text{cigs}. \] The quantity \[ o = \frac{p}{1-p} \] is called odds (of an event).

Therefore $\beta_1$ can be interpreted as a unit increase in $x_1$ with $x_2$ held fixed increases the log-odds of success by $\beta_1$, or increase the odds of success by a factor of $e^{\beta_1}$.
The gtsummary package presents regression results in a much nicer way that facilitates interpretation. Summarize the log-odds:

library(gtsummary)
lmod %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)

Characteristic	log(OR)¹	95% CI¹	p-value
height	0.03	-0.03, 0.08	0.3
cigs	0.02	0.02, 0.03	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Summarize the odds:

lmod %>%
  tbl_regression(intercept = TRUE, exponentiate = TRUE) %>%
  bold_labels() %>%
  bold_p(t = 0.05)

Characteristic	OR¹	95% CI¹	p-value
(Intercept)	0.01	0.00, 0.40	0.015
height	1.03	0.97, 1.08	0.3
cigs	1.02	1.02, 1.03	<0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Exercise: Interpret the regression coefficients from wcgs fit.

# same as lmod$coefficients
# coef(lmod) is a named numeric vector
(beta_hat <- unname(coef(lmod)))

## [1] -4.50161397  0.02520779  0.02312740

exp(beta_hat)

## [1] 0.01109108 1.02552819 1.02339691

How to interpret the effect of a pack a day (20 cigarettes) on heart disease?

exp(beta_hat[3] * 20)

## [1] 1.588115

Suppose the probability of success in the presence of some condition is $p_1$ and $p_2$ in its absence. The relative risk or risk ratio is $p_1 / p_2$. For example, the predicted probability of a 68in tall person who smokes a pack (20 cigarettes) a day and who does not smoke are, respectively

(p1 <- ilogit(sum(beta_hat * c(1, 68, 20))))

## [1] 0.08907868

and

(p2 <- ilogit(sum(beta_hat * c(1, 68, 0))))

## [1] 0.05800425

Then the relative risk is

p1 / p2

## [1] 1.535727

When the probability of event is very small (rare disease assumption), i.e., $p_1, p_2 \approx 0$, then the odds ratio is approximately equal to the risk ratio \[ \frac{o_1}{o_2} = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} \approx \frac{p_1}{p_2}. \]

Inference (analysis of deviance)

The deviance of a logistic regression fit is \[\begin{eqnarray*} D &=& 2 \sum_i \left[ y_i \log y_i + (1 - y_i) \log (1 - y_i) \right] \\ && - 2 \sum_i \left[ y_i \log \widehat{p}_i + (1 - y_i) \log (1 - \widehat{p}_i) \right] \\ &=& - 2 \sum_i \left[ y_i \log \widehat{p}_i + (1 - y_i) \log (1 - \widehat{p}_i) \right]. \end{eqnarray*}\] It comes from the likelihood ratio test (LRT) statistic \[ 2 \log \frac{L_{\Omega}}{L_{\omega}}, \] where $\Omega$ is the full/saturated model (same number of parameters as observations) and $\omega$ is the smaller model.
The usual goodness of fit test using $\chi_{n-q-1}^2$ asymptotic null distribution can not be applied here since we only have a single observation for each predictor pattern. This is different from the binomial model in next chapter. The Hosmer-Lemeshow test partitions the predicted probailities into $J$ bins and then carries out a Pearson $X^2$ type test to assess the goodness of fit (ELMR 2.6).
In the model output, the residual deviance, denoted $D_L$, is the devience of the current model and the null deviance, denoted $D_S$, is the deviance of the model with just an intercept term. Assuming the null model, the test statistic $D_S - D_L$ is asymptotically distributed $\chi_{\ell - s}^2$. In our case, the test statistic is

lmod$null.deviance - lmod$deviance

## [1] 32.19451

giving p-value

pchisq(lmod$null.deviance - lmod$deviance, 2, lower.tail = FALSE)

## [1] 1.02106e-07

Therefore our model gives a significantly better fit than the null (intercept-only) model.

We can also test the significance of individual predictor using analysis of deviance (anova function). For example, is height necessary in the model?

# fit a model without height
lmodc <- glm(chd ~ cigs, family = binomial, wcgs)
anova(lmodc, lmod, test = "Chi")

## Analysis of Deviance Table
## 
## Model 1: chd ~ cigs
## Model 2: chd ~ height + cigs
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3152       1750                     
## 2      3151       1749  1  0.92025   0.3374

Similar to linear regression, the convenience function drop1 tests each individual predictor in one shot.

drop1(lmod, test = "Chi")

## Single term deletions
## 
## Model:
## chd ~ height + cigs
##        Df Deviance    AIC     LRT Pr(>Chi)    
## <none>      1749.0 1755.0                     
## height  1   1750.0 1754.0  0.9202   0.3374    
## cigs    1   1780.1 1784.1 31.0695 2.49e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient test from summary is based on the z-value $\hat \beta_j / \text{se}(\hat{\beta}_j)$.

lmod_sm$coefficients

##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -4.50161397 1.841862743 -2.444055 1.452321e-02
## height       0.02520779 0.026327385  0.957474 3.383281e-01
## cigs         0.02312740 0.004040183  5.724344 1.038339e-08

In general, deviance-based test is preferred over the z-test.

Confidence intervals

Confidence interval can be constructed either from normal approximation \[ \hat \beta_j \pm z^{\alpha / 2} \text{se}(\hat \beta_j) \]

tibble(
  `coef`  = beta_hat,
  `2.5%`  = beta_hat - 1.96 * lmod_sm$coefficients[, 2],
  `97.5%` = beta_hat + 1.96 * lmod_sm$coefficients[, 2])

## # A tibble: 3 × 3
##      coef  `2.5%` `97.5%`
##     <dbl>   <dbl>   <dbl>
## 1 -4.50   -8.11   -0.892 
## 2  0.0252 -0.0264  0.0768
## 3  0.0231  0.0152  0.0310

or from profile-likelihood

confint(lmod)

## Waiting for profiling to be done...

##                   2.5 %      97.5 %
## (Intercept) -8.13475465 -0.91297018
## height      -0.02619902  0.07702835
## cigs         0.01514949  0.03100534

Diagnostics

There are two kinds of fitted values (or predicted values). The first is on the scale of the linear predictor, $\eta$,

linpred  <- predict(lmod)
linpred %>% head(10)

##         1         2         3         4         5         6         7         8 
## -2.083261 -2.274521 -2.762277 -2.324936 -2.274521 -2.686653 -2.686653 -2.711861 
##         9        10 
## -2.108468 -2.737069

The second on the scale of response, $p = \text{logit}^{-1}(\eta)$,

predprob <- predict(lmod, type = "response")
predprob %>% head(10)

##          1          2          3          4          5          6          7 
## 0.11073449 0.09325523 0.05939705 0.08907868 0.09325523 0.06376553 0.06376553 
##          8          9         10 
## 0.06227708 0.10827647 0.06082112

We compute the raw residuals \[ y - \widehat{p} \]

# same as residuals(lmod, type = "response")
rawres <- lmod$y - predprob

The plot of raw residuals against the fitted values is not very informative.

wcgs %>%
  mutate(rawres = rawres, linpred = linpred) %>%
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = rawres)) +
  labs(x = "Linear predictor", y = "Raw residuals")

We do not expect the raw residuals to have equal variance because the binary variance is $p(1 - p)$.

The deviance residuals are standardized residuals defined by \[ d_i = \text{sign}(y_i - \widehat{p}_i) \sqrt{-2 [y_i \log\widehat{p}_i + (1 - y_i) \log(1 - \widehat{p}_i)]}. \] Note \[ \sum_i d_i^2 = \text{deviance} \] in analogy to $\sum_i \widehat{\epsilon}_i^2 = \text{RSS}$ in linear regression. The term $\text{sign}(y_i - \widehat{p}_i)$ ensures that $d_i$ has the same sign as raw residual $y_i - \widehat{p}_i$.

devres <- residuals(lmod)
devres %>% head(10)

##          1          2          3          4          5          6          7 
## -0.4844779 -0.4424800 -0.3499548 -0.4319693  2.1782631 -0.3630133 -0.3630133 
##          8          9         10 
## -0.3586106 -0.4787466 -0.3542579

Sanity check:

sqrt(-2 * (lmod$y * log(predprob) + (1 - lmod$y) * log(1 - predprob))) %>%
  head(10)

##         1         2         3         4         5         6         7         8 
## 0.4844779 0.4424800 0.3499548 0.4319693 2.1782631 0.3630133 0.3630133 0.3586106 
##         9        10 
## 0.4787466 0.3542579

The plot of deviance residuals against the fitted values.

wcgs %>%
  mutate(devres = devres, linpred = linpred) %>%
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residuals")

Again we see the residuals are clustered into two lines: the upper one corresponding to $y_i=1$ and the lower one to $y_i=0$. We can improve this plot by binning: divide the range of linear predictor into 100 bins of roughly equal points and plot average residual against average linear predictors per bin.

wcgs %>%
  mutate(devres = devres, linpred = linpred) %>% 
  group_by(cut(linpred, breaks = unique(quantile(linpred, (1:100)/101)))) %>%
  summarize(devres = mean(devres), 
            linpred = mean(linpred)) %>%
  ggplot() +
  geom_point(mapping = aes(x = linpred, y = devres)) + 
  labs(x = "Linear predictor", y = "Binned deviance residual")

Question: is there a concern that the deviance residuals are not centered around 0?

QQ plot is not helpful since there is no reason these deviance residuals are approximately standard normal.

qqnorm(devres)

Half-normal plot (hat values against half-normal quantiles) can help detect unusual cases in predictor space (high leverage cases). For logistic regression, we use the generalized hat matrix \[ \mathbf{H} = \widehat{\mathbf{W}}^{1/2} \mathbf{X}^T (\mathbf{X}^T \widehat{\mathbf{W}} \mathbf{X})^{-1} \mathbf{X} \widehat{\mathbf{W}}^{1/2}, \] where \[ \widehat{\mathbf{W}} = \begin{pmatrix} \widehat{p}_1 (1 - \widehat{p}_1) & \\ & \ddots & \\ & & \widehat{p}_n (1 - \widehat{p}_n) \end{pmatrix}. \]

halfnorm(hatvalues(lmod))

We see two high leverage cases, who smoke an unusual number of cigarettes per day!

wcgs %>%
  slice(c(2527, 2695)) %>%
  print(width = Inf)

## # A tibble: 2 × 13
##     age height weight   sdp   dbp  chol behave  cigs dibep chd   typechd timechd
##   <int>  <int>  <int> <int> <int> <int> <fct>  <int> <fct> <fct> <fct>     <int>
## 1    52     71    168   120    80   251 A1        99 B     no    none       2956
## 2    47     64    158   116    76   206 A1        80 B     no    none       2114
##   arcus  
##   <fct>  
## 1 present
## 2 absent

Plot of Cook distance against half-normal quantiles may reveal high influential points (high residual combined with high leverage) cases.

halfnorm(cooks.distance(lmod))

wcgs %>%
  slice(c(953, 2082)) %>%
  print(width = Inf)

## # A tibble: 2 × 13
##     age height weight   sdp   dbp  chol behave  cigs dibep chd   typechd timechd
##   <int>  <int>  <int> <int> <int> <int> <fct>  <int> <fct> <fct> <fct>     <int>
## 1    57     63    155   128    88   196 A2         0 B     yes   silent     2349
## 2    49     77    210   138    86   235 B3         0 A     yes   angina     3048
##   arcus 
##   <fct> 
## 1 absent
## 2 absent

Goodness of fit

Hosmer-Lemeshow statistic

Intuitively if we divide observations into $J$ bins according to linear predictors $\eta$, then $y_j / n_j$ (observed proportion of “successes”) for $j$-th bin should be close to the average predicted probabilities in that bin.

wcgs_binned <- wcgs %>%
  mutate(predprob = predict(lmod, type = "response"), 
         linpred  = predict(lmod, type = "link"),
         bin      = cut(linpred, breaks = unique(quantile(linpred, (1:100) / 101)))) %>%
  group_by(bin) %>%
  summarize(y       = sum(ifelse(chd == "yes", 1, 0)), 
            avgpred = mean(predprob), 
            count   = n()) %>%
  mutate(se_fit = sqrt(avgpred * (1 - avgpred) / count))

wcgs_binned %>%
  ggplot(mapping = aes(x = avgpred, y = y / count)) + 
  geom_point() +
  geom_linerange(mapping = aes(ymin = y / count - 2 * se_fit,
                               ymax = y / count + 2 * se_fit), alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1) +
  labs(x = "Predicted probability", y = "Observed proportion")

The Hosmer-Lemeshow test formalizes this idea by testing the statistic \[ X_{\text{HL}}^2 = \sum_{j=1}^J \frac{(y_i - m_j \widehat{p}_j)^2}{m_j \widehat{p}_j (1 - \widehat{p}_j)} \] against the $\chi^2$ distribution with $J-1$ degrees of freedom.

# Hosmer-Lemeshow test statistic
(hlstat <- with(wcgs_binned, sum((y - count * avgpred)^2 / (count * avgpred * (1 - avgpred)))))

## [1] 64.87001

# J
nrow(wcgs_binned)

## [1] 52

# p-value
pchisq(hlstat, nrow(wcgs_binned) - 1, lower.tail = FALSE)

## [1] 0.09172918

We see a moderate p-value, which indicates no lack of fit.

ROC curve

Logistic regression is often used as a tool for classification.
If we choose a threshold, say 0.2, then the predicted probabilities give a classification rule \[ \text{case $i$ is a} \begin{cases} \text{"success"} & \text{if } \widehat{p}_i \ge 0.2 \\ \text{"failure"} & \text{if } \widehat{p}_i < 0.2 \end{cases}. \]

wcgs %>% 
  mutate(predprob = predict(lmod, type = "response")) %>% 
  mutate(predout = ifelse(predprob >= 0.2, "yes", "no")) %>%
  xtabs(~ chd + predout, data = .)

##      predout
## chd     no  yes
##   no  2886   11
##   yes  254    3

With this classification rule, we see the error rate is about

(11 + 254) / (2886 + 254 + 11 + 3)

## [1] 0.08402029

The sensitivity is \[ \frac{\text{TP}}{\text{TP + FN}} = \frac{3}{257} = 1.17\% \] and the specificity is \[ \frac{\text{TN}}{\text{FP + TN}} = \frac{2886}{11 + 2886} = 99.62\% \]

If we lower the threshold, then we increase the sensitivity but decrease the specificity. If we plot sensitivity against 1-specificity by varying the threshold, then we get the receiver operating characteristic (ROC) curve.

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

lmod_roc <- roc(chd ~ predprob, wcgs)

## Setting levels: control = no, case = yes

## Setting direction: controls < cases

ggroc(lmod_roc)

The area under the curve (AUC) is a measure of the overal classification of the classifier. Larger AUC means better classification performance.

auc(lmod_roc)

## Area under the curve: 0.6007

Model selection by AIC

We can perform sequential search using the Akaike information criterion (AIC) \[ \text{AIC} = \text{deviance} + 2q. \] We start from a rich enough model

wcgs <- wcgs %>%
  # 1 in = 0.0254 m, 1 lb = 0.4536 kg
  mutate(bmi = 703 * weight / height)
biglm <- glm(chd ~ age + height + weight + bmi + sdp + dbp + chol + dibep + cigs + arcus,
             family = binomial, data = wcgs)

and then do sequential backward search using the step function

step(biglm, trace = TRUE, direction = "both") %>%
  tbl_regression() %>%
  bold_labels()

## Start:  AIC=1591.05
## chd ~ age + height + weight + bmi + sdp + dbp + chol + dibep + 
##     cigs + arcus
## 
##          Df Deviance    AIC
## - dbp     1   1569.1 1589.1
## - weight  1   1569.3 1589.3
## - bmi     1   1569.5 1589.5
## - height  1   1569.5 1589.5
## <none>        1569.0 1591.0
## - arcus   1   1571.1 1591.1
## - sdp     1   1576.8 1596.8
## - dibep   1   1590.4 1610.4
## - cigs    1   1592.0 1612.0
## - age     1   1593.7 1613.7
## - chol    1   1619.8 1639.8
## 
## Step:  AIC=1589.06
## chd ~ age + height + weight + bmi + sdp + chol + dibep + cigs + 
##     arcus
## 
##          Df Deviance    AIC
## - weight  1   1569.4 1587.4
## - bmi     1   1569.5 1587.5
## - height  1   1569.5 1587.5
## <none>        1569.1 1589.1
## - arcus   1   1571.2 1589.2
## + dbp     1   1569.0 1591.0
## - sdp     1   1586.2 1604.2
## - dibep   1   1590.4 1608.4
## - cigs    1   1592.5 1610.5
## - age     1   1593.7 1611.7
## - chol    1   1619.9 1637.9
## 
## Step:  AIC=1587.36
## chd ~ age + height + bmi + sdp + chol + dibep + cigs + arcus
## 
##          Df Deviance    AIC
## - height  1   1570.3 1586.3
## <none>        1569.4 1587.4
## - arcus   1   1571.5 1587.5
## + weight  1   1569.1 1589.1
## + dbp     1   1569.3 1589.3
## - bmi     1   1574.4 1590.4
## - sdp     1   1586.7 1602.7
## - dibep   1   1590.7 1606.7
## - cigs    1   1592.8 1608.8
## - age     1   1594.0 1610.0
## - chol    1   1620.0 1636.0
## 
## Step:  AIC=1586.32
## chd ~ age + bmi + sdp + chol + dibep + cigs + arcus
## 
##          Df Deviance    AIC
## <none>        1570.3 1586.3
## - arcus   1   1572.6 1586.6
## + height  1   1569.4 1587.4
## + weight  1   1569.5 1587.5
## + dbp     1   1570.3 1588.3
## - bmi     1   1577.3 1591.3
## - sdp     1   1587.2 1601.2
## - dibep   1   1591.9 1605.9
## - age     1   1594.2 1608.2
## - cigs    1   1594.6 1608.6
## - chol    1   1620.0 1634.0

Characteristic	log(OR)¹	95% CI¹	p-value
age	0.06	0.04, 0.08	<0.001
bmi	0.00	0.00, 0.00	0.008
sdp	0.02	0.01, 0.03	<0.001
chol	0.01	0.01, 0.01	<0.001
dibep
A	—	—
B	0.66	0.38, 1.0	<0.001
cigs	0.02	0.01, 0.03	<0.001
arcus
absent	—	—
present	0.22	-0.06, 0.50	0.13
¹ OR = Odds Ratio, CI = Confidence Interval

Model selection by lasso

A modern approach for model selection is the lasso, which minimizes the function
\[ - n^{-1} \ell(\boldsymbol{\beta}) + \lambda \sum_{j=1}^q |\beta_j|, \]
where $\ell(\boldsymbol{\beta})$ is the log-likelihood of logistic regression and $\lambda>0$ is a tuning parameter. We notice that
- when $\lambda = \infty$, all non-intercept regression coefficients will be pushed to 0, and
- when $\lambda = 0$, the regression coefficients are same as those from regular logistic regression.
  If we vary $\lambda$ from 0 to larger values, we will obtain intermediate models with lesser and lesser preditors. This way we are achieving continuous model selection.
For details of glmnet package, see the vignette at https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
How do we choose $\lambda$, which determines the model size? One natural idea is to split the data into a training set and a validation set. The training set is used to fit the logistic regression at different $\lambda$ values. Then the validation set is used to evaluate and compare the performance of different models. We will choose the model that gives the best performance on the validation set.
First let’s remove cases with missing values

(wcgs <- wcgs %>% 
  select(-c(behave, typechd, timechd)) %>%
  drop_na())

## # A tibble: 3,140 × 11
##      age height weight   sdp   dbp  chol  cigs dibep chd   arcus     bmi
##    <int>  <int>  <int> <int> <int> <int> <int> <fct> <fct> <fct>   <dbl>
##  1    49     73    150   110    76   225    25 B     no    absent  1445.
##  2    42     70    160   154    84   177    20 B     no    present 1607.
##  3    42     69    160   110    78   181     0 A     no    absent  1630.
##  4    41     68    152   124    78   132    20 A     no    absent  1571.
##  5    59     70    150   144    86   255    20 A     yes   present 1506.
##  6    44     72    204   150    90   182     0 A     no    absent  1992.
##  7    44     72    164   130    84   155     0 A     no    absent  1601.
##  8    40     71    150   138    60   140     0 B     no    absent  1485.
##  9    43     72    190   146    76   149    25 A     no    absent  1855.
## 10    42     70    175   132    90   325     0 B     no    present 1758.
## # … with 3,130 more rows

split data into 80% training cases and 20% validation cases.

library(glmnet)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-2

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:faraway':
## 
##     melanoma

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

# set seed for reproducibility
set.seed(200)

# list = FALSE, request result to be in a matrix (of row position) not a list
training_samples <- wcgs$chd %>%
  createDataPartition(p = 0.8, list = FALSE)
# (train_data <- wcgs %>%
#   slice(training_samples[, 1]))
# (val_data   <- wcgs %>%
#   slice(-training_samples[, 1]))

The glmnet package takes a matrix (of predictors) and a vector (of responses) as input. We use model.matrix function to create them. glmnet will add intercept by default, so we drop intercept term when forming x matrix.

# X and y from original data
x_all <- model.matrix(
  chd ~ - 1 + age + height + weight + bmi + sdp + dbp + chol + dibep + cigs + arcus, 
  data = wcgs)
y_all <- ifelse(wcgs$chd == "yes", 1, 0)
# training X and y
x_train <- x_all[training_samples[, 1], ]
y_train <- y_all[training_samples[, 1]]
# validation X and y
x_val <- x_all[-training_samples[, 1], ]
y_val <- y_all[-training_samples[, 1]]

Fit lasso regression and plot solution path:

lasso_fit <- glmnet(x_train, y_train, family = "binomial", alpha = 1)
summary(lasso_fit)

##            Length Class     Mode     
## a0          49    -none-    numeric  
## beta       539    dgCMatrix S4       
## df          49    -none-    numeric  
## dim          2    -none-    numeric  
## lambda      49    -none-    numeric  
## dev.ratio   49    -none-    numeric  
## nulldev      1    -none-    numeric  
## npasses      1    -none-    numeric  
## jerr         1    -none-    numeric  
## offset       1    -none-    logical  
## classnames   2    -none-    character
## call         5    -none-    call     
## nobs         1    -none-    numeric

plot(lasso_fit, xvar = "lambda", label = TRUE)

Here
- $\alpha = 1$ corresponds to the lasso regression, in the family of elastic net penalties \[ (1 - \alpha) \|\boldsymbol{\beta}\|_2^2 / 2 + \alpha \|\boldsymbol{\beta}\|_1. \]
- Other choises for xvar are "lambda" for log lambda value, "norm" for the $\ell_1$-norm of the coefficients (default), and "dev" for the percentage of deviance explained.
Now we can evaluate the performance of the models (corresponding to different $\lambda$ values) on the validation set.

# predict validation case probabilities at different \lambda values and calculate test deviance
pred_val <- predict(lasso_fit, newx = x_val, type = "response", s = lasso_fit$lambda)
dev_val <- -2 * colSums(y_val * log(pred_val) + (1 - y_val) * log(1 - pred_val))
tibble(lambda = lasso_fit$lambda, dev_val = dev_val) %>%
  ggplot() + 
  geom_point(mapping = aes(x = lambda, y = dev_val)) + 
  scale_x_log10() +
  labs(y = "Binomial deviance on validation set", x = "Lambda")

From the graph, it seems that $\lambda = 0.005$ yields a model that performs best on the validation set.

Now the question is whether we should make decision on this single training-validation split? A common strategy is to use cross validation.

set.seed(200)
cv_lasso <- cv.glmnet(x_all, y_all, alpha = 1, family = "binomial", 
                      type.measure = "auc")
plot(cv_lasso)

The plot displays the cross-validation error (devience by default, we chose AUC here) according to the log of lambda. The left dashed vertical line indicates that the log of the optimal value of log lambda is approximately -5.2, which is the one that minimizes the average AUC. This lambda value will give the most accurate model. The exact value of lambda and corresponding model can be viewed as follow:

cv_lasso$lambda.min

## [1] 0.005696827

coef(cv_lasso, cv_lasso$lambda.min)

## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s1
## (Intercept)  -9.893281e+00
## age           4.954011e-02
## height        .           
## weight        5.334625e-03
## bmi           .           
## sdp           1.537576e-02
## dbp           .           
## chol          9.228292e-03
## dibepA       -5.162308e-01
## dibepB        2.725036e-14
## cigs          1.656021e-02
## arcuspresent  9.149817e-02

We see that this model differs from the best model chosen by AIC by replacing the predictor bmi by weight.