Acknowledgement

Dr. Hua Zhou’s slides

Introduction

So far we have learned regression modeling of logistic, binomial, and count responses. What about nonnegative real-valued responses, among others? The generalized linear model (GLM) is a generic framework that encompasses normal regression, binomial regression, multinomial regression, and others. There are two essential components of the GLM framework: a distribution for the response \(Y\) and a link function that relates the mean of \(Y\) to the covariates \(x\).
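In R, these two components are chosen together through the family argument of glm(). A minimal sketch (the data frame df with response y and covariate x is a hypothetical placeholder):

glm(y ~ x, family = binomial(link = "logit"), data = df) # binomial response, logit link
glm(y ~ x, family = poisson(link = "log"), data = df)    # count response, log link
glm(y ~ x, family = Gamma(link = "log"), data = df)      # nonnegative real response, log link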

Exponential family distribution

Definition

In GLM, the distribution of \(Y\) is from the exponential family of distributions of the form \[ f(y \mid \theta, \phi) = \exp \left[ \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right]. \] Here \(\theta\) is called the canonical parameter or natural parameter and represents the location, while \(\phi\) is the dispersion parameter and represents the scale. Note that the canonical parameter \(\theta\) is not necessarily the mean \(\mu\).

Examples

  1. Normal or Gaussian: \[ f(y \mid \theta, \phi) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left[ - \frac{(y - \mu)^2}{2\sigma^2} \right] = \exp \left[ \frac{y\mu - \mu^2/2}{\sigma^2} - \frac 12 \left( \frac{y^2}{\sigma^2} + \log(2\pi \sigma^2) \right) \right]. \] So we can write \[\begin{eqnarray*} \theta &=& \mu \\ \phi &=& \sigma^2 \\ a(\phi) &=& \phi \\ b(\theta) &=& \theta^2/2 \\ c(y, \phi) &=& -\frac 12 (y^2/\phi + \log(2\pi \phi)). \end{eqnarray*}\]

  2. Binomial: If we treat \(Y\) as a binomial proportion; that is \(nY \sim \text{Bin}(n, p)\), then the density is \[\begin{eqnarray*} & & f(y \mid \theta, \phi) = \binom{n}{ny} p^{ny} (1 -p)^{n(1-y)} \\ &=& \exp \left[ ny \log p + n(1 - y) \log (1 - p) + \log \binom{n}{ny} \right] \\ &=& \exp \left[ \frac{y \log \frac{p}{1 - p} + \log (1 - p)}{1/n} + \log \binom{n}{ny} \right]. \end{eqnarray*}\] So we see \[\begin{eqnarray*} \theta &=& \log \frac{p}{1 - p} \\ \phi &=& 1 \\ a(\phi) &=& \frac{1}{n} \\ b(\theta) &=& - \log (1 - p) = \log (1 + \exp \theta) \\ c(y, \phi) &=& \log \binom{n}{ny}. \end{eqnarray*}\]

  3. Poisson: \[ f(y \mid \theta, \phi) = e^{-\mu} \frac{\mu^y}{y!} = \exp (y \log \mu - \mu - \log y!). \] So we have \[\begin{eqnarray*} \theta &=& \log \mu \\ \phi &=& 1 \\ a(\phi) &=& 1 \\ b(\theta) &=& \exp \theta \\ c(y, \phi) &=& - \log y!. \end{eqnarray*}\]

  4. Gamma has density \[ f(y \mid \nu, \lambda) = \frac{1}{\Gamma(\nu)} \lambda^{\nu} y^{\nu - 1} e^{-\lambda y}, \quad y > 0, \] where \(\nu\) is the shape parameter and \(\lambda\) is the rate parameter. For the purpose of GLM, it’s convenient to reparameterize by \(\lambda = \nu / \mu\) to get \[ f(y) = \frac{1}{\Gamma(\nu)} \left( \frac{\nu}{\mu} \right)^{\nu} y^{\nu - 1} e^{-y\nu / \mu} = \exp \left\{ \frac{- y \mu^{-1} - \log \mu}{\nu^{-1}} + (\nu-1) \log y + \nu \log \nu - \log \Gamma(\nu) \right\}. \] Now \(\mathbb{E}Y = \mu\) and \(\operatorname{Var}(Y) = \mu^2 / \nu = (\mathbb{E} Y)^2 / \nu\). So we have \[\begin{eqnarray*} \theta &=& - \mu^{-1} \\ \phi &=& \nu^{-1} \\ a(\phi) &=& \phi \\ b(\theta) &=& - \log (- \theta) \\ c(y, \phi) &=& (\phi^{-1} - 1) \log y - \phi^{-1} \log (\phi) - \log \Gamma(\phi^{-1}). \end{eqnarray*}\] Some books remove the minus sign in the canonical parameter/link, which is fine provided we take account of this in any derivations. For the canonical link \(\eta = \mu^{-1}\), the linear predictor must be positive, which could cause problems. Other possible link functions are the log link \(\eta = \log \mu\) and the identity link \(\eta = \mu\).

  5. Many other distributions.
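As a quick numerical sanity check of the Poisson and gamma forms derived in examples 3 and 4 (a sketch with arbitrary parameter values):

# check the exponential-family form of the Poisson density
yy <- 0:5; mu <- 2.5
all.equal(dpois(yy, mu), exp(yy * log(mu) - mu - lgamma(yy + 1)))
# check the mean parameterization of the gamma density
yy <- 1.3; nu <- 2; mu <- 4
all.equal(dgamma(yy, shape = nu, rate = nu / mu),
          exp((-yy / mu - log(mu)) * nu + (nu - 1) * log(yy) + nu * log(nu) - lgamma(nu)))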

Moments

Exponential family distributions have mean and variance \[\begin{eqnarray*} \mathbb{E}Y &=& \mu = b'(\theta) \\ \operatorname{Var}Y &=& \sigma^2 = b''(\theta) a(\phi). \end{eqnarray*}\] We will show this in Friday’s lab session. Thus the function \(b\) determines the moments of \(Y\).
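A sketch of the derivation (to be filled in during the lab): since \(\int f(y \mid \theta, \phi) \, dy = 1\) for all \(\theta\), differentiating twice under the integral sign yields the score identities \[\begin{eqnarray*} \mathbb{E} \left[ \frac{\partial \log f}{\partial \theta} \right] &=& \mathbb{E} \left[ \frac{Y - b'(\theta)}{a(\phi)} \right] = 0 \\ \mathbb{E} \left[ \frac{\partial^2 \log f}{\partial \theta^2} \right] + \mathbb{E} \left[ \left( \frac{\partial \log f}{\partial \theta} \right)^2 \right] &=& - \frac{b''(\theta)}{a(\phi)} + \frac{\operatorname{Var} Y}{a(\phi)^2} = 0, \end{eqnarray*}\] from which both results follow.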

Fitting a GLM

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(faraway)
bliss
##   dead alive conc
## 1    2    28    0
## 2    8    22    1
## 3   15    15    2
## 4   23     7    3
## 5   27     3    4

Let’s manually implement the first iteration of the Fisher scoring algorithm.

p <- bliss$dead / 30      # observed proportions
eta <- logit(p)           # initial linear predictor
z <- eta                  # working response (the y - mu term vanishes at the start)
w <- 30 * p * (1 - p)     # working weights for the binomial
lmod <- lm(z ~ conc, weights = w, data = bliss)
coef(lmod)
## (Intercept)        conc 
##   -2.302462    1.153587

The Fisher scoring algorithm converges quickly after a few iterations.

y <- bliss$dead
for (iter in 1:5) {
  eta <- lmod$fit                # current linear predictor
  p <- ilogit(eta)               # current fitted probabilities
  w <- 30 * p * (1 - p)          # working weights
  z <- eta + (y - 30 * p) / w    # working response
  lmod <- lm(z ~ conc, weights = w, data = bliss)
  cat(iter, coef(lmod), "\n")
}
## 1 -2.323672 1.161847 
## 2 -2.32379 1.161895 
## 3 -2.32379 1.161895 
## 4 -2.32379 1.161895 
## 5 -2.32379 1.161895

Compare to the glm fit.

glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss) %>%
  summary()
## 
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
## 
## Deviance Residuals: 
##       1        2        3        4        5  
## -0.4510   0.3597   0.0000   0.0643  -0.2045  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3238     0.4179  -5.561 2.69e-08 ***
## conc          1.1619     0.1814   6.405 1.51e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.76327  on 4  degrees of freedom
## Residual deviance:  0.37875  on 3  degrees of freedom
## AIC: 20.854
## 
## Number of Fisher Scoring iterations: 4
# standard errors from the weighted least squares fit at convergence
xm <- model.matrix(lmod)
wm <- diag(w)
sqrt(diag(solve(t(xm) %*% wm %*% xm)))
## (Intercept)        conc 
##   0.4178878   0.1814158
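As a sanity check (a sketch; we refit and store the model since the glm fit above was not assigned), the same standard errors can be read off the glm object:

# standard errors from the glm fit match the weighted least squares calculation
fit <- glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss)
sqrt(diag(vcov(fit)))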

Hypothesis testing

The deviance measures how far a fitted GLM is from the saturated model; smaller values indicate a better fit. The deviances for common GLM families are:

GLM                Deviance
Gaussian           \(\sum_i (y_i - \hat \mu_i)^2\)
Poisson            \(2\sum_i [y_i \log(y_i / \hat \mu_i) - (y_i - \hat \mu_i)]\)
Binomial           \(2 \sum_i [y_i \log(y_i / \hat \mu_i) + (n_i - y_i) \log((n_i - y_i)/(n_i - \hat \mu_i))]\)
Gamma              \(2 \sum_i [- \log(y_i / \hat \mu_i) + (y_i - \hat \mu_i) / \hat \mu_i]\)
Inverse Gaussian   \(\sum_i (y_i - \hat \mu_i)^2 / (\hat \mu_i^2 y_i)\)
bliss
##   dead alive conc
## 1    2    28    0
## 2    8    22    1
## 3   15    15    2
## 4   23     7    3
## 5   27     3    4
modl <- glm(cbind(dead, alive) ~ conc, family = binomial, data = bliss)
summary(modl)
## 
## Call:
## glm(formula = cbind(dead, alive) ~ conc, family = binomial, data = bliss)
## 
## Deviance Residuals: 
##       1        2        3        4        5  
## -0.4510   0.3597   0.0000   0.0643  -0.2045  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3238     0.4179  -5.561 2.69e-08 ***
## conc          1.1619     0.1814   6.405 1.51e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 64.76327  on 4  degrees of freedom
## Residual deviance:  0.37875  on 3  degrees of freedom
## AIC: 20.854
## 
## Number of Fisher Scoring iterations: 4
1 - pchisq(deviance(modl), df.residual(modl)) # goodness-of-fit test
## [1] 0.9445968

The large p-value indicates no evidence of lack of fit against the saturated model.
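As a check on the binomial deviance formula in the table above (a sketch; valid here because no observed counts are 0 or \(n_i\)):

# compute the binomial deviance by hand and compare to deviance(modl)
yy <- bliss$dead
nn <- bliss$dead + bliss$alive
muhat <- nn * fitted(modl) # fitted counts
2 * sum(yy * log(yy / muhat) + (nn - yy) * log((nn - yy) / (nn - muhat)))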

Diagnostics

Residuals

  • Pearson residual \[ r_P = \frac{y - \hat \mu}{\sqrt{V(\hat \mu)}}, \] where \(V\) is the variance function evaluated at the fitted mean.

  • Deviance residual \[ r_D = \text{sign}(y - \hat \mu) \sqrt{d_i}, \] where \(d_i\) is the contribution of case \(i\) to the deviance.

  • These are the different kinds of residuals from the binomial regression of the insecticide data bliss.

residuals(modl, type = "deviance") # deviance residuals
##           1           2           3           4           5 
## -0.45101510  0.35969607  0.00000000  0.06430235 -0.20449347
residuals(modl, type = "pearson") # Pearson residuals
##             1             2             3             4             5 
## -4.325234e-01  3.643729e-01 -3.648565e-15  6.414687e-02 -2.081068e-01
residuals(modl, type = "response") # response - fitted values
##             1             2             3             4             5 
## -2.250510e-02  2.834353e-02 -3.330669e-16  4.989802e-03 -1.082823e-02
bliss$dead / 30.0 - fitted(modl)
##             1             2             3             4             5 
## -2.250510e-02  2.834353e-02 -3.330669e-16  4.989802e-03 -1.082823e-02
residuals(modl, type = "working") # working response
##             1             2             3             4             5 
## -2.770876e-01  1.561410e-01 -1.332268e-15  2.748820e-02 -1.333195e-01
modl$residuals
##             1             2             3             4             5 
## -2.770876e-01  1.561410e-01 -1.332268e-15  2.748820e-02 -1.333195e-01
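As a sketch, the Pearson residuals can be reproduced by hand from the fitted proportions:

# Pearson residuals by hand: (y - n * phat) / sqrt(n * phat * (1 - phat)) with n = 30
phat <- fitted(modl)
(bliss$dead - 30 * phat) / sqrt(30 * phat * (1 - phat))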

We mostly use the deviance residuals for diagnostics.

Leverage and influence

  • For GLM, the hat matrix is \[ \mathbf{H} = \mathbf{W}^{1/2} \mathbf{X} (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{1/2}, \] where \(\mathbf{W}\) is the weight matrix at the fitted model. Diagonal elements of \(\mathbf{H}\) are the leverages \(h_i\). A larger leverage indicates that the fit may be sensitive to the response at case \(i\): its predictor values are unusual in some way.
influence(modl)$hat
##         1         2         3         4         5 
## 0.4255049 0.4133068 0.3223765 0.4133068 0.4255049
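As a sketch, the leverages can be reproduced from the hat matrix formula using the working weights at the fit:

# H = W^{1/2} X (X' W X)^{-1} X' W^{1/2}; its diagonal gives the leverages
X <- model.matrix(modl)
W <- diag(weights(modl, type = "working"))
H <- sqrt(W) %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% sqrt(W)
diag(H)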
  • The studentized residuals are \[ r_{SD} = \frac{r_D}{\sqrt{\hat \phi (1 - h_i)}}. \]
rstudent(modl)
##           1           2           3           4           5 
## -0.58478586  0.47213544  0.00000000  0.08386629 -0.27183519
  • Leverage only measures the potential to affect the fit, whereas measures of influence (the change in coefficients when a case is omitted) more directly assess the effect of each case on the fit.
influence(modl)$coef
##    (Intercept)         conc
## 1 -0.214001479  0.080663550
## 2  0.155671882 -0.047087302
## 3  0.000000000  0.000000000
## 4 -0.005841678  0.008417729
## 5  0.049263918 -0.036573429
  • Alternatively we can examine the Cook statistics \[ D_i = \frac{(\widehat{\boldsymbol{\beta}}_{(i)} - \widehat{\boldsymbol{\beta}})^T (\mathbf{X}^T \mathbf{W} \mathbf{X}) (\widehat{\boldsymbol{\beta}}_{(i)} - \widehat{\boldsymbol{\beta}})}{p \widehat \phi}. \]
cooks.distance(modl)
##            1            2            3            4            5 
## 1.205927e-01 7.970999e-02 4.673053e-30 2.470424e-03 2.791738e-02

Residual plots

  • We revisit the Poisson regression example on the Galápagos data.
gala <- as_tibble(gala, rownames = "Island") %>%
  print(n = Inf)
## # A tibble: 30 × 8
##    Island       Species Endemics    Area Elevation Nearest Scruz Adjacent
##    <chr>          <dbl>    <dbl>   <dbl>     <dbl>   <dbl> <dbl>    <dbl>
##  1 Baltra            58       23   25.1        346     0.6   0.6     1.84
##  2 Bartolome         31       21    1.24       109     0.6  26.3   572.  
##  3 Caldwell           3        3    0.21       114     2.8  58.7     0.78
##  4 Champion          25        9    0.1         46     1.9  47.4     0.18
##  5 Coamano            2        1    0.05        77     1.9   1.9   904.  
##  6 Daphne.Major      18       11    0.34       119     8     8       1.84
##  7 Daphne.Minor      24        0    0.08        93     6    12       0.34
##  8 Darwin            10        7    2.33       168    34.1 290.      2.85
##  9 Eden               8        4    0.03        71     0.4   0.4    18.0 
## 10 Enderby            2        2    0.18       112     2.6  50.2     0.1 
## 11 Espanola          97       26   58.3        198     1.1  88.3     0.57
## 12 Fernandina        93       35  634.        1494     4.3  95.3  4669.  
## 13 Gardner1          58       17    0.57        49     1.1  93.1    58.3 
## 14 Gardner2           5        4    0.78       227     4.6  62.2     0.21
## 15 Genovesa          40       19   17.4         76    47.4  92.2   129.  
## 16 Isabela          347       89 4669.        1707     0.7  28.1   634.  
## 17 Marchena          51       23  129.         343    29.1  85.9    59.6 
## 18 Onslow             2        2    0.01        25     3.3  45.9     0.1 
## 19 Pinta            104       37   59.6        777    29.1 120.    129.  
## 20 Pinzon           108       33   18.0        458    10.7  10.7     0.03
## 21 Las.Plazas        12        9    0.23        94     0.5   0.6    25.1 
## 22 Rabida            70       30    4.89       367     4.4  24.4   572.  
## 23 SanCristobal     280       65  552.         716    45.2  66.6     0.57
## 24 SanSalvador      237       81  572.         906     0.2  19.8     4.89
## 25 SantaCruz        444       95  904.         864     0.6   0       0.52
## 26 SantaFe           62       28   24.1        259    16.5  16.5     0.52
## 27 SantaMaria       285       73  171.         640     2.6  49.2     0.1 
## 28 Seymour           44       16    1.84       147     0.6   9.6    25.1 
## 29 Tortuga           16        8    1.24       186     6.8  50.9    18.0 
## 30 Wolf              21       12    2.85       253    34.1 255.      2.33
modp <- glm(Species ~ . - Endemics - Island, family = poisson, data = gala)
summary(modp)
## 
## Call:
## glm(formula = Species ~ . - Endemics - Island, family = poisson, 
##     data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.2752  -4.4966  -0.9443   1.9168  10.1849  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
## Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
## Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
## Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
## Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
## Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5
  • For GLM, it’s better to plot the deviance residuals against the linear predictors \(\widehat{\eta}\) rather than the predicted responses. We expect to see a constant variance of the deviance residuals.
gala %>%
  mutate(devres  = residuals(modp, type = "deviance"),
         linpred = predict(modp, type = "link")) %>%
  ggplot() + 
  geom_point(mapping = aes(x = linpred, y = devres)) + 
  labs(x = expression(hat(eta)), y = "Deviance residual")

If we plot the response residuals \(y_i - \widehat{\mu}_i\) against the linear predictor \(\widehat{\eta}\), we see that the variance increases with \(\widehat{\mu}\), which is consistent with the Poisson model.

gala %>%
  mutate(resres  = residuals(modp, type = "response"),
         linpred = predict(modp, type = "link")) %>%
  ggplot() + 
  geom_point(mapping = aes(x = linpred, y = resres)) + 
  labs(x = expression(hat(eta)), y = "Response residual")

Check linearity of predictors

  • Let’s plot the response Species against the predictor Area. We see that the majority of islands have a small area, with a few exceptions.
gala %>%
  ggplot() + 
  geom_point(mapping = aes(x = Area, y = Species))

  • Log transform of Area reveals a curvilinear relationship.
gala %>%
  ggplot() + 
  geom_point(mapping = aes(x = log(Area), y = Species))

  • Poisson regression uses a log link, which needs to be taken into account. Plotting the linearized (working) response \[ z = \eta + (y - \mu) \frac{d\eta}{d \mu} \] against log(Area) shows a linear relationship. For the log link, \(d\eta/d\mu = 1/\mu\), which is why the code below computes eta + res / mu.
gala %>%
  mutate(eta = predict(modp, type = "link"),
         mu  = predict(modp, type = "response"),
         res = residuals(modp, type = "response")) %>%
  ggplot() + 
  geom_point(mapping = aes(x = log(Area), y = eta + res / mu)) + 
  labs(x = "log(Area)", y = "Linearized response")

  • By the same reasoning, we fit a model with log transformations of all predictors. We see a substantial reduction in deviance.
modpl <- glm(Species ~ log(Area) + log(Elevation) + log(Nearest) + log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala)
summary(modpl)
## 
## Call:
## glm(formula = Species ~ log(Area) + log(Elevation) + log(Nearest) + 
##     log(Scruz + 0.5) + log(Adjacent), family = poisson, data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4176  -2.7323  -0.4655   2.5544   8.3385  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.331143   0.286899  11.611  < 2e-16 ***
## log(Area)         0.350684   0.018004  19.479  < 2e-16 ***
## log(Elevation)    0.032465   0.057047   0.569  0.56929    
## log(Nearest)     -0.039966   0.014126  -2.829  0.00467 ** 
## log(Scruz + 0.5) -0.037242   0.013783  -2.702  0.00689 ** 
## log(Adjacent)    -0.089481   0.006945 -12.885  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.7  on 29  degrees of freedom
## Residual deviance:  360.0  on 24  degrees of freedom
## AIC: 532.83
## 
## Number of Fisher Scoring iterations: 5

Half normal plot

  • A Q-Q plot of the residuals is the standard way to check the normality assumption on the errors. For GLM, it’s better to use a half-normal plot that compares the sorted absolute residuals against the quantiles of the half-normal distribution \[ \Phi^{-1} \left( \frac{n+i}{2n + 1} \right), \quad i=1,\ldots,n. \] The residuals are not expected to be normally distributed, so we are not looking for an approximate straight line; we only seek outliers, which appear as points off the trend. A half-normal plot is better for this purpose because, in a sense, the resolution of the plot is doubled by having all the points in one tail.
halfnorm(rstudent(modpl))
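The same plot can be built directly from the half-normal quantiles (a sketch reproducing what faraway::halfnorm does):

# sorted absolute residuals against half-normal quantiles
r <- rstudent(modpl)
n <- length(r)
plot(qnorm((n + 1:n) / (2 * n + 1)), sort(abs(r)),
     xlab = "Half-normal quantiles", ylab = "Sorted |residuals|")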

gali <- influence(modpl)
halfnorm(gali$hat)

Sandwich estimation

When the mean structure is correct but the variance is misspecified (e.g., overdispersion), the sandwich estimator still yields asymptotically valid standard errors.

library(sandwich)
modg <- glm(Species ~ . - Endemics - Island, family = poisson, data = gala) 
modg %>%
  vcovHC() %>%
  diag() %>%
  sqrt()
##  (Intercept)         Area    Elevation      Nearest        Scruz     Adjacent 
## 0.3185550946 0.0012176343 0.0011939774 0.0228021845 0.0065382200 0.0006212657
# estimate overdispersion parameter
(dp <- sum(residuals(modg, type = "pearson")^2) / modg$df.residual)
## [1] 31.74914
summary(modg, dispersion = dp)
## 
## Call:
## glm(formula = Species ~ . - Endemics - Island, family = poisson, 
##     data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.2752  -4.4966  -0.9443   1.9168  10.1849  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***
## Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***
## Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***
## Nearest      0.0088256  0.0102621   0.860    0.390    
## Scruz       -0.0057094  0.0035251  -1.620    0.105    
## Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 31.74914)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5
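For comparison, a quasi-Poisson fit estimates essentially the same dispersion parameter (a quick sketch):

# the quasi-Poisson dispersion estimate matches dp above
modq <- glm(Species ~ . - Endemics - Island, family = quasipoisson, data = gala)
summary(modq)$dispersion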

The sandwich standard errors are closer to those from the overdispersed Poisson model than to the naive Poisson standard errors.

Robust estimation

Robust estimation downweights outlying cases, limiting their influence on the fitted coefficients.

library(robust)
glmRob(Species ~ log(Area) + log(Adjacent), family = poisson, data = gala) %>%
  summary()