Formulas in R

reference R4DS

R uses formulas for most modelling functions.

A simple example: y ~ x translates to: \(y = \beta_0 + x\beta_1\).

Below is a summary for some popular uses

Use Example Translation
Simple example y ~ x \(y = \beta_0 + x\beta_1\)
Exclude intercept y ~ x - 1 or y ~ x + 0 \(y = x\beta\)
Add a continuous variable y ~ x1 + x2 \(y = \beta_0 + x_1 \beta_1 + x_2 \beta_2\)
Add a categorical variable y ~ x1 + sex \(y = \beta_0 + x_1 \beta_1 + \mbox{sex_male} \beta_2\)
Interactions (continuous and categorical) y ~ x1 * sex \(y = \beta_0 + x_1 \beta1 + \mbox{sex_male} \beta_2 + x_1 \mbox{sex_male} \beta_{12}\)
Interactions (continuous and continuous) y ~ x1 * x_2 \(y = \beta_0 + x_1 \beta1 + x_2 \beta_2 + x_1 x_2 \beta_{12}\)

You can specify transformations inside the model formula too (besides creating new transformed variables in your data frame).

For example, log(y) ~ sqrt(x1) + x2 is transformed to \(\log(y) = \beta_0 + \sqrt{x_1} \beta_1 + x_2 \beta_2\).

One thing to note is that if your transformation involves +, *, ^ or -, you will need to wrap it in I() so R doesn’t treat it like part of the model specification.

For example, y ~ x + I(x ^ 2) is translated to \(y = \beta_0 + x \beta_1 + x^2 \beta_2\).

And remember, R automatically drops redundant variables so x + x become x.

Q1 write out the translations for the formulas below:

  • y ~ x ^ 2 + x

  • y ~ x + x ^ 2 + x ^ 3

  • y ~ poly(x, 3)

  • y ~ (x1 + x2 + x3) ^2 + (x2 + x3) * (x4 + x5)

Model selection

In the class, we used stepwise regression. Now we visit the other two sequential methods:

  1. Forward selection:

    1. Determine a shreshold significance level for entering predictors into the model

    2. Begin with a model without any predictors, i.e., \(y = \beta_0 + \epsilon\)

    3. For each predictor not in the model, fit a model that adds that predictor to the working model. Record the t-statistic and p-value for the added predictor.

    4. Do any of the predictors considered in step 3 have a p-value less than the shreshold specified in step 1?

      • Yes: Add the predictor with the smallest p-value to the working model and return to step 3

      • No: Stop. The working model is the final model.

  2. Backwards elimination

    1. Determine a threshold significance level for removing predictors from the model.

    2. Begin with a model that includes all predictors.

    3. Examine the t-statistic and p-value of the partial regression coefficients associated with each predictor.

    4. Do any of the predictors have a p-value greater than the threshold specified in step 1?

      • Yes: Remove the predictor with the largest p-value from the working model and return to step 3.

      • No: Stop. The working model is the final model.

  3. Stepwise selection

    Combines forward selection and backwards elimination steps.

Now perform Model selection on the cheddar dataset from the faraway package.

Use taste as the response variable.

Q2a. perform forward model selection from the model taste ~ 1

library(faraway)
(small.lm <- lm(taste ~ 1, data = cheddar))
## 
## Call:
## lm(formula = taste ~ 1, data = cheddar)
## 
## Coefficients:
## (Intercept)  
##       24.53

Q2b perform backward selection

finish the code below

# start from the large model that include all
biglm <- lm(taste ~ (Acetic + H2S + Lactic) ^ 2, data = cheddar)

What is your final model?

What is the predicted taste value for Acetic = 4.5, H2S = 3.8 and Lactic = 1.00?

What is the prediction confidence interval? What causes the difference between them?