2  Linear Regression

Reading: Chapter 3 of our course textbook An Introduction to Statistical Learning

R-Codes for this Chapter

2.1 Simple Linear Regression

The linear regression model assumes a linear relationship between \(Y\) and the predictor(s) \(X\).

The simple (only one predictor) linear regression model: \[ Y\approx \beta_0 + \beta_1 X \]

For instance,
sales \(\approx \beta_0 + \beta_1\) TV

2.1.1 Estimating the Regression Coefficients

Let \[ (X_{1},Y_1),\dots,(X_{n},Y_n) \] denote a training data random sample. I.e.
\[ (X_{i},Y_i)\overset{iid}{\sim}(X,Y),\quad\text{for all}\quad i=1,\dots,n. \] Moreover, let \[ Y_i=\beta_0 + \beta_1 X_i +\epsilon_i \quad\text{for all}\quad i=1,\dots,n, \tag{2.1}\] where

  • \(\epsilon_i\) and \(X_i\) are independent of each other for all \(i=1,\dots,n\)
  • \(E(\epsilon_i)=0\) for all \(i=1,\dots,n\)
  • \(Var(\epsilon_i)=\sigma^2>0\) is constant for all \(i=1,\dots,n\)
Note

The assumption \(f(X) = \beta_0 + \beta_1 X\) may be a useful working model. However, despite what many textbooks might tell us, we seldom believe that the true (unknown) relationship is that simple.

For a given observed realization of the training data random sample \[ (x_1,y_1),\dots,(x_n,y_n) \] we choose \(\hat\beta_0\) and \(\hat\beta_1\) such that the Residual Sum of Squares criterion is minimized: \[ \begin{align*} \operatorname{RSS}\equiv \operatorname{RSS}(\hat{\beta}_0,\hat{\beta_1}) & = e_1^2 + \dots + e_n^2\\[2ex] &=\sum_{i=1}^n\left(y_i - \left(\hat\beta_0 + \hat\beta_1x_i\right)\right)^2\\[2ex] &=\sum_{i=1}^n\left(y_i - \hat{y}_i\right)^2 \end{align*} \] The minimizers are \[ \hat\beta_1=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2} \] and \[ \hat\beta_0=\bar{y} - \hat\beta_1\bar{x}, \] where \(\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i\) and \(\bar{x}=\frac{1}{n}\sum_{i=1}^nx_i\).
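
The minimizers can be checked numerically. The following is a minimal sketch with artificially generated data (all values below are only for illustration), computing \(\hat\beta_0\) and \(\hat\beta_1\) from the formulas above and comparing them with the output of lm():

## Computing the OLS minimizers "by hand" (illustrative data)
set.seed(1)
x <- runif(30, min = -2, max = 2)
y <- 1 + 2 * x + rnorm(30)
##
beta_1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0_hat <- mean(y) - beta_1_hat * mean(x)
##
c(beta_0_hat, beta_1_hat)  # manual computation
coef(lm(y ~ x))            # should match the lm() estimates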

2.1.2 Assessing the Accuracy of the Coefficient Estimates

If the assumption of Equation 2.1 is correct, i.e., if the data is actually generated according to the model \[ Y_i=\beta_0 + \beta_1 X_i +\epsilon_i, \] then ordinary least squares estimators \[ \hat\beta_0\quad\text{and}\quad\hat\beta_1 \] are unbiased estimators, that is \[ \begin{align*} \operatorname{Bias}(\hat\beta_0)&=E(\hat\beta_0)-\beta_0=0\\ \operatorname{Bias}(\hat\beta_1)&=E(\hat\beta_1)-\beta_1=0. \end{align*} \tag{2.2}\] I.e., on average, the estimation results equal the true (unknown) parameters.

Note: In Equation 2.2, we consider \(\hat\beta_0\) and \(\hat\beta_1\) as random variables from which we can compute mean values. The estimators \(\hat\beta_0\) and \(\hat\beta_1\) are indeed random variables since they depend on the random variables in the random sample \((X_i,Y_i)\overset{iid}{\sim}(X,Y)\):
\[ \hat\beta_1=\frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2} \] and \[ \hat\beta_0=\bar{Y} - \hat\beta_1\bar{X}. \]

In an actual data analysis, we only have one realization of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) computed from the given dataset (i.e. the observed realization of the random sample). Each single estimation result will have estimation errors, i.e., \[ \hat\beta_0\neq \beta_0\quad\text{and}\quad\hat\beta_1\neq \beta_1. \]

The following code generates artificial data to reproduce the plot in Figure 3.3 of our course textbook ISLR.

## ###############################
## A function to generate data 
## similar to that shown in Fig 3.3
## ##############################

## A Function to simulate data
mySimpleRegrDataGenerator <- function(){
  n      <- 50                           # sample size
  beta_0 <- 0.1                          # intercept parameter
  beta_1 <- 5                            # slope parameter
  X      <- runif(n, min = -2, max = 2)  # predictor
  error  <- rnorm(n, mean = 0, sd = 8.5) # error term
  Y      <- beta_0 + beta_1 * X + error  # outcome 
  ##
  return(data.frame("Y" = Y, "X" = X))
}

## Generate a first realization of the data
set.seed(123)
data_sim <- mySimpleRegrDataGenerator()
head(data_sim)
            Y          X
1 -18.4853427 -0.8496899
2  12.9872926  1.1532205
3  -0.4167901 -0.3640923
4  -1.9138159  1.5320696
5  19.5667725  1.7618691
6  -5.3639241 -1.8177740

Using repeated samples from the data generating process defined in mySimpleRegrDataGenerator(), we can generate multiple estimation results \(\hat\beta_0\) and \(\hat\beta_1\) of the unknown simple linear regression parameters \(\beta_0\) and \(\beta_1\) and plot the corresponding empirical regression lines:

## Estimation
lm_obj <- lm(Y ~ X, data = data_sim)

## Plotting the results

## True (usually unknown) parameter values
beta_0 <- 0.1  # intercept parameter
beta_1 <- 5    # slope parameter

par(mfrow=c(1,2)) # Two plots side by side

## First Plot (fit for the first realization of the data)
plot(x = data_sim$X, y = data_sim$Y, xlab = "X", ylab = "Y")
abline(a = beta_0, b = beta_1, col = "red")
abline(lm_obj, col = "blue")

## Second Plot (fits for multiple data realizations)
plot(x = data_sim$X, y = data_sim$Y, xlab = "X", ylab = "Y", type = "n") # type = "n": empty plot
##
for(r in 1:10){
  data_sim_new <- mySimpleRegrDataGenerator()
  lm_obj_new   <- lm(Y ~ X, data=data_sim_new)
  abline(lm_obj_new, col = "lightskyblue")
}
## Adding the first fit
abline(a = beta_0, b = beta_1, col = "red", lwd = 2)
abline(lm_obj, col = "blue", lwd = 2)

Coding Challenge (Nr 1):

Use the above mySimpleRegrDataGenerator() function to generate many, for instance \(B=10000\), realizations of the artificial data generating process, and compute from these the corresponding \(B\) realizations of \(\hat\beta_0\) and \(\hat\beta_1:\) \[ \hat\beta_{0,1},\hat\beta_{0,2},\dots,\hat\beta_{0,B} \] \[ \hat\beta_{1,1},\hat\beta_{1,2},\dots,\hat\beta_{1,B} \] The averages of these realizations can be used to approximate the true means of the estimators, i.e. \[ \begin{align*} E(\hat\beta_0)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{0,b}\\[2ex] E(\hat\beta_1)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{1,b} \end{align*} \] and thus also the biases \[ \begin{align*} \operatorname{Bias}(\hat\beta_0)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{0,b} -\beta_0\\ \operatorname{Bias}(\hat\beta_1)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{1,b}-\beta_1. \end{align*} \] The approximations become arbitrarily precise as \(B\to\infty\) due to the Law of Large Numbers.

Question: Are the estimators \(\hat\beta_0\) and \(\hat\beta_1\) unbiased for the artificial data generating process defined by our mySimpleRegrDataGenerator() function?
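
A minimal sketch of one possible approach to this coding challenge (reusing the mySimpleRegrDataGenerator() function from above):

## Monte Carlo approximation of the biases of the OLS estimators
B          <- 10000
beta_0_hat <- numeric(B)
beta_1_hat <- numeric(B)
set.seed(123)
for(b in 1:B){
  coefs_b       <- coef(lm(Y ~ X, data = mySimpleRegrDataGenerator()))
  beta_0_hat[b] <- coefs_b[1]
  beta_1_hat[b] <- coefs_b[2]
}
## Approximate biases (true values: beta_0 = 0.1 and beta_1 = 5)
mean(beta_0_hat) - 0.1
mean(beta_1_hat) - 5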

The magnitude of the estimation errors \[ \hat\beta_0-\beta_0\neq 0 \quad\text{and}\quad\hat\beta_1-\beta_1\neq 0 \] is expressed in units of Standard Errors: \[ \operatorname{SE}(\hat\beta_0)=\sqrt{Var(\hat\beta_0)}=\sqrt{\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right]} \] and \[ \operatorname{SE}(\hat\beta_1)=\sqrt{Var(\hat\beta_1)}=\sqrt{\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}}, \] where \[\begin{align*} \sigma^2 & =Var(\epsilon_i)\\ \Leftrightarrow\quad \sigma & =SD(\epsilon_i) = \sqrt{Var(\epsilon_i)} \end{align*}\] for all \(i=1,\dots,n.\)

Typically, the variance and thus also the Standard Deviation of the error term, \(\sigma = SD(\epsilon),\) are unknown, but we can estimate \(\sigma\) by the Residual Standard Error: \[\begin{align*} \hat{\sigma}=\operatorname{RSE} &=\sqrt{\frac{1}{n-2}\operatorname{RSS}}\\[2ex] &=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-\hat{y}_i)^2}\\[2ex] &=\sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(y_i-\left(\hat{\beta}_0+\hat{\beta}_1x_i\right)\right)^2} \end{align*}\] where \(\operatorname{RSS}=\sum_{i=1}^n(y_i-\hat{y}_i)^2\) are the residual sum of squares.

Note

We subtract \(2\) from the sample size \(n\) since \(n-2\) are the remaining degrees of freedom in the data after estimating two parameters \(\hat\beta_0\) and \(\hat\beta_1\).
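
As a quick numerical check, the \(\operatorname{RSE}\) can be computed directly from the residuals and compared with the value reported by summary(); the sketch below reuses the lm_obj fit from the simulation code above:

## Residual Standard Error computed "by hand"
e <- residuals(lm_obj)    # residuals e_i = y_i - y_i_hat
n <- length(e)            # sample size
sqrt(sum(e^2) / (n - 2))  # RSE = sqrt(RSS / (n - 2))
summary(lm_obj)$sigma     # should give the same value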

Confidence Intervals

Knowing the standard errors \[ \operatorname{SE}(\hat\beta_0)\quad\text{and}\quad\operatorname{SE}(\hat\beta_1) \] allows us to construct approximate 95% Confidence Intervals: \[ \begin{align*} \operatorname{CI}_{\beta_0} &=\left[\hat{\beta}_0-2\operatorname{SE}(\hat\beta_0),\; \hat{\beta}_0+2\operatorname{SE}(\hat\beta_0)\right] = \hat\beta_0\pm 2\operatorname{SE}(\hat\beta_0)\\[2ex] \operatorname{CI}_{\beta_1} &=\left[\hat{\beta}_1-2\operatorname{SE}(\hat\beta_1),\; \hat{\beta}_1+2\operatorname{SE}(\hat\beta_1)\right] = \hat\beta_1\pm 2\operatorname{SE}(\hat\beta_1) \end{align*} \]
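
A minimal sketch (again reusing the lm_obj fit from above): the approximate intervals based on \(\pm 2\) standard errors are close to, but not identical with, the intervals returned by confint(), which uses exact \(t\)-quantiles instead of the factor 2.

## Approximate 95% confidence intervals: estimate +/- 2 * SE
est <- coef(summary(lm_obj))[, "Estimate"]
se  <- coef(summary(lm_obj))[, "Std. Error"]
cbind("lower" = est - 2 * se, "upper" = est + 2 * se)
##
confint(lm_obj)  # intervals based on exact t-quantiles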

Interpretation

There is approximately a 95% chance (in resamplings from the data generating process) that the random confidence interval \(\operatorname{CI}_{\beta_1}\) contains the true (fixed) parameter value \(\beta_1\).

To understand the interpretation of confidence intervals, it is very instructive to look at visualizations:
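
For instance, the following sketch (reusing the artificial data generating process from above) draws repeated samples, computes the confidence interval for \(\beta_1\) in each sample, and plots all intervals together with the true value \(\beta_1=5;\) roughly 95% of the intervals should cover the true value:

## Visualizing the frequentist interpretation of confidence intervals
set.seed(321)
n_rep  <- 100
CI_mat <- matrix(NA, nrow = n_rep, ncol = 2)
for(r in 1:n_rep){
  CI_mat[r, ] <- confint(lm(Y ~ X, data = mySimpleRegrDataGenerator()))["X", ]
}
covers <- CI_mat[, 1] <= 5 & 5 <= CI_mat[, 2]
##
plot(x = range(CI_mat), y = c(1, n_rep), type = "n",
     xlab = expression(beta[1]), ylab = "Resample")
segments(x0 = CI_mat[, 1], x1 = CI_mat[, 2], y0 = 1:n_rep,
         col = ifelse(covers, "lightskyblue", "red"))
abline(v = 5, col = "blue", lwd = 2)  # true parameter value
##
mean(covers)  # empirical coverage rate (should be close to 0.95)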

Danger

Only the above frequentist point of view can be nicely interpreted. Interpreting a given, observed confidence interval is hard and often done wrong.

A given, observed confidence interval (computed from the observed realization of the training data) either contains the true parameter value or not and usually we do not know it.

Confidence Intervals for Statistical Hypothesis Testing

We can use a \((1-\alpha)\cdot 100\%\) confidence interval to do statistical hypothesis testing at the significance level \(0<\alpha<1.\) Typical significance levels:

  • \(\alpha=0.05\)
  • \(\alpha=0.01\)

Let us consider the following null-hypothesis \((H_0)\) that the true (usually unknown) value \(\beta_1\) equals the null-hypothetical value \(\beta^{(H_0)}_{1}\) versus the two-sided alternative hypothesis \((H_1)\) that the true (usually unknown) value \(\beta_1\) does not equal the null-hypothetical value \(\beta^{(H_0)}_{1}:\) \[ \begin{align*} H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\ H_1:&\;\beta_1\neq \beta^{(H_0)}_{1} \end{align*} \]

Classic No-Effect Null-Hypothesis

For the special case, where the null-hypothetical value \(\beta^{(H_0)}_{1}=0\) we test the classic no-effect null-hypothesis: \[ \begin{align*} H_0:&\;\text{There is no relationship between $Y$ and $X$; i.e. $\beta_1=0$}\\ H_1:&\;\text{There is a relationship between $Y$ and $X$; i.e. $\beta_1\neq 0$} \end{align*} \]

Testing-Procedure:

  • If the observed (obs) realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs},\) contains the null-hypothetical value \(\beta^{(H_0)}_{1},\) i.e. \[ \begin{align*} \beta^{(H_0)}_{1}&\in\operatorname{CI}_{\beta_1,obs}\\ \Leftrightarrow\beta^{(H_0)}_{1}&\in\left[\hat{\beta}_{1,obs}-2\operatorname{SE}_{obs}(\hat\beta_1),\;\hat{\beta}_{1,obs}+2\operatorname{SE}_{obs}(\hat\beta_1)\right], \end{align*} \] then we cannot reject the null hypothesis that \(\beta_1=\beta^{(H_0)}_{1}.\)

  • If, however, the observed (obs) realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs},\) does not contain the null-hypothetical value \(\beta^{(H_0)}_{1},\) i.e. \[ \begin{align*} \beta^{(H_0)}_{1}&\not\in\operatorname{CI}_{\beta_1,obs}\\ \Leftrightarrow\beta^{(H_0)}_{1}&\not\in\left[\hat{\beta}_{1,obs}-2\operatorname{SE}_{obs}(\hat\beta_1),\;\hat{\beta}_{1,obs}+2\operatorname{SE}_{obs}(\hat\beta_1)\right], \end{align*} \] then we can reject the null hypothesis and adopt the alternative that \(\beta_1\neq\beta^{(H_0)}_{1}.\)

Probability of a Type I Error is at Most \(\alpha\)

If the null-hypothesis is true, i.e. if the true (unknown) \(\beta_1\) equals the null-hypothetical value \(\beta^{(H_0)}_{1},\) \[ \beta_1 = \beta^{(H_0)}_{1} \] then the random confidence interval \(\operatorname{CI}_{\beta_1}\) covers the deterministic null-hypothetical value \(\beta^{(H_0)}_{1}\) with probability at least \(1-\alpha\) \[ P(\beta^{(H_0)}_{1} \in \operatorname{CI}_{\beta_1}| H_0\text{ is true})\geq 1-\alpha. \] I.e., in \(100\) resamples we expect to see at least \((1-\alpha)\cdot 100\) coverage events.

Probability of a Type I Error:
Thus, the probability of falsely rejecting the null hypothesis even though the null hypothesis is true (i.e. making a type I error) is smaller than or equal to the chosen significance level \(\alpha,\) since \[ \begin{align*} P(\beta^{(H_0)}_{1} \in \operatorname{CI}_{\beta_1}| H_0\text{ is true})&\geq 1-\alpha\\ \Leftrightarrow\quad 1-P(\beta^{(H_0)}_{1} \not\in \operatorname{CI}_{\beta_1}| H_0\text{ is true})&\geq 1-\alpha\\ \Leftrightarrow\quad \underbrace{P(\beta^{(H_0)}_{1} \not\in \operatorname{CI}_{\beta_1}| H_0\text{ is true})}_{\text{Probability of a Type I Error}}&\leq \alpha \end{align*} \] Note: A type I error (rejecting \(H_0\) even though \(H_0\) is true) is also called a false positive event. We want false positives to happen only (very) rarely and therefore choose a small significance level \(\alpha\) such as \(\alpha=0.05\) or \(\alpha=0.01.\)

Coding Challenge (Nr 2):
  1. Use the above mySimpleRegrDataGenerator() function to generate one data set. Use this data set to compute the observed realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs}\) for the (in principle unknown) parameter \(\beta_1.\) Use this observed realization of the confidence interval to test \[ \begin{align*} H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\ \text{versus}\quad H_1:&\;\beta_1\neq \beta^{(H_0)}_{1} \end{align*} \] with \(\beta^{(H_0)}_{1}=5;\) i.e. with a correct null-hypothesis, since \(\beta_1=5\) in mySimpleRegrDataGenerator().
    • Can you reject \(H_0\)? Is the test-decision in line with your expectation?
  2. Now, use the above mySimpleRegrDataGenerator() function to generate \(B=10,000\) many data sets. Compute for each of these data sets the observed realizations of the confidence interval, \[ \operatorname{CI}_{\beta_1,obs,b}\quad\text{for}\quad b=1,\dots,B. \]
    Use each of the observed confidence interval realizations to test \[ \begin{align*} H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\ \text{versus}\quad H_1:&\;\beta_1\neq \beta^{(H_0)}_{1} \end{align*} \] with \(\beta^{(H_0)}_{1}=5;\) i.e. with a correct null-hypothesis, since \(\beta_1=5\) in mySimpleRegrDataGenerator().
    • What is the relative frequency of \(H_0\)-rejections? Is this relative frequency in line with your expectation?
  3. Repeat 1. and 2., but with \[ \beta^{(H_0)}_{1}=0; \] i.e. with a false null-hypothesis (since \(\beta_1=5\) in mySimpleRegrDataGenerator()).
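
A compact sketch of one possible approach to part 2 (again reusing mySimpleRegrDataGenerator()): the test rejects \(H_0\) whenever the null-hypothetical value falls outside the observed confidence interval.

## Relative frequency of H0-rejections for the correct H0: beta_1 = 5
set.seed(42)
B         <- 10000
beta_1_H0 <- 5
reject    <- logical(B)
for(b in 1:B){
  CI_b      <- confint(lm(Y ~ X, data = mySimpleRegrDataGenerator()))["X", ]
  reject[b] <- beta_1_H0 < CI_b[1] | beta_1_H0 > CI_b[2]
}
mean(reject)  # should be close to (or below) the significance level alpha = 0.05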

Test Statistics for Statistical Hypothesis Testing

Standard errors can also be used to construct test statistics for statistical hypothesis testing. In the following, we look at the \(t\)-test statistic.

Choose a significance level \(0<\alpha<1\) such as, for instance,

  • \(\alpha=0.05\)
  • \(\alpha=0.01\)

Let us (again) consider the null-hypothesis \((H_0)\) that the true (usually unknown) value \(\beta_1\) equals the null-hypothetical value \(\beta^{(H_0)}_{1}\) versus the two-sided alternative hypothesis \((H_1)\) that the true (usually unknown) value \(\beta_1\) does not equal the null-hypothetical value \(\beta^{(H_0)}_{1}:\) \[ \begin{align*} H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\ H_1:&\;\beta_1\neq \beta^{(H_0)}_{1} \end{align*} \]

Under \(H_0,\) i.e. if the (unknown) true parameter \(\beta_1\) equals the null-hypothetical value, i.e. if \(\beta_1=\beta^{(H_0)}_{1},\) the random \(t\)-test statistic has a \(t\)-distribution with \((n-2)\) degrees of freedom, \[ t=\frac{\hat\beta_1 - \beta^{(H_0)}_{1}}{\operatorname{SE}(\hat\beta_1)}\overset{H_0}{\sim}t_{(n-2)}. \]

Tip

In the multiple linear regression model \(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\) (Section 2.2), the \(t\)-test statistic for the \(j\)th predictor has the following null-distribution: \[ t=\frac{\hat\beta_j - \beta^{(H_0)}_{j}}{\operatorname{SE}(\hat\beta_j)}\overset{H_0}{\sim}t_{(n-p-1)}. \]

\(p\)-value:
The \(p\)-value is the probability of seeing a realization of the random \(t\)-test statistic, \(t,\) that is at least as extreme as the observed value of the test statistic, \(t_{obs},\) \[\begin{align*} p_{obs} &=P_{H_0}\left(|t|\geq|t_{obs}|\right)\\[2ex] &=2\cdot\min\{P_{H_0}\left(t\geq t_{obs} \right),\; P_{H_0}\left(t\leq t_{obs} \right)\}. \end{align*}\] where \(P_{H_0}\) means that the probability is computed “under \(H_0\);” i.e. for the scenario that \(H_0\) is true \((\beta_1=\beta^{(H_0)}_{1}).\)
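
The observed \(t\)-test statistic and its \(p\)-value can be computed by hand and compared with the summary() output; the sketch below reuses the lm_obj fit from above and tests the no-effect null hypothesis \(\beta^{(H_0)}_{1}=0:\)

## t-test statistic and p-value "by hand" for H0: beta_1 = 0
sum_tab <- coef(summary(lm_obj))
t_obs   <- (sum_tab["X", "Estimate"] - 0) / sum_tab["X", "Std. Error"]
df      <- lm_obj$df.residual             # n - 2 in the simple model
p_obs   <- 2 * pt(-abs(t_obs), df = df)   # two-sided p-value
c(t_obs, p_obs)
sum_tab["X", c("t value", "Pr(>|t|)")]    # should match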

Testing-Procedure:

  • If the observed realization of the \(p\)-value is larger than or equal to the significance level \[ p_{obs}\geq \alpha, \] then we cannot reject the null hypothesis.

  • If, however, the observed realization of the \(p\)-value is strictly smaller than the significance level \[ p_{obs}<\alpha, \] then we can reject the null hypothesis and adopt the alternative hypothesis.

Note

It can be shown that the above statistical hypothesis test based on confidence intervals is equivalent to the statistical hypothesis test based on the \(t\)-test statistic.

In both cases, the probability of falsely rejecting the null hypothesis even though the null hypothesis is true, is smaller than or equal to the chosen significance level \(\alpha.\)

2.1.3 Assessing the Accuracy of the Model

An accurate model tends to have …

  • a low residual standard error \(\operatorname{RSE}\) \[ \operatorname{RSE}=\hat\sigma=\sqrt{\frac{\operatorname{RSS}}{n-2}} \]

  • a high \(R^2\)

\[ R^2=\frac{\operatorname{TSS}-\operatorname{RSS}}{\operatorname{TSS}}=1-\frac{\operatorname{RSS}}{\operatorname{TSS}}, \] where \(0\leq R^2\leq 1\) and \[ \begin{align*} \operatorname{TSS}&=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\\ \operatorname{RSS}&=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\ \hat{y}_i&=\hat\beta_0+\hat\beta_1x_i \end{align*} \]

TSS: “Total Sum of Squares”

RSS: “Residual Sum of Squares”
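
A minimal numerical check, reusing the lm_obj fit from above:

## R^2 computed from TSS and RSS
y     <- data_sim$Y
y_hat <- fitted(lm_obj)
TSS   <- sum((y - mean(y))^2)
RSS   <- sum((y - y_hat)^2)
1 - RSS / TSS              # R^2 "by hand"
summary(lm_obj)$r.squared  # should match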

Danger

Cautionary Note Nr 1: Do not forget that there is an irreducible error \(Var(\epsilon)=\sigma^2>0\). Thus

  • very low \(\operatorname{RSE}\) values, \(\operatorname{RSE}\approx 0\), and
  • very high \(R^2\) values, \(R^2\approx 1\),

can be warning signals indicating overfitting. While overfitting typically does not happen with a simple linear regression model, it can happen with a multiple linear regression model.

Cautionary Note Nr 2: The \(R^2\) and \(\operatorname{RSE}\) are only based on training data. In Chapter 1, we have seen that a proper assessment of the model accuracy needs to take into account test data.

\(R^2\) and correlation coefficient

In the case of the simple linear regression model, \(R^2\) equals the squared sample correlation coefficient between \(Y\) and \(X\), \[ R^2 = r_{yx}^2, \] where \[ r_{yx}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}. \]
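
A one-line numerical check of this identity, again reusing the simulated data from above:

## R^2 equals the squared sample correlation in the simple model
cor(data_sim$Y, data_sim$X)^2
summary(lm_obj)$r.squared  # same value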

Tip

In the multiple linear regression model \(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\) (Section 2.2), the \(R^2\) equals the squared correlation between response and the fitted values: \[ R^2=r^2_{y\hat{y}} \] with \[ r_{y\hat{y}}=\frac{\sum_{i=1}^n(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}\sqrt{\sum_{i=1}^n(\hat{y}_i-\bar{\hat{y}})^2}}. \]

2.2 Multiple Linear Regression

The multiple linear regression model allows for more than one predictor:
\[ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon \]

For instance,
sales \(= \beta_0 + \beta_1\) TV \(+\beta_2\) radio \(+\beta_3\) newspaper \(+\epsilon\)

2.2.1 Estimating the Regression Coefficients

Let \[ (X_{11},\dots,X_{1p},Y_1),\dots,(X_{n1},\dots,X_{np},Y_n) \] denote a training data random sample. I.e.
\[ (X_{i1},\dots,X_{ip},Y_i)\overset{iid}{\sim}(X_{1},\dots,X_{p},Y) \] for all \(i=1,\dots,n.\)

Moreover, let \[ Y_i=\beta_0+\beta_1X_{i1}+\dots+\beta_p X_{ip}+\epsilon_i \tag{2.3}\] for all \(i=1,\dots,n,\) where

  • \((p+1)<n\)
  • \(\epsilon_i\) and the vector of predictors \((X_{i1},\dots,X_{ip})\) are independent of each other for all \(i=1,\dots,n\)
  • \(E(\epsilon_i)=0\) for all \(i=1,\dots,n\)
  • \(Var(\epsilon_i)=\sigma^2>0\) is constant for all \(i=1,\dots,n\)

Typically, it is more convenient to write the multiple linear regression model in vector and matrix notation. Let \[\begin{align*} X_i&=\left(\begin{matrix}X_{i0}\\X_{i1}\\ \vdots\\X_{ip}\end{matrix}\right)\quad\text{and}\quad \beta=\left(\begin{matrix}\beta_{0}\\\beta_1\\ \vdots\\\beta_{p}\end{matrix}\right) \end{align*}\] with \(X_{i0}=1\) for all \(i=1,\dots,n.\)

This allows us to write Equation 2.3 more compactly as \[ Y_i=X_i'\beta+\epsilon_i \quad\text{for all}\quad i=1,\dots,n. \]

Next, we can stack all components. Let \[\begin{equation*} Y=\left(\begin{matrix}Y_1\\ \vdots\\Y_n\end{matrix}\right)\quad\text{and}\quad \epsilon=\left(\begin{matrix}\epsilon_1\\ \vdots\\\epsilon_n\end{matrix}\right) \end{equation*}\] denote the \((n\times 1)\) vectors containing all response values \(Y_i\) and all error terms \(\epsilon_i\) of the random sample. Moreover, let \[\begin{align*} X &=\left(\begin{matrix} X_{10}&X_{11}&\dots&X_{1p}\\ \vdots&\vdots&\ddots&\vdots\\ X_{n0}&X_{n1}&\dots&X_{np}\\ \end{matrix}\right)\\[2ex] &=\left(\begin{matrix} \;\;1\;\;&X_{11}&\dots&X_{1p}\\ \;\;\vdots\;\;&\vdots&\ddots&\vdots\\ \;\;1\;\;&X_{n1}&\dots&X_{np}\\ \end{matrix}\right) \end{align*}\] denote the \((n\times (p+1))\)-dimensional matrix containing all predictor values of the random sample, where the first column is a column full of ones \((X_{10}=1,\dots,X_{n0}=1).\)

This allows us to write Equation 2.3 even more compactly as \[ Y=X\beta+\epsilon. \]

Using matrix algebra, it can be shown that the ordinary least squares estimator of \(\beta=(\beta_0,\beta_1,\dots,\beta_p)'\) is given by the following \(((p+1)\times 1)\) vector: \[ \left(\begin{matrix}\hat\beta_0\\ \hat\beta_1\\ \vdots \\ \hat\beta_p\end{matrix}\right)=\hat{\beta}=(X'X)^{-1}X'Y. \]

For a given observed realization of the training data random sample \[ (x_{11},\dots,x_{1p},y_1),\dots,(x_{n1},\dots,x_{np},y_n), \] the estimator \(\hat\beta=(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p)'\) is selected by minimizing \[\begin{align*} \operatorname{RSS} &=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\[2ex] &=\sum_{i=1}^n\left(y_i-\left(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_p x_{ip}\right)\right)^2. \end{align*}\]

Since the estimator \[ \hat\beta=\left(\begin{matrix}\hat\beta_0\\ \hat\beta_1\\ \vdots \\ \hat\beta_p\end{matrix}\right) \] depends on the random sample, it is itself a \((p+1)\)-dimensional random variable. For each realization of the training data random sample, we observe a realization of the estimator.

In an actual data analysis, however, we only have one realization of the estimators \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) computed from the given training dataset (i.e. the observed realization of the training data random sample).
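
The closed-form expression \(\hat{\beta}=(X'X)^{-1}X'Y\) can be verified numerically. A minimal sketch, reusing the simulated data from Section 2.1 (with the intercept column of ones added to the design matrix):

## OLS via matrix algebra, compared with lm()
X_mat    <- cbind(1, data_sim$X)  # n x (p+1) design matrix (first column: ones)
y_vec    <- data_sim$Y
beta_hat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% y_vec
drop(beta_hat)                    # (intercept, slope)
coef(lm(Y ~ X, data = data_sim))  # should match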

Interpretation of Multiple Linear Regressions and the Omitted Variable Bias

Multiple linear regression is more than mere composition of single simple linear regression models. Take a look at the following two simple linear regression results:

Observations:

  • In the first simple linear regression, we see a statistically significant effect of radio on sales; i.e. we can reject the null hypothesis \[ H_0:\beta_{radio}=0 \] and adopt the alternative hypothesis \[ H_1:\beta_{radio}\neq 0. \]

  • In the second simple linear regression, we see a statistically significant effect of newspaper on sales; i.e. we can reject the null hypothesis \[ H_0:\beta_{newspaper}=0 \] and adopt the alternative hypothesis \[ H_1:\beta_{newspaper}\neq 0. \]

By contrast, when looking at the multiple linear regression when regressing sales onto

  • TV,
  • radio and
  • newspaper,

then the effect of newspaper becomes statistically insignificant; see Table 3.4.

Reason: Omitted Variable Bias

The reason for this change from a statistically significant effect of newspaper in the simple linear regression, to an insignificant effect in the multiple linear regression is the so-called Omitted Variable Bias.

Explanation of the omitted variables bias:

  • radio has a true positive effect on sales
  • newspaper has actually no effect on sales
  • But, newspaper is correlated with radio \(r_{\texttt{newspaper},\texttt{radio}}=0.3541\); see Table 3.5

  • Thus, when omitting radio from the regression model, newspaper becomes a surrogate for radio and we see a spurious effect.
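
This mechanism can be illustrated with a small simulation using purely artificial data (all names and parameter values below are hypothetical): z has a true effect on y, w has no effect, but w is correlated with z, so omitting z makes w look significant.

## Illustrating the omitted variable bias with artificial data
set.seed(1)
n <- 500
z <- rnorm(n)               # e.g. "radio": has a true effect on y
w <- 0.5 * z + rnorm(n)     # e.g. "newspaper": no effect, but correlated with z
y <- 2 + 3 * z + rnorm(n)   # outcome depends on z only
##
summary(lm(y ~ w))$coefficients      # w appears (spuriously) significant
summary(lm(y ~ w + z))$coefficients  # including z removes the spurious effect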
Danger

Interpreting statistically significant results as true effects (“a change of \(X_j\) by one unit causes on average a change in \(Y\) by \(\beta_{j}\)”) is a delicate matter.

Even if \(f(X)\) is really so simple that we can write it as a simple or multiple linear regression model, we may fail to include all relevant predictor variables, and thus statistically significant results may only be spurious effects due to omitted variables.

Interpretation of the Coefficients in Table 3.4

For fixed values of TV and newspaper, spending an additional 1000 USD on radio increases sales, on average, by approximately 189 units.

2.2.2 Inference on \(\beta_1,\dots,\beta_p\)

The \(t\)-test statistic (equivalently the confidence interval for \(\beta_j\)) allows us to test a null-hypothesis about one parameter \(\beta_j.\)

To test whether there is a relationship between the response \(Y\) and the full vector of predictors \((X_1,\dots,X_p),\) we can use the \(F\)-test statistic.

In this case, the \(F\)-test tests the null-hypothesis \[ \begin{align*} H_0:&\;\beta_1=\beta_2=\dots=\beta_p=0\\ \text{versus}\quad H_1:&\;\text{at least one $\beta_j\neq 0$; $j=1,\dots,p$} \end{align*} \]

\(F\)-test statistic \[ F=\frac{(\operatorname{TSS}-\operatorname{RSS})/p}{\operatorname{ RSS}/(n-p-1)}\overset{H_0}{\sim} F_{p,n-p-1} \] Under \(H_0,\) i.e. if \(H_0\) is true, the \(F\)-test statistic has a \(F\)-distribution with \(p\) numerator and \((n-p-1)\) denominator degrees of freedom.

If \(H_0\) is correct \[ \begin{align*} E((\operatorname{TSS}-\operatorname{RSS})/p)&=\sigma^2\\[2ex] E(\operatorname{RSS}/(n-p-1))&=\sigma^2 \end{align*} \]

Therefore:

  • If \(H_0\) is correct, we expect values of \(F\approx 1.\)
  • If \(H_1\) is correct, we expect values of \(F\gg 1.\)

\(p\)-value

\[\begin{align*} p_{obs} &=P_{H_0}\left( F \geq F_{obs} \right), \end{align*}\] where \(F_{obs}\) denotes the observed value of the \(F\)-test statistic computed from the observed training data, and where \(F\) is a random variable that has an \(F_{p,n-p-1}\) distribution. \(P_{H_0}\) means that the probability is computed for the scenario that \(H_0\) is true.

To do the statistical hypothesis test, we need to select a significance level \(\alpha\) (e.g. \(\alpha=0.01\) or \(\alpha=0.05\)).

  • If the observed realization of the \(p\)-value is larger than or equal to the significance level \[ p_{obs}\geq \alpha, \] then we cannot reject the null hypothesis.

  • If, however, the observed realization of the \(p\)-value is strictly smaller than the significance level \[ p_{obs}<\alpha, \] then we can reject the null hypothesis and adopt the alternative hypothesis.

2.3 Other Considerations in the Regression Model

2.3.1 Qualitative Predictors

Often some predictors are qualitative variables (also known as factor variables). For instance, the Credit dataset contains the following qualitative predictors:

  • own (house ownership: yes/no)
  • student (student status: yes/no)
  • status (marital status: yes/no)
  • region (regions: east, west or south)

Predictors with Only Two Levels

If a qualitative predictor (factor) only has two levels (i.e. possible values), then incorporating it into a regression model is very simple: We simply create an indicator or dummy variable that takes on two possible numerical values; for instance, \[ x_{i} = \left\{ \begin{array}{ll} 1&\quad \text{if the $i$th person owns a house}\\ 0&\quad \text{if the $i$th person does not own a house.} \end{array}\right. \] Using this dummy variable as a predictor in the regression equation results in the following regression model: \[\begin{align*} y_{i} &=\beta_0 + \beta_1 x_i + \epsilon_i\\[2ex] &= \left\{ \begin{array}{ll} \beta_0 + \beta_1 + \epsilon_i &\quad \text{if the $i$th person owns a house}\\ \beta_0 + \epsilon_i &\quad \text{if the $i$th person does not own a house} \end{array}\right. \end{align*}\]

Interpretation:

  • \(\beta_0\): The average credit card balance among those who do not own a house
  • \(\beta_0+\beta_1\): The average credit card balance among those who do own a house
  • \(\beta_1\): The average difference in credit card balance between owners and non-owners
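
In R, a two-level factor is handled automatically by lm(), which converts it into a 0/1 dummy variable behind the scenes. A minimal sketch, assuming the Credit data from the ISLR2 package with its two-level factor Own:

## Two-level qualitative predictor in R
library("ISLR2")
data(Credit)
lm_own <- lm(Balance ~ Own, data = Credit)
coef(lm_own)                # intercept and the coefficient of the 0/1 dummy
head(model.matrix(lm_own))  # shows the automatically created dummy column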

Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable \[ x_{i} = \left\{ \begin{array}{ll} 1 &\quad \text{if the $i$th person owns a house}\\ -1 &\quad \text{if the $i$th person does not own a house.} \end{array}\right. \] \[\begin{align*} y_{i} &=\beta_0 + \beta_1 x_i + \epsilon_i\\[2ex] &= \left\{ \begin{array}{ll} \beta_0 + \beta_1 + \epsilon_i&\quad \text{if the $i$th person owns a house}\\ \beta_0 - \beta_1 + \epsilon_i&\quad \text{if the $i$th person does not own a house} \end{array}\right. \end{align*}\]

Interpretation:

  • \(\beta_0\): The overall average credit card balance (ignoring the house ownership effect)
  • \(\beta_1\): The average amount by which house owners and non-owners have credit card balances that are above and below the overall average, respectively.

Qualitative Predictors with More than Two Levels

When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the

region \(\in\{\)South, West, East\(\}\)

variable, we create two dummy variables. The first could be \[ x_{i1} = \left\{ \begin{array}{ll} 1&\quad \text{if the $i$th person is from the South}\\ 0&\quad \text{if the $i$th person is not from the South,} \end{array}\right. \] and the second could be \[ x_{i2} = \left\{ \begin{array}{ll} 1&\quad \text{if the $i$th person is from the West}\\ 0&\quad \text{if the $i$th person is not from the West.} \end{array}\right. \] Using both of these dummy variables results in the following regression model: \[\begin{align*} y_{i}&=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i\\[2ex] &= \left\{ \begin{array}{ll} \beta_0 + \beta_1 + \epsilon_i& \quad \text{if the $i$th person is from the South}\\ \beta_0 + \beta_2 + \epsilon_i& \quad \text{if the $i$th person is from the West}\\ \beta_0 + \epsilon_i& \quad \text{if the $i$th person is from the East.}\\ \end{array}\right. \end{align*}\]

Interpretation:

  • \(\beta_0\): The average credit card balance for individuals from the East
  • \(\beta_1\): The difference in the average balance between people from the South versus the East
  • \(\beta_2\): The difference in the average balance between people from the West versus the East

There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations, and are designed to measure particular contrasts. (A detailed discussion of contrasts is beyond the scope of this lecture.)
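
In R, the dummy coding for a factor with several levels is again created automatically; a minimal sketch, assuming the Credit data from the ISLR2 package (loaded as above) with its three-level factor Region:

## Qualitative predictor with more than two levels
lm_region <- lm(Balance ~ Region, data = Credit)
coef(lm_region)           # the baseline level is absorbed into the intercept
contrasts(Credit$Region)  # the dummy (contrast) coding used by R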

2.3.2 Extensions of the Linear Model

Interaction Effects

Previously, we used the following model

sales \(= \beta_0 + \beta_1\) TV \(+ \beta_2\) radio \(+ \beta_3\) newspaper \(+\epsilon\)

which states, for instance, that the average increase in sales associated with a one-unit increase in TV is \(\beta_1,\) regardless of the amount spent on radio.

However, this simple model may be incorrect. Suppose that there is a synergy effect, such that spending money on radio advertising actually increases the effectiveness of TV advertising.

Figure 3.5 suggests that such an effect may be present in the advertising data:

  • When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
  • But when advertising is split between the two media, then the model tends to underestimate sales.

Solution: Interaction Effects:

Consider the standard linear regression model with two variables, \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon. \] Here each predictor \(X_1\) and \(X_2\) has a given effect, \(\beta_1\) and \(\beta_2\), on \(Y\) and this effect does not depend on the value of the other predictor. (Additive Assumption)

One way of extending this model is to include a third predictor, called an interaction term, which is constructed by computing the product of \(X_1\) and \(X_2.\) This results in the model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 \overbrace{\color{red}X_1X_2}^{=X_3} + \epsilon. \] This is a powerful extension relaxing the additive assumption. Notice that the model can now be written as \[ \begin{align*} Y &= \beta_0 + \underbrace{(\beta_1 + \beta_3 X_2)}_{=\tilde{\beta}_1(X_2)} X_1 + \beta_2 X_2 + \epsilon, \end{align*} \] where the new slope parameter \(\tilde{\beta}_1(X_2)\) is a linear function of \(X_2,\) i.e. \[ \tilde{\beta}_1(X_2)=\beta_1 + \beta_3 X_2. \]

Thus, a change in the value of \(X_2\) will change the association between \(X_1\) and \(Y.\)

A similar argument shows that a change in the value of \(X_1\) changes the association between \(X_2\) and \(Y.\)

Let us return to the Advertising example:
A linear model that predicts sales using

  • the predictor radio,
  • the predictor TV, and
  • the interaction radio\(\times\)TV

takes the form

sales \(= \beta_0 + \beta_1\times\) TV \(+ \beta_2\times\) radio \(+ \beta_3\times(\) radio\(\times\) TV\()+\epsilon\)


which can be rewritten as

sales \(=\beta_0 + (\beta_1+ \beta_3\times\) radio \()\times\) TV \(+ \beta_2\times\) radio \(+\epsilon\)


Interpretation:

  • \(\beta_3\) denotes the increase in the effectiveness of TV advertising associated with a one-unit increase in radio advertising.
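
Such an interaction model could be fitted in R as sketched below. The Advertising data is not part of the ISLR2 package, so the sketch assumes that it has been downloaded as a CSV file (e.g. Advertising.csv from the textbook website) and contains the columns TV, radio and sales:

## Hypothetical sketch: interaction between TV and radio
Advertising <- read.csv("Advertising.csv")  # assumed local copy of the data
lm_inter    <- lm(sales ~ TV + radio + TV:radio, data = Advertising)
## equivalently: lm(sales ~ TV * radio, data = Advertising)
summary(lm_inter)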

Interpretation of Table 3.9:

  • Both separate main effects, TV and radio, are statistically significant (\(p\)-values smaller than 0.01).
  • Additionally, the \(p\)-value for the interaction term, TV\(\times\)radio, is extremely low, indicating that there is strong evidence for \(H_1: \beta_3\neq 0.\) In other words, it is clear that the true relationship is not additive.
Hierarchical Principle of Interaction Terms

If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant.

Interactions with Qualitative Variables:

An interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation.

Consider the Credit data set and suppose that we wish to predict balance using the predictors:

  • income (quantitative) and
  • student (qualitative) using a dummy variable with \[ x_{i2}=\left\{ \begin{array}{ll} 1&\text{if person $i$ is a student}\\ 0&\text{if not}\\ \end{array} \right. \]

In the absence of an interaction term, the model takes the form \[\begin{align*} \text{balance}_{i} &=\beta_0 + \beta_1\,\text{income}_i + \beta_2 x_{i2} + \epsilon_i\\[2ex] &= \left\{ \begin{array}{ll} (\beta_0 + \beta_2) + \beta_1\,\text{income}_i + \epsilon_i &\quad \text{if the $i$th person is a student}\\ \beta_0 + \beta_1\,\text{income}_i + \epsilon_i &\quad \text{if the $i$th person is not a student.} \end{array}\right. \end{align*}\]

Thus, the regression lines for students and non-students have different intercepts, \(\beta_0+\beta_2\) versus \(\beta_0\), but the same slope \(\beta_1\).

This represents a potentially serious limitation of the model, since a change in income may have a very different effect on the credit card balance of a student versus a non-student.

This limitation can be addressed by adding an interaction variable, created by multiplying income with the dummy variable for student. Our model now becomes \[\begin{align*} \text{balance}_{i} &=\beta_0 + \beta_1\,\text{income}_i + \beta_2 x_{i2} + \beta_3\,\text{income}_i\,x_{i2} + \epsilon_i\\[2ex] &= \left\{ \begin{array}{ll} (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,\text{income}_i + \epsilon_i &\quad \text{if the $i$th person is a student}\\ \beta_0 + \beta_1\,\text{income}_i + \epsilon_i &\quad \text{if the $i$th person is not a student.} \end{array}\right. \end{align*}\]

Now we have different intercepts for students and non-students but also different slopes for these groups.
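
A sketch of both models in R, assuming the Credit data from the ISLR2 package (with the quantitative predictor Income and the two-level factor Student):

## Without interaction: different intercepts, common slope
lm_add <- lm(Balance ~ Income + Student, data = Credit)
## With interaction: different intercepts and different slopes
lm_int <- lm(Balance ~ Income * Student, data = Credit)
coef(lm_add)
coef(lm_int)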

Polynomial Regression: Non-linear Relationships

Polynomial regression allows us to accommodate non-linear relationships between the predictors \(X\) and the outcome \(Y.\)

For example, the points in Figure 3.8 seem to have a quadratic shape, suggesting that a model of the form

mpg \(=\beta_0 + \beta_1\times\) horsepower \(+ \beta_2\times(\)horsepower\()^2+\epsilon\)

may provide a better fit.


This regression model involves predicting mpg using a non-linear function of horsepower.

Important

But it is still a linear model! It’s simply a multiple linear regression model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \] with

  • \(X_1=\)horsepower and
  • \(X_2 =(\)horsepower\()^2\)

as the predictor variables.

Since this is nothing but a multiple linear regression model, we can use standard linear regression software to estimate \(\beta_0\), \(\beta_1\), and \(\beta_2\) in order to fit the (quadratic) non-linear regression function.

2.3.3 Potential Problems

1. Non-linearity of the response-predictor relationships.

Diagnostic residual plots are most useful to detect possible non-linear response-predictor relationships.

library("ISLR2")
data(Auto) 

## Gives the variable names in the Auto dataset
# names(Auto)

## Simple linear regression
lmobj_1 <- lm(mpg ~ horsepower, data = Auto)

## Quadratic regression 
lmobj_2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

## Diagnostic Plot
par(mfrow = c(1,2))
plot(lmobj_1, which = 1)
plot(lmobj_2, which = 1)

Plotting residuals versus fitted values, i.e. plotting the data pairs \[ (\underbrace{y_i - \hat{y}_i}_{=e_i}, \hat{y}_i),\quad\text{for all}\quad i=1,\dots,n \] is a useful graphical tool for identifying non-linearity which works for the

  • simple linear regression model with \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_{i1}\) and the
  • multiple linear regression model with \(\hat{y}_i=\hat{\beta}_0+\sum_{j=1}^p\hat{\beta}_jx_{ij}\)

If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as \[ \log(X),\; \sqrt{X},\; \text{or}\; X^2 \] in the regression model.

2. Correlation of Error Terms

An important assumption of the linear regression model is that the error terms, \[ \epsilon_1, \epsilon_2, \dots , \epsilon_n, \] are independent and thus uncorrelated. What does this mean? For instance, if the errors are uncorrelated, then the fact that \(\epsilon_i\) is positive provides little or no information about the sign of \(\epsilon_{i+1}.\)

Auto-correlations among the error terms typically occur in time series data. Figure 3.10 shows time-series of residuals with

  • no auto-correlation (\(\rho=0\))
  • intermediate auto-correlation (\(\rho=0.5\))
  • strong auto-correlation (\(\rho=0.9\))

3. Non-Constant Variance of Error Terms (Heteroskedasticity)

Another important assumption of the linear regression model is that the error terms have a constant variance, \[ Var(\epsilon_i) = \sigma^2,\quad\text{for all}\quad i=1,\dots,n. \]

One can identify non-constant variances (“heteroskedasticity”) in the errors, using diagnostic residual plots.

Often one observes that the magnitude of the scattering of the residuals tends to increase with the fitted values. When faced with this problem, one possible solution is to transform the response \(Y\) using a concave function such as \[ \log(Y)\;\text{ or }\; \sqrt{Y}. \] Such a transformation results in a greater amount of shrinkage of the larger responses.

## Quadratic regression 
lmobj_2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

## Quadratic regression with transformed response log(Y)
lmobj_3 <- lm(I(log(mpg)) ~ horsepower + I(horsepower^2), data = Auto)

## Diagnostic Plot
par(mfrow = c(1,2))
plot(lmobj_2, which = 1)
plot(lmobj_3, which = 1)

Danger

The standard formulas for

  • standard errors,
  • confidence intervals, and
  • hypothesis tests

in this chapter are based on the assumption of

  • uncorrelated error terms \(Cor(\epsilon_i,\epsilon_j)=0\) for all \(i\neq j\)
  • with equal variances \(Var(\epsilon_i)=\sigma^2\) for all \(i=1,\dots,n.\)

If in fact there is correlation and/or heteroskedasticity among the error terms, then the estimated standard errors will be wrong leading to invalid inferences.

Thus, if the error terms are auto-correlated and/or heteroskedastic, we need to take this into account by using so-called auto-correlation and/or heteroskedasticity robust standard errors.

The R package sandwich contains such robust standard error estimators.
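
A minimal sketch of how this could look in practice, assuming the sandwich and lmtest packages are installed (lmobj_2 is the quadratic fit from above); vcovHC() provides heteroskedasticity-robust covariance estimates and vcovHAC() additionally accounts for auto-correlation:

## Heteroskedasticity-robust standard errors (sketch)
library("sandwich")
library("lmtest")
coeftest(lmobj_2, vcov. = vcovHC(lmobj_2, type = "HC3"))  # robust SEs
coeftest(lmobj_2)                                         # conventional SEs for comparison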

4. Outliers

An outlier is a point \(i\) for which \(y_i\) is far from the value \(\hat{y}_i\) predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

Outliers typically have a strong effect on the \(R^2\) value since they add a very large residual to its computation.

Harmless Outlier: Figure 3.12 shows a clear outlier (observation 20) which, however, has a typical predictor value \(x_i\); i.e. the \(x_i\)-value is right in the center of all predictor values. Such outliers have little effect on the regression fit.

Harmful Outlier: Figure 3.13 shows again a clear outlier (observation 41) which, however, has a predictor value \(x_i\) that is very atypical. Such outliers are said to have large leverage giving them power to affect the regression fit considerably.

Tip

Critical outliers have both,

  • large residuals

and

  • large leverage.

5. High Leverage Points

In order to quantify an observation’s leverage, we compute the leverage statistic \(h_i\) for each observation \(i=1,\dots,n.\) A large value of this statistic indicates an observation with high leverage.

In the case of the simple linear regression model \[ h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{i'=1}^n(x_{i'}-\bar{x})^2}. \]

In the case of the multiple linear regression model, \(h_i\) is the \(i\)th diagonal value of the \((n\times n)\)-dimensional “hat-matrix” \[ H=X(X'X)^{-1}X'. \]

  • The leverage statistic \(h_i\) is always between \(1/n\) and \(1\)
  • The average leverage for all the observations is equal to \[ \bar{h}=\frac{1}{n}\sum_{i=1}^n h_i=(p + 1)/n. \]
  • If a given observation has a leverage statistic \(h_i\) that greatly exceeds the average leverage value, \((p+1)/n,\) then we may suspect that the corresponding point has high leverage.

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another.

In the following example, the variables Age and Limit are essentially unrelated, but the variables Rating and Limit are closely related to one another.

library("ISLR2")
data(Credit) # names(Credit)

par(mfrow=c(1,2))
plot(y = Credit$Age,    x = Credit$Limit, main = "No Collinearity", ylab = "Age", xlab = "Limit")
plot(y = Credit$Rating, x = Credit$Limit, main = "Strong Collinearity", ylab = "Rating", xlab = "Limit")

The left panel of Figure 3.15 shows that, in the case of unrelated predictors (Age and Limit), the least squares problem has a minimum \((\hat\beta_{Age},\hat\beta_{Limit})\) that is well identified, since the RSS has a single, clearly defined minimum.

The right panel of Figure 3.15 shows that, in the case of collinear predictors (Rating and Limit), the least squares problem has a minimum \((\hat\beta_{Rating},\hat\beta_{Limit})\) that is not well identified: one can move to many other pairs \((\hat\beta_{Rating}',\hat\beta_{Limit}')\) that achieve essentially the same RSS value as the minimizer \((\hat\beta_{Rating},\hat\beta_{Limit})\).

Table 3.11 demonstrates that this identification problem between the collinear predictors (Rating and Limit) inflates the variance (the squared standard error) of the estimators \(\hat\beta_{Rating}\) and \(\hat\beta_{Limit}.\)

  • In Model 1 (age and limit): \(Var(\hat\beta_{Limit}) = 0.005^2=0.000025\)
  • In Model 2 (rating and limit): \(Var(\hat\beta_{Limit}) = 0.064^2=0.004096\)

We call this situation multicollinearity.

To detect multicollinearity issues, one can use the variance inflation factor (VIF) \[ \operatorname{VIF}(\hat{\beta}_j)=\frac{1}{1-R^2_{X_j|X_{-j}}}, \] where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors.

  • If \(R^2_{X_j|X_{-j}}\) is close to one, then multicollinearity is present, and \(\operatorname{VIF}(\hat{\beta}_j)\) will be large.

In the Credit data, one gets for the predictors age, rating, and limit the following VIF values:

  • 1.01 (age)
  • 160.67 (rating)
  • 160.59 (limit)

Thus, as we suspected, there is considerable collinearity in the data!

Possible solutions:

  1. Drop one of the problematic variables from the regression. This can usually be done without much compromise to the regression fit, since the presence of collinearity implies that the information that this variable provides about the response is redundant in the presence of the other variables.
    Caution: In econometrics, dropping control variables is generally not a good idea since control variables are there to rule out possible issues with omitted variables biases.

  2. Combine the collinear variables together into a single predictor. For instance, we might take the average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness.

  3. Use a different estimation procedure like ridge regression.

  4. Live with it. At least you know where the large standard errors are coming from.

2.4 Comparison of Linear Regression with K-Nearest Neighbors

Linear regression is an example of a parametric approach because it assumes a linear model form for \(f(X).\)

Advantages of parametric approaches:

  • Typically easy to fit
  • Simple interpretation
  • Simple inference

Disadvantages of parametric approaches:

  • The parametric model assumption can be far from true; i.e. \[ f(X) \neq \beta_0+\beta_1X_1+\dots+\beta_pX_p \]

Alternative: Non-parametric methods such as K-nearest neighbors regression since non-parametric approaches do not explicitly assume a parametric form for \(f(X).\)

K-nearest neighbors regression (KNN regression)

Given a value for \(K\) and a prediction point \(x_0,\) KNN regression computes \(\hat{f}(x_0)\) in two steps:

  1. identify the \(K\) training observations that are closest to \(x_0\), represented by the index set \(\mathcal{N}_0\subset\{1,2,\dots,n_{Train}\}.\)
  2. estimate \(f(x_0)\) using the average of all the training responses \(y_i\) with \(i\in\mathcal{N}_0,\) i.e.  \[ \hat{f}(x_0)=\frac{1}{K}\sum_{i\in\mathcal{N}_0}y_i. \]
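
A minimal base-R sketch of these two steps for a single predictor (function and variable names are only for illustration; data_sim is the simulated data from Section 2.1):

## KNN regression prediction for one predictor (illustrative sketch)
knn_reg_predict <- function(x0, x_train, y_train, K){
  dist_to_x0 <- abs(x_train - x0)       # step 1: distances to the prediction point
  N0         <- order(dist_to_x0)[1:K]  # indices of the K nearest neighbors
  mean(y_train[N0])                     # step 2: average of their responses
}
##
knn_reg_predict(x0 = 0, x_train = data_sim$X, y_train = data_sim$Y, K = 9)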

The left panel of Figure 3.16 shows the estimation result for \(K=1\) and the right panel for \(K=9.\)

In general, the optimal value for \(K\) will depend on the bias-variance tradeoff, which we introduced in Chapter 1:

  • A small value for \(K\) provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent, e.g., on just one observation if \(K=1\).
  • A large value of \(K\) provides a less flexible fit. The prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in \(f(X).\)

An optimal value of \(K\) can be chosen using, e.g., cross-validation; see Chapter 4.

Generally, the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of \(f\) and vice versa.

Figure 3.17 provides an example with data generated from a one-dimensional linear regression model:

  • black solid lines: true \(f(x)\)
  • blue curves: KNN fits \(\hat{f}(x)\) using \(K = 1\) (left plot) and \(K = 9\) (right plot).

Observations:

  • The KNN fit \(\hat{f}(x)\) using \(K = 1\) is far too wiggly
  • The KNN fit \(\hat{f}(x)\) using \(K = 9\) is much closer to the true \(f(X).\)

However, since the true regression function is here linear, it is hard for a non-parametric approach to compete with simple linear regression: a non-parametric approach incurs a cost in variance that is here not offset by a reduction in bias.

The blue dashed line in the left-hand panel of Figure 3.18 represents the simple linear regression fit to the same data. It is almost perfect. The right-hand panel of Figure 3.18 reveals that linear regression outperforms KNN for this data.

Figure 3.19 displays non-linear situations in which KNN performs much better than simple linear regression.

Curse of Dimensionality

Unfortunately, in higher dimensions, KNN often performs worse than simple/multiple linear regression, since non-parametric approaches suffer from the curse of dimensionality.

Figure 3.20 considers the same strongly non-linear situation as in the second row of Figure 3.19, except that we have added additional noise (i.e. redundant) predictors that are not associated with the response.

  • When \(p = 1\) or \(p = 2,\) KNN outperforms linear regression.
  • But for \(p = 3\) the results are mixed, and for \(p\geq 4\) linear regression is superior to KNN.

Observations:

  • When \(p=1\), a sample size of \(n=50\) can provide enough information to estimate \(f(X)\) accurately using non-parametric methods since the \(K\) nearest neighbors can actually be close to a given test observation \(x_0.\)
  • However, when spreading the \(n=50\) data points over a large number of, for instance, \(p=20\) dimensions, the \(K\) nearest neighbors tend to become far away from \(x_0.\)

2.5 R-Lab: Linear Regression

2.5.1 Libraries

The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries.

Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR2 package, which includes the data sets associated with this book.

suppressPackageStartupMessages(library(MASS))
suppressPackageStartupMessages(library(ISLR2))

If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR2, must be downloaded the first time they are used.

This can be done, for instance, at the R command line via install.packages("ISLR2") function. This installation only needs to be done the first time you use a package. However, the library() function must be called within each R session.

2.5.2 Simple Linear Regression

The ISLR2 library contains the Boston data set, which records medv (median house value) for \(506\) census tracts in Boston. We will seek to predict medv using \(12\) predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).

head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7

To find out more about the data set, we can type ?Boston.

We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.

lm.fit <- lm(medv ~ lstat)
Error in eval(predvars, data, env): object 'medv' not found

The command causes an error because R does not know where to find the variables medv and lstat.

The next line tells R that the variables are in Boston:

lm.fit <- lm(medv ~ lstat, data = Boston)

Alternatively, we can attach the Boston object:

attach(Boston)
lm.fit <- lm(medv ~ lstat)

If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model.

lm.fit

Call:
lm(formula = medv ~ lstat)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  
summary(lm.fit)

Call:
lm(formula = medv ~ lstat)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,    Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

We can use the names() or the str() function in order to find out what other pieces of information are stored in lm.fit.

We can extract these quantities by name—e.g. lm.fit$coefficients.

For some objects, there are also specific extractor functions like coef() to access them.

names(lm.fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
coef(lm.fit)
(Intercept)       lstat 
 34.5538409  -0.9500494 

In order to obtain a confidence interval for the coefficient estimates, we can use the confint() command.

Type confint(lm.fit) at the command line to obtain the confidence intervals for the linear regression coefficients.

confint(lm.fit)
                2.5 %     97.5 %
(Intercept) 33.448457 35.6592247
lstat       -1.026148 -0.8739505

The predict() function can be used to produce confidence intervals and prediction intervals for the prediction of medv for a given value of lstat.

predict(lm.fit, data.frame(lstat = (c(5, 10, 15))), 
        interval = "confidence")
       fit      lwr      upr
1 29.80359 29.00741 30.59978
2 25.05335 24.47413 25.63256
3 20.30310 19.73159 20.87461

For instance, the observed value of the 95% confidence interval associated with a lstat value of \(10,\) i.e. associated with \[ \beta_0 + \beta_1 \cdot 10 \] is \[ \operatorname{CI}_{\beta_0 + \beta_1 \cdot 10,obs}=[24.47, 25.63] \] with observed prediction value \[ 25.05335 = \hat\beta_0 + \hat\beta_1 \cdot 10. \]

We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.

plot(lstat, medv)
abline(lm.fit)

There is some evidence for non-linearity in the relationship between lstat and medv. We will explore this issue later in this lab.

The abline() function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a and slope b, we type abline(a, b). Below we experiment with some additional settings for plotting lines and points. The lwd = 3 command causes the width of the regression line to be increased by a factor of 3; this works for the plot() and lines() functions also. We can also use the pch option to create different plotting symbols.

plot(lstat, medv)
abline(lm.fit, lwd = 3, col = "red")

plot(lstat, medv, col = "red")

plot(lstat, medv, pch = 20)

plot(lstat, medv, pch = "+")

plot(1:20, 1:20, pch = 1:20)

Next we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() function with its mfrow argument, which tells R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a \(2 \times 2\) grid of panels.

par(mfrow = c(2, 2))
plot(lm.fit)

Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.

plot(predict(lm.fit), residuals(lm.fit))

plot(predict(lm.fit), rstudent(lm.fit))

On the basis of the residual plots, there is some evidence of non-linearity.

Leverage statistics can be computed using the hatvalues() function.

plot(hatvalues(lm.fit))

which.max(hatvalues(lm.fit))
375 
375 

The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.

sort(hatvalues(lm.fit), decreasing = TRUE)[1:3]
       375        415        374 
0.02686517 0.02495670 0.02097101 

The sort() function can be used to sort the values of a vector such as hatvalues(lm.fit); with decreasing = TRUE, the three largest leverage values are printed first.
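As a rough benchmark (not part of the textbook lab), the average leverage is always \((p+1)/n\); for this simple regression with \(n=506\) observations and \(p=1\) predictor this is \(2/506\approx 0.004\), so the largest leverage values above clearly exceed it:

## Average leverage (p + 1)/n as a rough benchmark;
## observations far above it deserve a closer look.
p <- 1                          # one predictor (lstat)
n <- length(hatvalues(lm.fit))  # n = 506
(p + 1) / n                     # approx. 0.004; the top leverage values above are several times larger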

2.5.3 Multiple Linear Regression

In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.

lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)

Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,    Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

The Boston data set used here contains 13 predictor variables besides medv, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:

lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)

Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIFs are low to moderate for this data. The car package is not part of the base R installation, so it must be installed the first time you use it, via the install.packages() function.

suppressPackageStartupMessages(library(car)) # contains the vif() function
sort(vif(lm.fit)) # computes the VIF statistics and sorts them
    chas    black     crim  ptratio       rm       zn    lstat      age 
1.073995 1.348521 1.792192 1.799084 1.933744 2.298758 2.941491 3.100826 
     dis    indus      nox      rad      tax 
3.955945 3.991596 4.393720 7.484496 9.008554 

What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high \(p\)-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.

lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)

Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6054  -2.7313  -0.5188   1.7601  26.2243 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
crim         -0.108006   0.032832  -3.290 0.001075 ** 
zn            0.046334   0.013613   3.404 0.000719 ***
indus         0.020562   0.061433   0.335 0.737989    
chas          2.689026   0.859598   3.128 0.001863 ** 
nox         -17.713540   3.679308  -4.814 1.97e-06 ***
rm            3.814394   0.408480   9.338  < 2e-16 ***
dis          -1.478612   0.190611  -7.757 5.03e-14 ***
rad           0.305786   0.066089   4.627 4.75e-06 ***
tax          -0.012329   0.003755  -3.283 0.001099 ** 
ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
black         0.009321   0.002678   3.481 0.000544 ***
lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7343 
F-statistic: 117.3 on 12 and 493 DF,  p-value: < 2.2e-16

Alternatively, the update() function can be used.

lm.fit1 <- update(lm.fit, ~ . - age)

2.5.4 Interaction Terms

It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat * age simultaneously includes lstat, age, and the interaction term lstat\(\times\)age as predictors; it is a shorthand for lstat + age + lstat:age.

summary(lm(medv ~ lstat * age, data = Boston))

Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,    Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

2.5.5 Non-linear Transformations of the Predictors

The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning in a formula object; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat^2.

lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)

Call:
lm(formula = medv ~ lstat + I(lstat^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,    Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.

lm.fit <- lm(medv ~ lstat)
anova(lm.fit, lm.fit2)
Analysis of Variance Table

Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    504 19472                                 
2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat^2. The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior.

Here the \(F\)-statistic is \(135\) and the associated \(p\)-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat.
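The reported \(F\)-statistic can be reproduced by hand from the RSS values in the table, \(F=\frac{(\operatorname{RSS}_1-\operatorname{RSS}_2)/1}{\operatorname{RSS}_2/503}\):

## Reproducing the F-statistic of the anova() comparison by hand
RSS1 <- 19472                         # RSS of the linear model
RSS2 <- 15347                         # RSS of the quadratic model
((RSS1 - RSS2) / 1) / (RSS2 / 503)    # approx. 135, as reported above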

This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat.

If we type

par(mfrow = c(2, 2))
plot(lm.fit2)

then we see that when the lstat^2 term is included in the model, there is little discernible pattern in the residuals.

In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:

lm.fit5 <- lm(medv ~ poly(lstat, 5))
summary(lm.fit5)

Call:
lm(formula = medv ~ poly(lstat, 5))

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,    Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16

This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.

By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
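As a quick check (a sketch, not part of the original lab code; the object name lm.fit5.raw is made up), fitting the raw polynomials gives different coefficient estimates but identical fitted values:

## Raw (non-orthogonalized) polynomials: different coefficients,
## but the same fitted values as the orthogonal-polynomial fit above
lm.fit5.raw <- lm(medv ~ poly(lstat, 5, raw = TRUE))
all.equal(fitted(lm.fit5), fitted(lm.fit5.raw))   # TRUE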

Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.

summary(lm(medv ~ log(rm), data = Boston))

Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -76.488      5.028  -15.21   <2e-16 ***
log(rm)       54.055      2.739   19.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,    Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16

2.5.6 Qualitative Predictors

We will now examine the Carseats data, which is part of the ISLR2 library. We will attempt to predict Sales (child car seat sales) in \(400\) locations based on a number of predictors.

head(Carseats)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1  9.50       138     73          11        276   120       Bad  42        17
2 11.22       111     48          16        260    83      Good  65        10
3 10.06       113     35          10        269    80    Medium  59        12
4  7.40       117    100           4        466    97    Medium  55        14
5  4.15       141     64           3        340   128       Bad  38        13
6 10.81       124    113          13        501    72       Bad  78        16
  Urban  US
1   Yes Yes
2   Yes Yes
3   Yes Yes
4   Yes Yes
5   Yes  No
6    No Yes

The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ShelveLoc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.

lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, 
    data = Carseats)
summary(lm.fit)

Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
Income              0.0108940  0.0026044   4.183 3.57e-05 ***
Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
Population          0.0001592  0.0003679   0.433 0.665330    
Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
Age                -0.0579466  0.0159506  -3.633 0.000318 ***
Education          -0.0208525  0.0196131  -1.063 0.288361    
UrbanYes            0.1401597  0.1124019   1.247 0.213171    
USYes              -0.1575571  0.1489234  -1.058 0.290729    
Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
Price:Age           0.0001068  0.0001333   0.801 0.423812    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8719 
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

The contrasts() function returns the coding that R uses for the dummy variables.

attach(Carseats)
contrasts(ShelveLoc)
       Good Medium
Bad       0      0
Good      1      0
Medium    0      1

Use ?contrasts to learn about other contrasts, and how to set them.

R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location.
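If a different baseline category is preferred, the reference level of the factor can be changed with relevel(); a small sketch (the name ShelveLoc2 is made up for this illustration):

## Making "Good" the baseline category instead of "Bad";
## the dummy coding and coefficient interpretations change accordingly.
ShelveLoc2 <- relevel(Carseats$ShelveLoc, ref = "Good")
contrasts(ShelveLoc2)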

2.5.7 Writing Functions

As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR2 and MASS libraries, called LoadLibraries(). Before we have created the function, R returns an error if we try to call it.

LoadLibraries
Error in eval(expr, envir, enclos): object 'LoadLibraries' not found
LoadLibraries()
Error in LoadLibraries(): could not find function "LoadLibraries"

We now create the function.

LoadLibraries <- function() {
 library(ISLR2)
 library(MASS)
 print("The libraries have been loaded.")
}

Now if we type in LoadLibraries, R will tell us what is in the function.

LoadLibraries
function() {
 library(ISLR2)
 library(MASS)
 print("The libraries have been loaded.")
}

If we call the function, the libraries are loaded in and the print statement is output.

LoadLibraries()
[1] "The libraries have been loaded."

2.6 Exercises

Prepare the following exercises of Chapter 3 in our course textbook ISLR:

  • Exercise 1
  • Exercise 2
  • Exercise 3
  • Exercise 8
  • Exercise 9

2.7 Solutions

Exercise 1

1 a) Describe the null hypotheses to which the \(p\)-values given in Table 3.4 correspond.

1 b) Explain what conclusions you can draw based on these \(p\)-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

Answers:

1 a) In Table 3.4, the null hypothesis for TV is that in the presence of radio ads and newspaper ads, TV ads have no effect on sales. Similarly, the null hypothesis for radio is that in the presence of TV ads and newspaper ads, radio ads have no effect on sales, and the null hypothesis for newspaper is that in the presence of TV ads and radio ads, newspaper ads have no effect on sales.

1 b) The low p-values of TV and radio allow us to reject the “no effect” null hypotheses for TV and radio. Hence, we believe that

  • TV ads have an effect on sales in the presence of radio and newspaper ads.
  • radio ads have an effect on sales in the presence of TV and newspaper ads.

The high p-value of newspaper does not allow us to reject the “no effect” null-hypothesis. This constitutes an inconclusive result and only says that the possible effects of newspaper ads are not large enough to stand out from the estimation errors.

Remember

An insignificant hypothesis test result is never informative about whether the tested null hypothesis is true. We do not have error control for falsely accepting the null hypothesis, i.e. for type II errors. We only have error control (via the significance level) for falsely rejecting the null hypothesis, i.e. for type I errors.

Exercise 2

Carefully explain the main difference between the KNN classifier and KNN regression methods.

Answer:

The KNN classifier and KNN regression methods are closely related. Both start from the \(K\) training observations whose predictor values are closest to a given query point \(x_0\). The KNN classifier then predicts a qualitative response \(Y\) by a majority vote among these neighbors, whereas KNN regression predicts the quantitative value \(f(x_0)\) by averaging the neighbors' responses.
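A minimal base-R sketch (not part of the exercise; all names are made up) illustrating the difference for a single query point \(x_0\): the regression variant averages the neighbors' responses, while the classifier takes a majority vote.

## KNN with K = 5 for one query point x0, implemented by hand
set.seed(1)
x     <- runif(50)                          # one predictor
y_num <- 2 + 3 * x + rnorm(50, sd = 0.3)    # quantitative response
y_cat <- factor(ifelse(x > 0.5, "A", "B"))  # qualitative response

x0 <- 0.4
K  <- 5
nn <- order(abs(x - x0))[1:K]               # indices of the K nearest neighbors

mean(y_num[nn])                      # KNN regression: average of the neighbors
names(which.max(table(y_cat[nn])))   # KNN classifier: majority vote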

Exercise 3

Suppose we have a data set with five predictors:

\(X_1 =GPA\)

\(X_2 = IQ\)

\(X_3 = Gender\) (\(1\) for Female and \(0\) for Male)

\(X_4 =\) Interaction between \(GPA\) and \(IQ\)

\(X_5 =\) Interaction between \(GPA\) and \(Gender\)

The response variable (in thousands of dollars) is defined as:

\(Y =\) starting salary after graduation

Suppose we use least squares to fit the model, and get:

\(\hat{\beta}_0 = 50\), \(\hat{\beta}_1 = 20\), \(\hat{\beta}_2 = 0.07\), \(\hat{\beta}_3 = 35\), \(\hat{\beta}_4 = 0.01\), and \(\hat{\beta}_5 = −10\).

Thus we have:

\[ \begin{align*} &E[Y|X] = \\ & 50 + 20\,\overbrace{GPA}^{X_1} + 0.07\,\overbrace{IQ}^{X_2} + 35\,\overbrace{Gender}^{X_3} + 0.01\,\overbrace{GPA\cdot IQ}^{X_4=X_1\cdot X_2} - 10\,\overbrace{GPA\cdot Gender}^{X_5=X_1\cdot X_3} \end{align*} \]

3 a) Which answer is correct, and why?

  1. For a fixed value of \(IQ\) and \(GPA\), males earn more on average than females.
  2. For a fixed value of \(IQ\) and \(GPA\), females earn more on average than males.
  3. For a fixed value of \(IQ\) and \(GPA\), males earn more on average than females provided that the \(GPA\) is high enough.
  4. For a fixed value of \(IQ\) and \(GPA\), females earn more on average than males provided that the \(GPA\) is high enough.

Answer: Observe that: \[ \begin{align*} \text{Male\; $(X_3 = 0)$:}\quad & 50 + 20 X_1 + 0.07 X_2 + \phantom{3}0 + 0.01\,(X_1 \cdot X_2) -0 \\[1.5ex] \text{Female\; $(X_3 = 1)$:}\quad & 50 + 20 X_1 + 0.07 X_2 + 35 + 0.01(X_1 \cdot X_2) - 10\,X_1 \end{align*} \]

Thus 3 a) iii. is correct, since once the \(X_1=\)GPA is high enough (\(35-10\,X_1<0 \Leftrightarrow X_1>3.5\)), males earn more on average.
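A quick numerical check of the crossover (a small sketch): the female-minus-male difference in expected starting salary is \(35-10\cdot GPA\), which changes sign at \(GPA=3.5\).

## Female minus male expected starting salary (in 1000 USD) at fixed IQ
GPA <- c(3.0, 3.5, 4.0)
35 - 10 * GPA   #  5  0 -5  -> females earn more on average only for GPA < 3.5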

3 b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

Answer:

GPA    <-   4
IQ     <- 110
Gender <-   1 # female = 1
## Prediction
Y_hat  <- 50 + 20*GPA + 0.07*IQ + 35*Gender + 0.01*GPA*IQ - 10*GPA*Gender
Y_hat
[1] 137.1

3 c) True or false: Since the coefficient for the GPA\(\times\)IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

Answer:

False. We must examine the \(p\)-value (or the \(t\)-statistic) of the regression coefficient to determine if the interaction term is statistically significant or not.

Exercise 8

This question involves the use of simple linear regression on the Auto data set.

8 a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.

library("ISLR2")

data("Auto")

# Perform linear regression
lmObj_1 <- lm(mpg ~ horsepower, data=Auto)

# Use summary function to print the results
summary(lmObj_1)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Comment on the output. For example:

i) Is there a relationship between the predictor and the response?

Answer:

Yes, there is. The predictor horsepower has a statistically significant (\(p<0.001\)) linear relationship with the response.

ii) How strong is the relationship between the predictor and the response?

Answer:

Statistical significance does not necessarily mean a practically strong or important relationship.

To quantify the strength of the relationship between the predictor and the response, we can look at the following quantities:

  • Residual Standard Error (RSE) (estimate of the standard deviation of \(\epsilon\)) in comparison to the RSE of the trivial linear regression model with only an intercept.
  • The \(R^2\) Statistic (the proportion of variance explained by the model)
  • The \(F\)-Statistic

The Residual Standard Error (RSE) of the regression model with intercept and horsepower as predictors is given by:

## RSE of lm(mpg ~ horsepower):
RSS <- sum(resid(lmObj_1)^2)
n   <- length(resid(lmObj_1))
RSE <- sqrt(RSS/(n-2))
round(RSE, 3)
[1] 4.906
## Alternatively: 
round(summary(lmObj_1)$sigma, 3)
[1] 4.906

This RSE value is considerably smaller than the RSE of a model with only an intercept:

lmObj_onlyIntercept <- lm(mpg ~ +1, data = Auto)
RSS_onlyIntercept   <- sum(resid(lmObj_onlyIntercept)^2)
n                   <- length(resid(lmObj_onlyIntercept))
RSE_onlyIntercept   <- sqrt(RSS_onlyIntercept/(n-1))
round(RSE_onlyIntercept, 3)
[1] 7.805

Thus, the larger model with horsepower included explains more of the variance in the response variable mpg. Including horsepower as a predictor reduces the RSE by roughly 37% (more precisely, by 37.15%).
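This percentage can be computed directly from the two RSE values above:

## Relative reduction of the RSE due to including horsepower
round(100 * (RSE_onlyIntercept - RSE) / RSE_onlyIntercept, 2)   # approx. 37.15 %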

The \(R^2\) value:

round(summary(lmObj_1)$r.squared, 2)
[1] 0.61

shows that about \(61\%\) of the variability in the response mpg can be explained using an intercept and horsepower as predictor.

The value of the \(F\)-statistic

round(summary(lmObj_1)$fstatistic, 2)
 value  numdf  dendf 
599.72   1.00 390.00 

is much larger than \(1\), which means that the linear regression model with intercept and horsepower fits the data significantly better than the trivial regression model with only an intercept.

iii) Is the relationship between the predictor and the response positive or negative?

Answer:

The relationship is negative, as we can see from the parameter estimate for horsepower

coef(lmObj_1)[2]
horsepower 
-0.1578447 

iv) What is the predicted mpg associated with a horsepower of \(98\)? What is the associated \(95\%\) confidence interval?

Answer:

The predicted value plus confidence interval:

# Horsepower of 98
new_df <- data.frame(horsepower = 98)

# confidence interval 
predict(object = lmObj_1, newdata = new_df, interval = "confidence")
       fit      lwr      upr
1 24.46708 23.97308 24.96108

8 b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

Answer:

plot(x = Auto$horsepower, y = Auto$mpg, ylab = "MPG", xlab = "Horsepower")
abline(lmObj_1, col="blue")
legend("topright", 
       legend = c("(y,x)", expression(paste("(",hat(y),",x)"))), 
       pch=c(1,NA), lty=c(NA,1), col=c("black", "blue"))

8 c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

Answer:

par(mfrow=c(2,2))
plot(lmObj_1, col='blue')

Looking at the smoothing line of the residuals (\(e_i=y_i−\hat{y}_i\)) vs. the fitted values (\(\hat{y}_i\)), there is a strong pattern in the residuals, indicating non-linearity. You can see evidence of this also in the scatter plot in the answer for question 8 b).

There also appears to be non-constant variance in the error terms (heteroscedasticity). This may be corrected to some extent by a quadratic fit; if not, transformations of the response such as \(\log(y)\) or \(\sqrt{y}\) can shrink larger responses by a greater amount and reduce this issue.
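As a sketch of the quadratic fit mentioned above (the object name lmObj_1b is made up), one could refit with a squared horsepower term and re-examine the diagnostic plots:

## A quadratic fit as a possible remedy for the non-linearity
## (and, to some extent, the heteroscedasticity) noted above
lmObj_1b <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
par(mfrow = c(2, 2))
plot(lmObj_1b)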

There are some observations with large standardized residuals and high leverage (hence, high Cook's distance) that we need to review.

Exercise 9

This question involves the use of multiple linear regression on the Auto data set.

9 a) Produce a scatterplot matrix which includes all of the variables in the data set.

Answer:

library("ISLR2")

data("Auto")

# Produce scatterplot matrix
pairs(Auto)

9 b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Answer:

round(cor(subset(Auto, select = -name)), 1)
              mpg cylinders displacement horsepower weight acceleration year
mpg           1.0      -0.8         -0.8       -0.8   -0.8          0.4  0.6
cylinders    -0.8       1.0          1.0        0.8    0.9         -0.5 -0.3
displacement -0.8       1.0          1.0        0.9    0.9         -0.5 -0.4
horsepower   -0.8       0.8          0.9        1.0    0.9         -0.7 -0.4
weight       -0.8       0.9          0.9        0.9    1.0         -0.4 -0.3
acceleration  0.4      -0.5         -0.5       -0.7   -0.4          1.0  0.3
year          0.6      -0.3         -0.4       -0.4   -0.3          0.3  1.0
origin        0.6      -0.6         -0.6       -0.5   -0.6          0.2  0.2
             origin
mpg             0.6
cylinders      -0.6
displacement   -0.6
horsepower     -0.5
weight         -0.6
acceleration    0.2
year            0.2
origin          1.0

9 c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output by answering the below questions 9 c i) to 9 c iii).

Answer:

# Perform multiple linear regression
fit.lm <- lm(mpg ~ . -name, data=Auto)

# Print results
summary(fit.lm)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

9 c i) Is there a relationship between the predictors and the response?

Answer:

Yes, there is a relationship between the predictors and the response. Testing the null hypothesis that all regression coefficients (except the intercept) are zero, i.e. H\(_0\): \(\beta_1=\dots=\beta_7=0\), we see that the \(F\)-statistic is large and its \(p\)-value is close to zero, which provides strong evidence against the null hypothesis.
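The overall \(F\)-test can also be carried out explicitly from the summary object; a small sketch:

## Overall F-test of H0: beta_1 = ... = beta_7 = 0
fstat <- summary(fit.lm)$fstatistic                    # value, numdf, dendf
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)   # p-value close to zero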

9 c ii) Which predictors appear to have a statistically significant relationship to the response?

Answer:

Looking at the \(p\)-values associated with each predictor’s \(t\)-statistic, we see that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.

Caution: This consideration neglects issues due to multiple testing. When testing at the significance level \(\alpha=0.05\), each single test has a type I error rate (falsely rejecting H\(_0\)) of up to \(5\%\). These type I error rates accumulate since we consider seven hypothesis tests simultaneously, and thus the probability of seeing at least one type I error among the seven tests can be as large as \(7\cdot 5\%=35\%\). So it is quite likely to see at least one type I error.

Bonferroni correction for multiple testing: To declare any of the seven predictors statistically significant, the corresponding \(p\)-value must be smaller than \(\alpha/7\). For instance, with \(\alpha/7=0.05/7\approx 0.007\), only weight, year, and origin have a statistically significant relationship to the response.
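A small sketch of this Bonferroni comparison using the \(p\)-values from the summary output:

## Bonferroni-corrected significance check at alpha = 0.05
p_values <- summary(fit.lm)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
p_values < 0.05 / length(p_values)   # TRUE only for weight, year, and origin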

9 c iii) What does the coefficient for the year variable suggest?

Answer:

The regression coefficient for year suggests that, holding all other predictors fixed, an increase of the construction year by one year is associated with an average increase in mpg of about \(0.75\).

9 d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Answer:

par(mfrow=c(4,1))
plot(fit.lm)

  • The “Residuals vs Fitted” plot (1st plot) shows some systematic deviations of the residuals from \(0\). The reason is that we impose a straight-line (more precisely, a hyperplane) fit for the conditional mean function \(E[Y|X]=f(X)\), which appears to be non-linear here. This results in a systematic underestimation of the true conditional mean function for small and large fitted values \(\hat{y}=\hat\beta_0+\hat\beta_1x_1+\dots+\hat\beta_px_p\).

  • The “Normal Q-Q” plot (2nd plot) suggests non-normally distributed residuals; in particular, the upper tail deviates from that of a normal distribution.

  • The “Scale-Location” plot (3rd plot) is rather inconclusive about heteroscedasticity. However, the “Residuals vs Fitted” plot (1st plot) shows some clear signs of heteroscedastic residuals.

  • The “Residuals vs Leverage” plot (4th plot) shows some potential outliers, namely points with standardized residuals below \(-2\) or above \(+2\). It also shows potentially problematic high-leverage points whose leverage values heavily exceed the rule-of-thumb threshold \((p+1)/n=8/392\approx 0.02\). Points with simultaneously high leverage and large absolute standardized residuals should be handled with care since they may distort the estimation.

9 e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Answer:

Violating the hierarchical principle:

fit.lm0 <- lm(mpg ~ horsepower+cylinders+year+weight:displacement, 
              data=Auto)
summary(fit.lm0)

Call:
lm(formula = mpg ~ horsepower + cylinders + year + weight:displacement, 
    data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1046 -2.8861 -0.2415  2.3967 15.3221 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -1.343e+01  5.043e+00  -2.663  0.00807 ** 
horsepower          -3.914e-02  1.278e-02  -3.063  0.00234 ** 
cylinders           -1.358e+00  3.233e-01  -4.201 3.31e-05 ***
year                 6.661e-01  6.019e-02  11.067  < 2e-16 ***
weight:displacement -3.354e-06  1.352e-06  -2.480  0.01355 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.985 on 387 degrees of freedom
Multiple R-squared:  0.7419,    Adjusted R-squared:  0.7393 
F-statistic: 278.2 on 4 and 387 DF,  p-value: < 2.2e-16

Following the hierarchical principle:

fit.lm1 <- lm(mpg~horsepower+cylinders+year+weight*displacement, 
              data=Auto)
summary(fit.lm1)

Call:
lm(formula = mpg ~ horsepower + cylinders + year + weight * displacement, 
    data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7530 -1.8228 -0.0602  1.5780 12.6133 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -2.210e+00  3.819e+00  -0.579  0.56316    
horsepower          -3.396e-02  9.560e-03  -3.552  0.00043 ***
cylinders            2.072e-01  2.914e-01   0.711  0.47756    
year                 7.858e-01  4.555e-02  17.250  < 2e-16 ***
weight              -1.084e-02  6.346e-04 -17.076  < 2e-16 ***
displacement        -7.947e-02  9.905e-03  -8.023 1.26e-14 ***
weight:displacement  2.431e-05  2.141e-06  11.355  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.976 on 385 degrees of freedom
Multiple R-squared:  0.8568,    Adjusted R-squared:  0.8546 
F-statistic: 384.1 on 6 and 385 DF,  p-value: < 2.2e-16

Note that there is a difference between using A:B and A*B when specifying a regression formula. While the former includes only the interaction term between the variables A and B, the latter also includes the stand-alone main effects A and B.

Generally, you should follow the hierarchical principle for interaction effects: If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant.
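A quick way to see the equivalence of the two notations (a sketch; the object names are made up): the following two calls fit exactly the same model.

## A*B is shorthand for A + B + A:B -- both calls give identical coefficients
fit_star <- lm(mpg ~ weight * displacement, data = Auto)
fit_long <- lm(mpg ~ weight + displacement + weight:displacement, data = Auto)
all.equal(coef(fit_star), coef(fit_long))   # TRUE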

9 f)

Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

Answer:

fit.lm2 <- lm(mpg~log(weight)+sqrt(horsepower)+
                acceleration+I(acceleration^2),
              data=Auto)
summary(fit.lm2)

Call:
lm(formula = mpg ~ log(weight) + sqrt(horsepower) + acceleration + 
    I(acceleration^2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2932  -2.5082  -0.2237   2.0237  15.7650 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       178.30303   10.80451  16.503  < 2e-16 ***
log(weight)       -14.74259    1.73994  -8.473 5.06e-16 ***
sqrt(horsepower)   -1.85192    0.36005  -5.144 4.29e-07 ***
acceleration       -2.19890    0.63903  -3.441 0.000643 ***
I(acceleration^2)   0.06139    0.01857   3.305 0.001037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.99 on 387 degrees of freedom
Multiple R-squared:  0.7414,    Adjusted R-squared:  0.7387 
F-statistic: 277.3 on 4 and 387 DF,  p-value: < 2.2e-16
##
par(mfrow=c(4,1))
plot(fit.lm2)

This fit suffers from essentially the same issues as the model considered in 9 d).

Let’s consider again the model with all predictors (except name), but with transforming the outcome variable mpg by a \(\log\)-transformation.

fit.lm3 <- lm(log(mpg) ~ . -name, data=Auto)
summary(fit.lm3)

Call:
lm(formula = log(mpg) ~ . - name, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40955 -0.06533  0.00079  0.06785  0.33925 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1191 on 384 degrees of freedom
Multiple R-squared:  0.8795,    Adjusted R-squared:  0.8773 
F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16
##
par(mfrow=c(4,1))
plot(fit.lm3)

This model specification is much better!

  • No clear issues of systematic under/over estimations for given fitted values.
  • No clear issues of heteroscedastic residuals.
  • The normality assumption may be violated, but this is not problematic here since we have a large dataset, so that a central limit theorem makes the estimators asymptotically normally distributed.
  • There is one high-leverage point, which, however, has a small residual.