The linear regression model assumes a linear relationship between \(Y\) and the predictor(s) \(X\).
The simple (only one predictor) linear regression model: \[
Y\approx \beta_0 + \beta_1 X
\]
For instance,
sales\(\approx \beta_0 + \beta_1\)TV
2.1.1 Estimating the Regression Coefficients
Let \[
(X_{1},Y_1),\dots,(X_{n},Y_n)
\] denote a training data random sample. I.e. \[
(X_{i},Y_i)\overset{iid}{\sim}(X,Y),\quad\text{for all}\quad i=1,\dots,n.
\] Moreover, let \[
Y_i=\beta_0 + \beta_1 X_i +\epsilon_i \quad\text{for all}\quad i=1,\dots,n,
\tag{2.1}\] where
\(\epsilon_i\) and \(X_i\) are independent from each other for all \(i=1,\dots,n\)
\(E(\epsilon_i)=0\) for all \(i=1,\dots,n\)
\(Var(\epsilon_i)=\sigma^2>0\) is constant for all \(i=1,\dots,n\)
Note
The assumption \(f(X) = \beta_0 + \beta_1 X\) may be a useful working model. However, despite what many textbooks might tell us, we seldom believe that the true (unknown) relationship is that simple.
For a given observed realization of the training data random sample \[
(x_1,y_1),\dots,(x_n,y_n)
\] we choose \(\hat\beta_0\) and \(\hat\beta_1\) such that the Residual Sum of Squares criterion is minimized: \[
\begin{align*}
\operatorname{RSS}\equiv \operatorname{RSS}(\hat{\beta}_0,\hat{\beta}_1)
& = e_1^2 + \dots + e_n^2\\[2ex]
&=\sum_{i=1}^n\left(y_i - \left(\hat\beta_0 + \hat\beta_1x_i\right)\right)^2\\[2ex]
&=\sum_{i=1}^n\left(y_i - \hat{y}_i\right)^2
\end{align*}
\] The minimizers are \[
\hat\beta_1=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}
\] and \[
\hat\beta_0=\bar{y} - \hat\beta_1\bar{x},
\] where \(\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i\) and \(\bar{x}=\frac{1}{n}\sum_{i=1}^nx_i\).
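As a quick sanity check, these closed-form minimizers can be computed directly in R and compared with the output of lm(); a minimal sketch with arbitrary toy data (the object names are ours):
set.seed(1)
x <- runif(20)             # arbitrary predictor values
y <- 1 + 2 * x + rnorm(20) # arbitrary outcome values
beta_1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0_hat <- mean(y) - beta_1_hat * mean(x)
c(beta_0_hat, beta_1_hat)
coef(lm(y ~ x))            # should match the hand-computed values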
2.1.2 Assessing the Accuracy of the Coefficient Estimates
If the assumption of Equation 2.1 is correct, i.e., if the data is actually generated according to the model \[
Y_i=\beta_0 + \beta_1 X_i +\epsilon_i,
\] then ordinary least squares estimators \[
\hat\beta_0\quad\text{and}\quad\hat\beta_1
\] are unbiased estimators, that is \[
\begin{align*}
\operatorname{Bias}(\hat\beta_0)&=E(\hat\beta_0)-\beta_0=0\\
\operatorname{Bias}(\hat\beta_1)&=E(\hat\beta_1)-\beta_1=0.
\end{align*}
\tag{2.2}\] I.e., on average, the estimation results equal the true (unknown) parameters.
Note: In Equation 2.2, we consider \(\hat\beta_0\) and \(\hat\beta_1\) as random variables for which we can compute mean values. The estimators \(\hat\beta_0\) and \(\hat\beta_1\) are indeed random variables since they depend on the random variables in the random sample \((X_i,Y_i)\overset{iid}{\sim}(X,Y)\): \[
\hat\beta_1=\frac{\sum_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^n(X_i-\bar{X})^2}
\] and \[
\hat\beta_0=\bar{Y} - \hat\beta_1\bar{X}.
\]
In an actual data analysis, we only have one realization of the estimators \(\hat\beta_0\) and \(\hat\beta_1\) computed from the given dataset (i.e. the observed realization of the random sample). Each single estimation result will have estimation errors, i.e., \[
\hat\beta_0\neq \beta_0\quad\text{and}\quad\hat\beta_1\neq \beta_1.
\]
The following code generates artificial data to reproduce the plot in Figure 3.3 of our course textbook ISLR.
## ################################
## A function to generate data
## similar to that shown in Fig 3.3
## ################################
## A Function to simulate data
mySimpleRegrDataGenerator <- function(){
  n      <- 50    # sample size
  beta_0 <- 0.1   # intercept parameter
  beta_1 <- 5     # slope parameter
  X      <- runif(n, min = -2, max = 2)    # predictor
  error  <- rnorm(n, mean = 0, sd = 8.5)   # error term
  Y      <- beta_0 + beta_1 * X + error    # outcome
  ##
  return(data.frame("Y" = Y, "X" = X))
}
## Generate a first realization of the data
set.seed(123)
data_sim <- mySimpleRegrDataGenerator()
head(data_sim)
Y X
1 -18.4853427 -0.8496899
2 12.9872926 1.1532205
3 -0.4167901 -0.3640923
4 -1.9138159 1.5320696
5 19.5667725 1.7618691
6 -5.3639241 -1.8177740
Using repeated samples from the data generating process defined in mySimpleRegrDataGenerator(), we can generate multiple estimation results \(\hat\beta_0\) and \(\hat\beta_1\) for the unknown simple linear regression parameters \(\beta_0\) and \(\beta_1\) and plot the corresponding empirical regression lines:
## Estimation
lm_obj <- lm(Y ~ X, data = data_sim)
## Plotting the results
## True (usually unknown) parameter values
beta_0 <- 0.1 # intercept parameter
beta_1 <- 5   # slope parameter
par(mfrow = c(1, 2)) # Two plots side by side
## First Plot (fit for the first realization of the data)
plot(x = data_sim$X, y = data_sim$Y, xlab = "X", ylab = "Y")
abline(a = beta_0, b = beta_1, col = "red")
abline(lm_obj, col = "blue")
## Second Plot (fits for multiple data realizations)
plot(x = data_sim$X, y = data_sim$Y, xlab = "X", ylab = "Y",
     type = "n") # type = "n": empty plot
##
for(r in 1:10){
  data_sim_new <- mySimpleRegrDataGenerator()
  lm_obj_new   <- lm(Y ~ X, data = data_sim_new)
  abline(lm_obj_new, col = "lightskyblue")
}
## Adding the first fit
abline(a = beta_0, b = beta_1, col = "red", lwd = 2)
abline(lm_obj, col = "blue", lwd = 2)
Coding Challenge (Nr 1):
Use the above mySimpleRegrDataGenerator() function to generate many, for instance \(B=10000\), realizations of the artificial data generating process, and compute the corresponding \(B\) realizations of \(\hat\beta_0\) and \(\hat\beta_1:\)\[
\hat\beta_{0,1},\hat\beta_{0,2},\dots,\hat\beta_{0,B}
\]\[
\hat\beta_{1,1},\hat\beta_{1,2},\dots,\hat\beta_{1,B}
\] The averages of these realizations can be used to approximate the true means of the estimators, i.e. \[
\begin{align*}
E(\hat\beta_0)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{0,b}\\[2ex]
E(\hat\beta_1)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{1,b}
\end{align*}
\] and thus also the biases \[
\begin{align*}
\operatorname{Bias}(\hat\beta_0)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{0,b} -\beta_0\\
\operatorname{Bias}(\hat\beta_1)&\approx \frac{1}{B}\sum_{b=1}^B\hat\beta_{1,b}-\beta_1.
\end{align*}
\] The approximations become arbitrarily precise as \(B\to\infty\) due to the Law of Large Numbers.
Question: Are the estimators \(\hat\beta_0\) and \(\hat\beta_1\) unbiased for the artificial data generating process defined by our mySimpleRegrDataGenerator() function?
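A possible sketch of such a simulation (one of many valid solutions; try it yourself first):
set.seed(123)
B          <- 10000
beta_0_hat <- numeric(B)
beta_1_hat <- numeric(B)
for(b in 1:B){
  data_b        <- mySimpleRegrDataGenerator()
  coefs_b       <- coef(lm(Y ~ X, data = data_b))
  beta_0_hat[b] <- coefs_b[1]
  beta_1_hat[b] <- coefs_b[2]
}
## Approximate biases (true values: beta_0 = 0.1, beta_1 = 5)
mean(beta_0_hat) - 0.1
mean(beta_1_hat) - 5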
The magnitude of the estimation errors \[
\hat\beta_0-\beta_0\neq 0 \quad\text{and}\quad\hat\beta_1-\beta_1\neq 0
\] is expressed in units of Standard Errors: \[
\operatorname{SE}(\hat\beta_0)=\sqrt{Var(\hat\beta_0)}=\sqrt{\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right]}
\] and \[
\operatorname{SE}(\hat\beta_1)=\sqrt{Var(\hat\beta_1)}=\sqrt{\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}},
\] where \[\begin{align*}
\sigma^2 & =Var(\epsilon_i)\\
\Leftrightarrow\quad \sigma & =SD(\epsilon_i) = \sqrt{Var(\epsilon_i)}
\end{align*}\] for all \(i=1,\dots,n.\)
Typically, the variance and thus also the Standard Deviation of the error term, \(\sigma = SD(\epsilon),\) are unknown, but we can estimate \(\sigma\) by the Residual Standard Error: \[\begin{align*}
\hat{\sigma}=\operatorname{RSE}
&=\sqrt{\frac{1}{n-2}\operatorname{RSS}}\\[2ex]
&=\sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i-\hat{y}_i)^2}\\[2ex]
&=\sqrt{\frac{1}{n-2}\sum_{i=1}^n\left(y_i-\left(\hat{\beta}_0+\hat{\beta}_1x_i\right)\right)^2}
\end{align*}\] where \(\operatorname{RSS}=\sum_{i=1}^n(y_i-\hat{y}_i)^2\) is the Residual Sum of Squares.
Note
We subtract \(2\) from the sample size \(n\) since \(n-2\) are the remaining degrees of freedom in the data after estimating two parameters \(\hat\beta_0\) and \(\hat\beta_1\).
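To make the formula concrete, the RSE can be computed by hand and compared with the value reported by summary(); a minimal sketch using the lm_obj fit from above:
e <- residuals(lm_obj)    # e_i = y_i - y_i_hat
n <- nrow(data_sim)
sqrt(sum(e^2) / (n - 2))  # RSE computed by hand
summary(lm_obj)$sigma     # RSE as reported by R; should match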
Confidence Intervals
Knowing the standard errors \[
\operatorname{SE}(\hat\beta_0)\quad\text{and}\quad\operatorname{SE}(\hat\beta_1)
\] allows us to construct approximate 95% Confidence Intervals: \[
\begin{align*}
\operatorname{CI}_{\beta_0}
&=\left[\hat{\beta}_0-2\operatorname{SE}(\hat\beta_0),\;
\hat{\beta}_0+2\operatorname{SE}(\hat\beta_0)\right] = \hat\beta_0\pm 2\operatorname{SE}(\hat\beta_0)\\[2ex]
\operatorname{CI}_{\beta_1}
&=\left[\hat{\beta}_1-2\operatorname{SE}(\hat\beta_1),\;
\hat{\beta}_1+2\operatorname{SE}(\hat\beta_1)\right] = \hat\beta_1\pm 2\operatorname{SE}(\hat\beta_1)
\end{align*}
\]
Interpretation
There is approximately a 95% chance (across resamplings from the data generating process) that the random confidence interval \(\operatorname{CI}_{\beta_1}\) contains the true (fixed) parameter value \(\beta_1\).
To understand the interpretation of confidence intervals, it is very instructive to look at visualizations:
Only the above frequentist, repeated-sampling point of view admits a clean interpretation. Interpreting a single given, observed confidence interval is hard and often done incorrectly.
A given, observed confidence interval (computed from the observed realization of the training data) either contains the true parameter value or not and usually we do not know it.
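The repeated-sampling interpretation can also be checked numerically; a minimal sketch (our own, using confint() and the data generator from above) that estimates the coverage frequency of the 95% confidence interval for \(\beta_1\) (true value \(5\)):
set.seed(123)
B       <- 100
covered <- logical(B)
for(b in 1:B){
  data_b     <- mySimpleRegrDataGenerator()
  ci_b       <- confint(lm(Y ~ X, data = data_b), level = 0.95)["X", ]
  covered[b] <- (ci_b[1] <= 5) & (5 <= ci_b[2])
}
mean(covered) # share of CIs covering beta_1 = 5; should be close to 0.95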
Confidence Intervals for Statistical Hypothesis Testing
We can use a \((1-\alpha)\cdot 100\%\) confidence interval to do statistical hypothesis testing at the significance level \(0<\alpha<1.\) Typical significance levels:
\(\alpha=0.05\)
\(\alpha=0.01\)
Let us consider the following null-hypothesis \((H_0)\) that the true (usually unknown) value \(\beta_1\) equals the null-hypothetical value\(\beta^{(H_0)}_{1}\) versus the two-sided alternative hypothesis \((H_1)\) that the true (usually unknown) value \(\beta_1\) does not equal the null-hypothetical value \(\beta^{(H_0)}_{1}:\)\[
\begin{align*}
H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\
H_1:&\;\beta_1\neq \beta^{(H_0)}_{1}
\end{align*}
\]
Classic No-Effect Null-Hypothesis
For the special case, where the null-hypothetical value \(\beta^{(H_0)}_{1}=0\) we test the classic no-effect null-hypothesis: \[
\begin{align*}
H_0:&\;\text{There is no relationship between $Y$ and $X$; i.e. $\beta_1=0$}\\
H_1:&\;\text{There is a relationship between $Y$ and $X$; i.e. $\beta_1\neq 0$}
\end{align*}
\]
Testing-Procedure:
If the observed (obs) realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs},\) contains the null-hypothetical value \(\beta^{(H_0)}_{1},\) i.e. \[
\begin{align*}
\beta^{(H_0)}_{1}&\in\operatorname{CI}_{\beta_1,obs}\\
\Leftrightarrow\beta^{(H_0)}_{1}&\in\left[\hat{\beta}_{1,obs}-2\operatorname{SE}_{obs}(\hat\beta_1),\;\hat{\beta}_{1,obs}+2\operatorname{SE}_{obs}(\hat\beta_1)\right],
\end{align*}
\] then we cannot reject the null hypothesis that \(\beta_1=\beta^{(H_0)}_{1}.\)
If, however, the observed (obs) realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs},\) does not contain the null-hypothetical value \(\beta^{(H_0)}_{1},\) i.e. \[
\begin{align*}
\beta^{(H_0)}_{1}&\not\in\operatorname{CI}_{\beta_1,obs}\\
\Leftrightarrow\beta^{(H_0)}_{1}&\not\in\left[\hat{\beta}_{1,obs}-2\operatorname{SE}_{obs}(\hat\beta_1),\;\hat{\beta}_{1,obs}+2\operatorname{SE}_{obs}(\hat\beta_1)\right],
\end{align*}
\] then we can reject the null hypothesis and adopt the alternative that \(\beta_1\neq\beta^{(H_0)}_{1}.\)
Probability of a Type I Error is at Most \(\alpha\)
If the null-hypothesis is true, i.e. if the true (unknown) \(\beta_1\) equals the null-hypothetical value \(\beta^{(H_0)}_{1},\)\[
\beta_1 = \beta^{(H_0)}_{1}
\] then the random confidence interval\(\operatorname{CI}_{\beta_1}\) covers the deterministic null-hypothetical value \(\beta^{(H_0)}_{1}\) with probability at least \(1-\alpha\)\[
P(\beta^{(H_0)}_{1} \in \operatorname{CI}_{\beta_1}| H_0\text{ is true})\geq 1-\alpha.
\] I.e., across \(100\) resamples we expect, on average, at least \((1-\alpha)\cdot 100\) coverage events.
Probability of a Type I Error: Thus, the probability of falsely rejecting the null hypothesis even though the null hypothesis is true (i.e. committing a type I error) is smaller than or equal to the chosen significance level \(\alpha,\) since \[
\begin{align*}
P(\beta^{(H_0)}_{1} \in \operatorname{CI}_{\beta_1}| H_0\text{ is true})&\geq 1-\alpha\\
\Leftrightarrow\quad 1-P(\beta^{(H_0)}_{1} \not\in \operatorname{CI}_{\beta_1}| H_0\text{ is true})&\geq 1-\alpha\\
\Leftrightarrow\quad \underbrace{P(\beta^{(H_0)}_{1} \not\in \operatorname{CI}_{\beta_1}| H_0\text{ is true})}_{\text{Probability of a Type I Error}}&\leq \alpha
\end{align*}
\] Note: A type I error (rejecting \(H_0\) even though \(H_0\) is true) is also called a false positive event. We want that false positives happen only (very) rarely and therefore choose a small significance level \(\alpha\) such as \(\alpha=0.05\) or \(\alpha=0.01.\)
Coding Challenge (Nr 2):
1. Use the above mySimpleRegrDataGenerator() function to generate one data set. Use this data set to compute the observed realization of the confidence interval, \(\operatorname{CI}_{\beta_1,obs},\) for the (in principle unknown) parameter \(\beta_1.\) Use this observed realization of the confidence interval to test \[
\begin{align*}
H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\
\text{versus}\quad
H_1:&\;\beta_1\neq \beta^{(H_0)}_{1}
\end{align*}
\] with \(\beta^{(H_0)}_{1}=5;\) i.e. with a correct null-hypothesis, since \(\beta_1=5\) in mySimpleRegrDataGenerator().
Can you reject \(H_0\)? Is the test-decision in line with your expectation?
2. Now, use the above mySimpleRegrDataGenerator() function to generate \(B=10,000\) data sets. Compute for each of these data sets the observed realization of the confidence interval, \[
\operatorname{CI}_{\beta_1,obs,b}\quad\text{for}\quad b=1,\dots,B.
\]
Use each of the observed confidence interval realizations to test \[
\begin{align*}
H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\
\text{versus}\quad
H_1:&\;\beta_1\neq \beta^{(H_0)}_{1}
\end{align*}
\] with \(\beta^{(H_0)}_{1}=5;\) i.e. with a correct null-hypothesis, since \(\beta_1=5\) in mySimpleRegrDataGenerator().
What is the relative frequency of \(H_0\)-rejections? Is this relative frequency in line with your expectation?
3. Repeat 1. and 2., but with \[
\beta^{(H_0)}_{1}=0;
\] i.e. with a false null-hypothesis (since \(\beta_1=5\) in mySimpleRegrDataGenerator()).
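A possible sketch for the repeated-testing parts 2. and 3. of this challenge (one of many valid solutions):
testH0 <- function(beta_1_H0, B = 10000){
  rejections <- logical(B)
  for(b in 1:B){
    data_b        <- mySimpleRegrDataGenerator()
    ci_b          <- confint(lm(Y ~ X, data = data_b), level = 0.95)["X", ]
    rejections[b] <- (beta_1_H0 < ci_b[1]) | (beta_1_H0 > ci_b[2])
  }
  mean(rejections) # relative frequency of H0-rejections
}
set.seed(123)
testH0(beta_1_H0 = 5) # true H0: rejected in roughly 5% of cases
testH0(beta_1_H0 = 0) # false H0: rejected in (almost) all cases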
Test Statistics for Statistical Hypothesis Testing
Standard errors can also be used to construct test statistics for statistical hypothesis testing. In the following, we look at the \(t\)-test statistic.
Choose a significance level \(0<\alpha<1\) such as, for instance,
\(\alpha=0.05\)
\(\alpha=0.01\)
Let us (again) consider the null-hypothesis \((H_0)\) that the true (usually unknown) value \(\beta_1\) equals the null-hypothetical value\(\beta^{(H_0)}_{1}\) versus the two-sided alternative hypothesis \((H_1)\) that the true (usually unknown) value \(\beta_1\) does not equal the null-hypothetical value \(\beta^{(H_0)}_{1}:\)\[
\begin{align*}
H_0:&\;\beta_1=\beta^{(H_0)}_{1}\\
H_1:&\;\beta_1\neq \beta^{(H_0)}_{1}
\end{align*}
\]
Under \(H_0,\) i.e. if the (unknown) true parameter \(\beta_1\) equals the null-hypothetical value, i.e. if \(\beta_1=\beta^{(H_0)}_{1},\) the random \(t\)-test statistic has a \(t\)-distribution with \((n-2)\) degrees of freedom, \[
t=\frac{\hat\beta_1 - \beta^{(H_0)}_{1}}{\operatorname{SE}(\hat\beta_1)}\overset{H_0}{\sim}t_{(n-2)}.
\]
Tip
In the multiple linear regression model \(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\) (Section 2.2), the \(t\)-test statistic for the \(j\)th predictor has the following null-distribution: \[
t=\frac{\hat\beta_j - \beta^{(H_0)}_{j}}{\operatorname{SE}(\hat\beta_j)}\overset{H_0}{\sim}t_{(n-p-1)}.
\]
\(p\)-value: The \(p\)-value is the probability of seeing a realization of the random \(t\)-test statistic, \(t,\) that is at least as extreme as the observed value of the test statistic, \(t_{obs}:\) \[\begin{align*}
p_{obs}
&=P_{H_0}\left(|t|\geq|t_{obs}|\right)\\[2ex]
&=2\cdot\min\{P_{H_0}\left(t\geq t_{obs} \right),\; P_{H_0}\left(t\leq t_{obs} \right)\}.
\end{align*}\] where \(P_{H_0}\) means that the probability is computed “under \(H_0\),” i.e. for the scenario that \(H_0\) is true \((\beta_1=\beta^{(H_0)}_{1}).\)
Testing-Procedure:
If the observed realization of the \(p\)-value is larger than or equal to the significance level \[
p_{obs}\geq \alpha,
\] then we cannot reject the null hypothesis.
If, however, the observed realization of the \(p\)-value is strictly smaller than the significance level \[
p_{obs}<\alpha,
\] then we can reject the null hypothesis and adopt the alternative hypothesis.
Note
It can be shown that the above statistical hypothesis test based on confidence intervals is equivalent to the statistical hypothesis test based on the \(t\)-test statistic.
In both cases, the probability of falsely rejecting the null hypothesis even though the null hypothesis is true, is smaller than or equal to the chosen significance level \(\alpha.\)
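For illustration, the \(t\)-statistic and its two-sided \(p\)-value for the no-effect null hypothesis \(\beta^{(H_0)}_{1}=0\) can be computed by hand and compared against the summary() output; a minimal sketch using lm_obj and data_sim from above:
est   <- coef(summary(lm_obj))["X", "Estimate"]
se    <- coef(summary(lm_obj))["X", "Std. Error"]
t_obs <- (est - 0) / se # t-statistic for beta_1_H0 = 0
n     <- nrow(data_sim)
2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)    # two-sided p-value
coef(summary(lm_obj))["X", c("t value", "Pr(>|t|)")]  # should match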
2.1.3 Assessing the Accuracy of the Model
An accurate model tends to have …
a low residual standard error \(\operatorname{RSE}\)\[
\operatorname{RSE}=\hat\sigma=\sqrt{\frac{\operatorname{RSS}}{n-2}}
\]
a high \(R^2\)
\[
R^2=\frac{\operatorname{TSS}-\operatorname{RSS}}{\operatorname{TSS}}=1-\frac{\operatorname{RSS}}{\operatorname{TSS}},
\] where \(0\leq R^2\leq 1\) and \[
\begin{align*}
\operatorname{TSS}&=\sum_{i=1}^n\left(y_i-\bar{y}\right)^2\\
\operatorname{RSS}&=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\
\hat{y}_i&=\hat\beta_0+\hat\beta_1x_i
\end{align*}
\]
TSS: “Total Sum of Squares”
RSS: “Residual Sum of Squares”
Danger
Cautionary Note Nr 1: Do not forget that there is an irreducible error \(Var(\epsilon)=\sigma^2>0\). Thus
very low \(\operatorname{RSE}\) values, \(\operatorname{RSE}\approx 0\), and
very high \(R^2\) values, \(R^2\approx 1\),
can be warning signals indicating overfitting. While overfitting typically does not happen with a simple linear regression model, it can happen with a multiple linear regression model.
Cautionary Note Nr 2: The \(R^2\) and \(\operatorname{RSE}\) are only based on training data. In Chapter 1, we have seen that a proper assessment of the model accuracy needs to take into account test data.
\(R^2\) and correlation coefficient
In the case of the simple linear regression model, \(R^2\) equals the squared sample correlation coefficient between \(Y\) and \(X\), \[
R^2 = r_{yx}^2,
\] where \[
r_{yx}=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}.
\]
Tip
In the multiple linear regression model \(Y_i=\beta_0+\sum_{j=1}^p\beta_jX_{ij}+\epsilon_i\) (Section 2.2), the \(R^2\) equals the squared correlation between response and the fitted values: \[
R^2=r^2_{y\hat{y}}
\] with \[
r_{y\hat{y}}=\frac{\sum_{i=1}^n(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}\sqrt{\sum_{i=1}^n(\hat{y}_i-\bar{\hat{y}})^2}}.
\]
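A quick numerical check of this identity (a sketch using the lm_obj fit from above):
summary(lm_obj)$r.squared         # R^2 as reported by R
cor(data_sim$Y, fitted(lm_obj))^2 # squared correlation; should match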
2.2 Multiple Linear Regression
The multiple linear regression model allows for more than only one predictor: \[
Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon
\]
Let \[
(X_{11},\dots,X_{1p},Y_1),\dots,(X_{n1},\dots,X_{np},Y_n)
\] denote a training data random sample. I.e. \[
(X_{i1},\dots,X_{ip},Y_i)\overset{iid}{\sim}(X_{1},\dots,X_{p},Y)
\] for all \(i=1,\dots,n.\)
Moreover, let \[
Y_i=\beta_0+\beta_1X_{i1}+\dots+\beta_p X_{ip}+\epsilon_i
\tag{2.3}\] for all \(i=1,\dots,n,\) where
\((p+1)<n\)
\(\epsilon_i\) and the vector of predictors \((X_{i1},\dots,X_{ip})\) are independent from each other for all \(i=1,\dots,n\)
\(E(\epsilon_i)=0\) for all \(i=1,\dots,n\)
\(Var(\epsilon_i)=\sigma^2>0\) is constant for all \(i=1,\dots,n\)
Typically, it is more convenient to write the multiple linear regression model in vector and matrix notation. Let \[\begin{align*}
X_i&=\left(\begin{matrix}X_{i0}\\X_{i1}\\ \vdots\\X_{ip}\end{matrix}\right)\quad\text{and}\quad
\beta=\left(\begin{matrix}\beta_{0}\\\beta_1\\ \vdots\\\beta_{p}\end{matrix}\right)
\end{align*}\] with \(X_{i0}=1\) for all \(i=1,\dots,n.\)
This allows us to write Equation 2.3 more compactly as \[
Y_i=X_i'\beta+\epsilon_i \quad\text{for all}\quad i=1,\dots,n.
\]
Next, we can stack all components. Let \[\begin{equation*}
Y=\left(\begin{matrix}Y_1\\ \vdots\\Y_n\end{matrix}\right)\quad\text{and}\quad
\epsilon=\left(\begin{matrix}\epsilon_1\\ \vdots\\\epsilon_n\end{matrix}\right)
\end{equation*}\] denote the \((n\times 1)\) vectors containing all response values \(Y_i\) and all error terms \(\epsilon_i\) of the random sample. Moreover, let \[\begin{align*}
X
&=\left(\begin{matrix}
X_{10}&X_{11}&\dots&X_{1p}\\
\vdots&\vdots&\ddots&\vdots\\
X_{n0}&X_{n1}&\dots&X_{np}\\
\end{matrix}\right)\\[2ex]
&=\left(\begin{matrix}
\;\;1\;\;&X_{11}&\dots&X_{1p}\\
\;\;\vdots\;\;&\vdots&\ddots&\vdots\\
\;\;1\;\;&X_{n1}&\dots&X_{np}\\
\end{matrix}\right)
\end{align*}\] denote the \((n\times (p+1))\)-dimensional matrix containing all predictor values of the random sample, where the first column is a column full of ones \((X_{10}=1,\dots,X_{n0}=1).\)
This allows us to write Equation 2.3 even more compactly as \[
Y=X\beta+\epsilon.
\]
Using matrix algebra, it can be shown that the ordinary least squares estimator of \(\beta=(\beta_0,\beta_1,\dots,\beta_p)'\) is given by the following \(((p+1)\times 1)\) vector: \[
\left(\begin{matrix}\hat\beta_0\\ \vdots \\ \hat\beta_p\end{matrix}\right)=\hat{\beta}=(X'X)^{-1}X'Y.
\]
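The matrix formula can be verified directly in R; a minimal sketch using the simulated data from above (the object names are ours, and the first column of the design matrix is a column of ones):
X_mat    <- cbind(1, data_sim$X) # n x (p+1) design matrix
y_vec    <- data_sim$Y
beta_hat <- solve(t(X_mat) %*% X_mat) %*% t(X_mat) %*% y_vec
drop(beta_hat)
coef(lm(Y ~ X, data = data_sim)) # should match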
For a given observed realization of the training data random sample \[
(x_{11},\dots,x_{1p},y_1),\dots,(x_{n1},\dots,x_{np},y_n),
\] the estimator \(\hat\beta=(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p)'\) is selected by minimizing \[\begin{align*}
\operatorname{RSS}
&=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\[2ex]
&=\sum_{i=1}^n\left(y_i-\left(\hat\beta_0 + \hat\beta_1 x_{i1} + \dots + \hat\beta_p x_{ip}\right)\right)^2.
\end{align*}\]
Since the estimator \[
\hat\beta=\left(\begin{matrix}\hat\beta_0\\ \vdots \\ \hat\beta_p\end{matrix}\right)
\] depends on the random sample, it is itself a \((p+1)\)-dimensional random variable. For each realization of the training data random sample, we observe a realization of the estimator.
In an actual data analysis, however, we only have one realization of the estimators \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) computed from the given training dataset (i.e. the observed realization of the training data random sample).
Interpretation of Multiple Linear Regressions and the Omitted Variable Bias
Multiple linear regression is more than a mere collection of separate simple linear regressions. Take a look at the following two simple linear regression results:
Observations:
In the first simple linear regression, we see a statistically significant effect of radio on sales; i.e. we can reject the null hypothesis \[
H_0:\beta_{radio}=0
\] and adopt the alternative hypothesis \[
H_1:\beta_{radio}\neq 0.
\]
In the second simple linear regression, we see a statistically significant effect of newspaper on sales; i.e. we can reject the null hypothesis \[
H_0:\beta_{newspaper}=0
\] and adopt the alternative hypothesis \[
H_1:\beta_{newspaper}\neq 0.
\]
By contrast, when looking at the multiple linear regression when regressing sales onto
TV,
radio and
newspaper,
then the effect of newspaper becomes statistically insignificant; see Table 3.4.
Reason: Omitted Variable Bias
The reason for this change from a statistically significant effect of newspaper in the simple linear regression, to an insignificant effect in the multiple linear regression is the so-called Omitted Variable Bias.
Explanation of the omitted variables bias:
radio has a true positive effect on sales
newspaper has actually no effect on sales
But, newspaper is correlated with radio: \(r_{\texttt{newspaper},\texttt{radio}}=0.3541\); see Table 3.5
Thus, when omitting radio from the regression model, newspaper becomes a surrogate for radio and we see a spurious effect.
Danger
Interpreting statistically significant results as true effects (“a change of \(X_j\) by one unit causes on average a change in \(Y\) by \(\beta_{j}\)”) is a delicate matter.
Even if \(f(X)\) is really so simple that we can write it as a simple or multiple linear regression model, we may fail to include all relevant predictor variables, and thus statistically significant results may only be spurious effects due to omitted variables.
Interpretation of the Coefficients in Table 3.4
For fixed values of TV and newspaper, spending an additional 1000 USD on radio increases sales, on average, by approximately 189 units.
2.2.2 Inference on \(\beta_1,\dots,\beta_p\)
The \(t\)-test statistic (equivalently the confidence interval for \(\beta_j\)) allows us to test a null-hypothesis about one parameter \(\beta_j.\)
To test whether there is a relationship between the response \(Y\) and the full vector of predictors \((X_1,\dots,X_p),\) we can use the \(F\)-test statistic.
In this case, the \(F\)-test tests the null-hypothesis \[
\begin{align*}
H_0:&\;\beta_1=\beta_2=\dots=\beta_p=0\\
\text{versus}\quad H_1:&\;\text{at least one $\beta_j\neq 0$; $j=1,\dots,p$}
\end{align*}
\]
\(F\)-test statistic: \[
F=\frac{(\operatorname{TSS}-\operatorname{RSS})/p}{\operatorname{RSS}/(n-p-1)}\overset{H_0}{\sim} F_{p,n-p-1}
\] Under \(H_0,\) i.e. if \(H_0\) is true, the \(F\)-test statistic has an \(F\)-distribution with \(p\) numerator and \((n-p-1)\) denominator degrees of freedom.
If \(H_0\) is correct \[
\begin{align*}
E((\operatorname{TSS}-\operatorname{RSS})/p)&=\sigma^2\\[2ex]
E(\operatorname{RSS}/(n-p-1))&=\sigma^2
\end{align*}
\]
Therefore:
If \(H_0\) is correct, we expect values of \(F\approx 1.\)
If \(H_1\) is correct, we expect values of \(F\gg 1.\)
\(p\)-value
\[\begin{align*}
p_{obs}
&=P_{H_0}\left( F \geq F_{obs} \right),
\end{align*}\] where \(F_{obs}\) denotes the observed value of the \(F\)-test statistic computed from the observed training data, and where \(F\) is a random variable that has an \(F_{p,n-p-1}\) distribution. \(P_{H_0}\) means that the probability is computed for the scenario that \(H_0\) is true.
To do the statistical hypothesis test, we need to select a significance level \(\alpha\) (e.g. \(\alpha=0.01\) or \(\alpha=0.05\)).
If the observed realization of the \(p\)-value is larger than or equal to the significance level \[
p_{obs}\geq \alpha,
\] then we cannot reject the null hypothesis.
If, however, the observed realization of the \(p\)-value is strictly smaller than the significance level \[
p_{obs}<\alpha,
\] then we can reject the null hypothesis and adopt the alternative hypothesis.
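For illustration, the \(F\)-statistic can be computed by hand from TSS and RSS and compared with the summary() output; a sketch using the Boston data (ISLR2) with an arbitrary choice of two predictors:
library("ISLR2")
fit <- lm(medv ~ lstat + age, data = Boston)
y   <- Boston$medv
n   <- nrow(Boston); p <- 2
TSS <- sum((y - mean(y))^2)
RSS <- sum(residuals(fit)^2)
F_obs <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
F_obs                                                   # F-statistic by hand
pf(F_obs, df1 = p, df2 = n - p - 1, lower.tail = FALSE) # p-value
summary(fit)$fstatistic                                 # should match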
2.3 Other Considerations in the Regression Model
2.3.1 Qualitative Predictors
Often some predictors are qualitative variables (also known as factor variables). For instance, the Credit dataset contains the following qualitative predictors:
own (house ownership: yes/no)
student (student status: yes/no)
status (marital status: yes/no)
region (regions: east, west or south)
Predictors with Only Two Levels
If a qualitative predictor (factor) only has two levels (i.e. possible values), then incorporating it into a regression model is very simple: We simply create an indicator or dummy variable that takes on two possible numerical values; for instance, \[
x_{i} = \left\{
\begin{array}{ll}
1&\quad \text{if the $i$th person owns a house}\\
0&\quad \text{if the $i$th person does not own a house.}
\end{array}\right.
\] Using this dummy variable as a predictor in the regression equation results in the following regression model: \[\begin{align*}
y_{i}
&=\beta_0 + \beta_1 x_i + \epsilon_i\\[2ex]
&= \left\{
\begin{array}{ll}
\beta_0 + \beta_1 + \epsilon_i &\quad \text{if the $i$th person owns a house}\\
\beta_0 + \epsilon_i &\quad \text{if the $i$th person does not own a house}
\end{array}\right.
\end{align*}\]
Interpretation:
\(\beta_0\): The average credit card balance among those who do not own a house
\(\beta_0+\beta_1\): The average credit card balance among those who do own a house
\(\beta_1\): The average difference in credit card balance between owners and non-owners
Alternatively, instead of a 0/1 coding scheme, we could create a dummy variable \[
x_{i} = \left\{
\begin{array}{ll}
1 &\quad \text{if the $i$th person owns a house}\\
-1 &\quad \text{if the $i$th person does not own a house.}
\end{array}\right.
\]\[\begin{align*}
y_{i}
&=\beta_0 + \beta_1 x_i + \epsilon_i\\[2ex]
&= \left\{
\begin{array}{ll}
\beta_0 + \beta_1 + \epsilon_i&\quad \text{if the $i$th person owns a house}\\
\beta_0 - \beta_1 + \epsilon_i&\quad \text{if the $i$th person does not own a house}
\end{array}\right.
\end{align*}\]
Interpretation:
\(\beta_0\): The overall average credit card balance (ignoring the house ownership effect)
\(\beta_1\): The average amount by which house owners and non-owners have credit card balances that are above and below the overall average, respectively.
Qualitative Predictors with More than Two Levels
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the
region \(\in\{\)South, West, East\(\}\)
variable, we create two dummy variables. The first could be \[
x_{i1} = \left\{
\begin{array}{ll}
1&\quad \text{if the $i$th person is from the South}\\
0&\quad \text{if the $i$th person is not from the South,}
\end{array}\right.
\] and the second could be \[
x_{i2} = \left\{
\begin{array}{ll}
1&\quad \text{if the $i$th person is from the West}\\
0&\quad \text{if the $i$th person is not from the West.}
\end{array}\right.
\] Using both of these dummy variables results in the following regression model: \[\begin{align*}
y_{i}&=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i\\[2ex]
&= \left\{
\begin{array}{ll}
\beta_0 + \beta_1 + \epsilon_i& \quad \text{if the $i$th person is from the South}\\
\beta_0 + \beta_2 + \epsilon_i& \quad \text{if the $i$th person is from the West}\\
\beta_0 + \epsilon_i& \quad \text{if the $i$th person is from the East.}\\
\end{array}\right.
\end{align*}\]
Interpretation:
\(\beta_0\): The average credit card balance for individuals from the East
\(\beta_1\): The difference in the average balance between people from the South versus the East
\(\beta_2\): The difference in the average balance between people from the West versus the East
There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different and have different interpretations, and are designed to measure particular contrasts. (A detailed discussion of contrasts is beyond the scope of this lecture.)
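In R, this dummy coding happens automatically when a factor enters a model formula; a minimal sketch with the ISLR2 Credit data (assuming the variable names Balance and Region used there):
library("ISLR2")
data(Credit)
contrasts(Credit$Region) # shows the 0/1 dummy coding R uses for region
lm_region <- lm(Balance ~ Region, data = Credit)
coef(lm_region) # intercept = baseline level; other coefficients = differences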
2.3.2 Extensions of the Linear Model: Removing the Additive Assumption
The standard (additive) linear model for the Advertising data is \[
\texttt{sales} = \beta_0 + \beta_1\cdot\texttt{TV} + \beta_2\cdot\texttt{radio} + \epsilon,
\] which states, for instance, that the average increase in sales associated with a one-unit increase in TV is \(\beta_1,\) regardless of the amount spent on radio.
However, this simple model may be incorrect. Suppose that there is a synergy effect, such that spending money on radio advertising actually increases the effectiveness of TV advertising.
Figure 3.5 suggests that such an effect may be present in the advertising data:
When levels of either TV or radio are low, then the true sales are lower than predicted by the linear model.
But when advertising is split between the two media, then the model tends to underestimate sales.
Solution: Interaction Effects:
Consider the standard linear regression model with two variables, \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon.
\] Here each predictor \(X_1\) and \(X_2\) has a given effect, \(\beta_1\) and \(\beta_2\), on \(Y\) and this effect does not depend on the value of the other predictor. (Additive Assumption)
One way of extending this model is to include a third predictor, called an interaction term, which is constructed by computing the product of \(X_1\) and \(X_2.\) This results in the model \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 \overbrace{\color{red}X_1X_2}^{=X_3} + \epsilon.
\] This is a powerful extension relaxing the additive assumption. Notice that the model can now be written as \[
\begin{align*}
Y &= \beta_0 + \underbrace{(\beta_1 + \beta_3 X_2)}_{=\tilde{\beta}_1(X_2)} X_1 + \beta_2 X_2 + \epsilon,
\end{align*}
\] where the new slope parameter \(\tilde{\beta}_1(X_2)\) is a linear function of \(X_2,\) i.e. \[
\tilde{\beta}_1(X_2)=\beta_1 + \beta_3 X_2.
\]
Thus, a change in the value of \(X_2\) will change the association between \(X_1\) and \(Y.\)
A similar argument shows that a change in the value of \(X_1\) changes the association between \(X_2\) and \(Y.\)
Let us return to the Advertising example: A linear model that predicts sales using TV, radio, and their interaction takes the form \[
\texttt{sales} = \beta_0 + \beta_1\cdot\texttt{TV} + \beta_2\cdot\texttt{radio} + \beta_3\cdot(\texttt{radio}\times\texttt{TV}) + \epsilon.
\] Here,
\(\beta_3\) denotes the increase in the effectiveness of TV advertising associated with a one-unit increase in radio advertising.
Interpretation of Table 3.9:
Both separate main effects, TV and radio, are statistically significant (\(p\)-values smaller than 0.01).
Additionally, the \(p\)-value for the interaction term, TV\(\times\)radio, is extremely low, indicating that there is strong evidence for \(H_1: \beta_3\neq 0.\) In other words, it is clear that the true relationship is not additive.
Hierarchical Principle of Interaction Terms
If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant.
Interactions with Qualitative Variables:
An interaction between a qualitative variable and a quantitative variable has a particularly nice interpretation.
Consider the Credit data set and suppose that we wish to predict balance using the predictors:
income (quantitative) and
student (qualitative), coded by the dummy variable \[
x_{i2}=\left\{
\begin{array}{ll}
1&\text{if person $i$ is a student}\\
0&\text{if not}\\
\end{array}
\right.
\]
In the absence of an interaction term, the model takes the form \[\begin{align*}
y_{i}
&=\beta_0 + \beta_1\cdot\texttt{income}_i + \beta_2 x_{i2} + \epsilon_i\\[2ex]
&= \left\{
\begin{array}{ll}
(\beta_0 + \beta_2) + \beta_1\cdot\texttt{income}_i + \epsilon_i&\quad \text{if person $i$ is a student}\\
\beta_0 + \beta_1\cdot\texttt{income}_i + \epsilon_i&\quad \text{if not.}
\end{array}\right.
\end{align*}\]
Thus, the regression lines for students and non-students have different intercepts, \(\beta_0+\beta_2\) versus \(\beta_0\), but the same slope \(\beta_1\).
This represents a potentially serious limitation of the model, since a change in income may have a very different effect on the credit card balance of a student versus a non-student.
This limitation can be addressed by adding an interaction variable, created by multiplying income with the dummy variable for student. Our model now becomes \[
y_i = \beta_0 + \beta_1\cdot\texttt{income}_i + \beta_2 x_{i2} + \beta_3\cdot\left(\texttt{income}_i\times x_{i2}\right) + \epsilon_i.
\]
Now we have different intercepts for students and non-students but also different slopes for these groups.
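In R, such an interaction is obtained with the * formula operator; a minimal sketch with the ISLR2 Credit data (assuming the variable names Balance, Income, and Student used there):
library("ISLR2")
data(Credit)
## Income * Student expands to Income + Student + Income:Student
lm_int <- lm(Balance ~ Income * Student, data = Credit)
coef(lm_int)
## Intercept and slope for non-students: beta_0 and beta_1
## Intercept and slope for students: beta_0 + beta_2 and beta_1 + beta_3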
Polynomial Regression: Non-linear Relationships
Polynomial regression allows us to accommodate non-linear relationships between the predictors \(X\) and the outcome \(Y.\)
For example, the points in Figure 3.8 seem to have a quadratic shape, suggesting that a model of the form \[
\texttt{mpg} = \beta_0 + \beta_1\cdot\texttt{horsepower} + \beta_2\cdot\texttt{horsepower}^2 + \epsilon
\] may provide a better fit.
This regression model involves predicting mpg using a non-linear function of horsepower.
Important
But it is still a linear model! It’s simply a multiple linear regression model \[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon
\] with
\(X_1=\)horsepower and
\(X_2 =(\)horsepower\()^2\)
as the predictor variables.
Since this is nothing but a multiple linear regression model, we can use standard linear regression software to estimate \(\beta_0\), \(\beta_1\), and \(\beta_2\) in order to fit the (quadratic) non-linear regression function.
2.3.3 Potential Problems
1. Non-linearity of the response-predictor relationships.
Diagnostic residual plots are most useful to detect possible non-linear response-predictor relationships.
library("ISLR2")data(Auto) ## Gives the variable names in the Auto dataset# names(Auto)## Simple linear regressionlmobj_1 <-lm(mpg ~ horsepower, data = Auto)## Quadratic regression lmobj_2 <-lm(mpg ~ horsepower +I(horsepower^2), data = Auto)## Diagnostic Plotpar(mfrow =c(1,2))plot(lmobj_1, which =1)plot(lmobj_2, which =1)
Plotting residuals versus fitted values, i.e. plotting the data pairs \[
(\hat{y}_i,\; \underbrace{y_i - \hat{y}_i}_{=e_i}),\quad\text{for all}\quad i=1,\dots,n
\] is a useful graphical tool for identifying non-linearity which works for the
simple linear regression model with \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_{i1}\) and the
multiple linear regression model with \(\hat{y}_i=\hat{\beta}_0+\sum_{j=1}^p\hat{\beta}_jx_{ij}\)
If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as \[
\log(X),\; \sqrt{X},\; \text{or}\; X^2
\] in the regression model.
2. Correlation of Error Terms
An important assumption of the linear regression model is that the error terms, \[
\epsilon_1, \epsilon_2, \dots , \epsilon_n,
\] are independent and thus uncorrelated. What does this mean? For instance, if the errors are uncorrelated, then the fact that \(\epsilon_i\) is positive provides little or no information about the sign of \(\epsilon_{i+1}.\)
Auto-correlations among the error terms typically occur in time series data. Figure 3.10 shows time-series of residuals with
no auto-correlation (\(\rho=0\))
intermediate auto-correlation (\(\rho=0.5\))
strong auto-correlation (\(\rho=0.9\))
3. Non-Constant Variance of Error Terms (Heteroskedasticity)
Another important assumption of the linear regression model is that the error terms have a constant variance, \[
Var(\epsilon_i) = \sigma^2,\quad\text{for all}\quad i=1,\dots,n.
\]
One can identify non-constant variances (“heteroskedasticity”) in the errors, using diagnostic residual plots.
Often one observes that the magnitude of the scattering of the residuals tends to increase with the fitted values. When faced with this problem, one possible solution is to transform the response \(Y\) using a concave function such as \[
\log(Y)\;\text{ or }\; \sqrt{Y}.
\] Such a transformation results in a greater amount of shrinkage of the larger responses.
## Quadratic regression
lmobj_2 <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
## Quadratic regression with transformed response log(Y)
lmobj_3 <- lm(I(log(mpg)) ~ horsepower + I(horsepower^2), data = Auto)
## Diagnostic Plot
par(mfrow = c(1, 2))
plot(lmobj_2, which = 1)
plot(lmobj_3, which = 1)
Danger
The standard formulas for
standard errors,
confidence intervals, and
hypothesis tests
in this chapter are based on the assumption of
uncorrelated error terms \(Cor(\epsilon_i,\epsilon_j)=0\) for all \(i\neq j\)
with equal variances \(Var(\epsilon_i)=\sigma^2\) for all \(i=1,\dots,n.\)
If in fact there is correlation and/or heteroskedasticity among the error terms, then the estimated standard errors will be wrong leading to invalid inferences.
Thus, if the error terms are auto-correlated and/or heteroskedastic, we need to take this into account by using so-called auto-correlation and/or heteroskedasticity robust standard errors.
The R package sandwich contains such robust standard error estimators.
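A minimal sketch of how such robust standard errors can be obtained, using the sandwich package together with coeftest() from the lmtest package (lm_obj stands for any fitted lm() model, e.g. the one from above):
library("sandwich")
library("lmtest")
## Heteroskedasticity-robust (HC) standard errors:
coeftest(lm_obj, vcov. = vcovHC(lm_obj, type = "HC3"))
## Autocorrelation- and heteroskedasticity-robust (HAC) standard errors:
coeftest(lm_obj, vcov. = NeweyWest(lm_obj))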
4. Outliers
An outlier is a point \(i\) for which \(y_i\) is far from the value \(\hat{y}_i\) predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.
Outliers typically have a strong effect on the \(R^2\) value since they add a very large residual to its computation.
Harmless Outlier: Figure 3.12 shows a clear outlier (observation 20) which, however, has a typical predictor value \(x_i\); i.e. the \(x_i\)-value is right in the center of all predictor values. Such outliers have little effect on the regression fit.
Harmful Outlier: Figure 3.13 shows again a clear outlier (observation 41) which, however, has a predictor value \(x_i\) that is very atypical. Such outliers are said to have large leverage giving them power to affect the regression fit considerably.
Tip
Critical outliers have both
large residuals
and
large leverage.
5. High Leverage Points
In order to quantify an observation’s leverage, we compute the leverage statistic \(h_i\) for each observation \(i=1,\dots,n.\) A large value of this statistic indicates an observation with high leverage.
In the case of the simple linear regression model, \[
h_i = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{i'=1}^n(x_{i'}-\bar{x})^2}.
\]
In the case of the multiple linear regression model, \(h_i\) is the \(i\)th diagonal value of the \((n\times n)\)-dimensional “hat-matrix” \[
H=X(X'X)^{-1}X'.
\]
The leverage statistic \(h_i\) is always between \(1/n\) and \(1.\)
The average leverage for all the observations is equal to \[
\bar{h}=\frac{1}{n}\sum_{i=1}^n h_i=(p + 1)/n.
\]
If a given observation has a leverage statistic \(h_i\) that greatly exceeds the average leverage value, \((p+1)/n,\) then we may suspect that the corresponding point has high leverage.
6. Collinearity
Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
In the following example, the variables Age and Limit are essentially unrelated, but the variables Rating and Limit are closely related to one another.
library("ISLR2")data(Credit) # names(Credit)par(mfrow=c(1,2))plot(y = Credit$Age, x = Credit$Limit, main ="No Collinearity", ylab ="Age", xlab ="Limit")plot(y = Credit$Rating, x = Credit$Limit, main ="Strong Collinearity", ylab ="Rating", xlab ="Limit")
The left panel of Figure 3.15 shows that, in the case of unrelated predictors (Age and Limit), the least squares problem has a well-identified minimum \((\hat\beta_{Age},\hat\beta_{Limit}).\)
The right panel of Figure 3.15 shows that, in the case of collinear predictors (Rating and Limit), the least squares minimum \((\hat\beta_{Rating},\hat\beta_{Limit})\) is not well identified: one can trade off values of \(\hat\beta_{Limit}\) against values of \(\hat\beta_{Rating},\) ending up in new pairs \((\hat\beta_{Rating}',\hat\beta_{Limit}')\) with essentially the same RSS value as that achieved by the minimizer \((\hat\beta_{Rating},\hat\beta_{Limit})\).
Table 3.11 demonstrates that this identification problem between the collinear predictors (Rating and Limit) inflates the variance (the squared standard error) of the estimators \(\hat\beta_{Rating}\) and \(\hat\beta_{Limit}\); for example:
In Model 1: \(Var(\hat\beta_{Limit}) = 0.005^2=0.000025\)
In Model 2: \(Var(\hat\beta_{Limit}) = 0.064^2=0.004096\)
We call this situation multicollinearity.
To detect multicollinearity issues, one can use the variance inflation factor (VIF) \[
\operatorname{VIF}(\hat{\beta}_j)=\frac{1}{1-R^2_{X_j|X_{-j}}},
\] where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) from a regression of \(X_j\) onto all of the other predictors.
If \(R^2_{X_j|X_{-j}}\) is close to one, then multicollinearity is present, and \(\operatorname{VIF}(\hat{\beta}_j)\) will be large.
In the Credit data, one gets for the predictors age, rating, and limit the following VIF values:
1.01 (age)
160.67 (rating)
160.59 (limit)
Thus, as we suspected, there is considerable collinearity in the data!
Possible solutions:
Drop one of the problematic variables from the regression. This can usually be done without much compromise to the regression fit, since the presence of collinearity implies that the information that this variable provides about the response is redundant in the presence of the other variables. Caution: In econometrics, dropping control variables is generally not a good idea since control variables are there to rule out possible issues with omitted variables biases.
Combine the collinear variables together into a single predictor. For instance, we might take the average of standardized versions of limit and rating in order to create a new variable that measures credit worthiness.
Use a different estimation procedure like ridge regression.
Live with it. At least you know where the large standard errors are coming from.
2.4 Comparison of Linear Regression with K-Nearest Neighbors
Linear regression is an example of a parametric approach because it assumes a linear model form for \(f(X).\)
Advantages of parametric approaches:
Typically easy to fit
Simple interpretation
Simple inference
Disadvantages of parametric approaches:
The parametric model assumption can be far from true; i.e. \[
f(X) \neq \beta_0+\beta_1X_1+\dots+\beta_pX_p
\]
Alternative: Non-parametric methods such as K-nearest neighbors regression since non-parametric approaches do not explicitly assume a parametric form for \(f(X).\)
K-nearest neighbors regression (KNN regression)
Given a value for \(K\) and a prediction point \(x_0,\) KNN regression computes \(\hat{f}(x_0)\) in two steps (see the code sketch below):
identify the \(K\) training observations that are closest to \(x_0\), represented by the index set \(\mathcal{N}_0\subset\{1,2,\dots,n_{Train}\}.\)
estimate \(f(x_0)\) using the average of all the training responses \(y_i\) with \(i\in\mathcal{N}_0,\) i.e. \[
\hat{f}(x_0)=\frac{1}{K}\sum_{i\in\mathcal{N}_0}y_i.
\]
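A minimal, purely illustrative implementation of these two steps in base R (our own sketch for one predictor; efficient implementations are available, e.g., in the FNN package):
knn_reg <- function(x0, x_train, y_train, K){
  ## Step 1: indices of the K training observations closest to x0
  ind_N0 <- order(abs(x_train - x0))[1:K]
  ## Step 2: average of the corresponding training responses
  mean(y_train[ind_N0])
}
## Usage with the simulated data from above:
knn_reg(x0 = 0, x_train = data_sim$X, y_train = data_sim$Y, K = 9)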
The left panel of Figure 3.16 shows the estimation result for \(K=1\) and the right panel for \(K=9.\)
In general, the optimal value for \(K\) will depend on the bias-variance tradeoff, which we introduced in Chapter 1:
A small value for \(K\) provides the most flexible fit, which will have low bias but high variance. This variance is due to the fact that the prediction in a given region is entirely dependent, e.g., on just one observation if \(K=1\).
A large value of \(K\) provides a less flexible fit. The prediction in a region is an average of several points, and so changing one observation has a smaller effect. However, the smoothing may cause bias by masking some of the structure in \(f(X).\)
An optimal value of \(K\) can be chosen using, e.g., cross-validation; see Chapter 4.
Generally, the parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of \(f\) and vice versa.
Figure 3.17 provides an example with data generated from a one-dimensional linear regression model:
black solid lines: true \(f(x)\)
blue curves: KNN fits \(\hat{f}(x)\) using \(K = 1\) (left plot) and \(K = 9\) (right plot).
Observations:
The KNN fit \(\hat{f}(x)\) using \(K = 1\) is far too wiggly
The KNN fit \(\hat{f}(x)\) using \(K = 9\) is much closer to the true \(f(X).\)
However, since the true regression function is here linear, it is hard for a non-parametric approach to compete with simple linear regression: a non-parametric approach incurs a cost in variance that is here not offset by a reduction in bias.
The blue dashed line in the left-hand panel of Figure 3.18 represents the simple linear regression fit to the same data. It is almost perfect. The right-hand panel of Figure 3.18 reveals that linear regression outperforms KNN for this data.
Figure 3.19 displays a non-linear situation in which KNN performs much better than simple linear regression.
Curse of Dimensionality
Unfortunately, in higher dimensions, KNN often performs worse than simple/multiple linear regression, since non-parametric approaches suffer from the curse of dimensionality.
Figure 3.20 considers the same strongly non-linear situation as in the second row of Figure 3.19, except that we have added additional noise (i.e. redundant) predictors that are not associated with the response.
When \(p = 1\) or \(p = 2,\) KNN outperforms linear regression.
But for \(p = 3\) the results are mixed, and for \(p\geq 4\) linear regression is superior to KNN.
Observations:
When \(p=1\), a sample size of \(n=50\) can provide enough information to estimate \(f(X)\) accurately using non-parametric methods since the \(K\) nearest neighbors can actually be close to a given test observation \(x_0.\)
However, when spreading the \(n=50\) data points over a large number of, for instance, \(p=20\) dimensions, the \(K\) nearest neighbors tend to become far away from \(x_0.\)
2.5 R-Lab: Linear Regression
2.5.1 Libraries
The library() function is used to load libraries, or groups of functions and data sets that are not included in the base R distribution. Basic functions that perform least squares linear regression and other simple analyses come standard with the base distribution, but more exotic functions require additional libraries.
Here we load the MASS package, which is a very large collection of data sets and functions. We also load the ISLR2 package, which includes the data sets associated with this book.
If you receive an error message when loading any of these libraries, it likely indicates that the corresponding library has not yet been installed on your system. Some libraries, such as MASS, come with R and do not need to be separately installed on your computer. However, other packages, such as ISLR2, must be downloaded the first time they are used.
This can be done, for instance, at the R command line via install.packages("ISLR2") function. This installation only needs to be done the first time you use a package. However, the library() function must be called within each R session.
2.5.2 Simple Linear Regression
The ISLR2 library contains the Boston data set, which records medv (median house value) for \(506\) census tracts in Boston. We will seek to predict medv using \(12\) predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
To find out more about the data set, we can type ?Boston.
We will start by using the lm() function to fit a simple linear regression model, with medv as the response and lstat as the predictor. The basic syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept.
lm.fit <- lm(medv ~ lstat)
Error in eval(predvars, data, env): object 'medv' not found
The command causes an error because R does not know where to find the variables medv and lstat.
The next line tells R that the variables are in Boston:
lm.fit <- lm(medv ~ lstat, data = Boston)
Alternatively, we can attach the Boston object:
attach(Boston)
lm.fit <- lm(medv ~ lstat)
If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us \(p\)-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the model.
For instance, the observed value of the 95% confidence interval associated with a lstat value of \(10,\) i.e. associated with \[
\beta_0 + \beta_1 \cdot 10
\] is \[
\operatorname{CI}_{\beta_0 + \beta_1 \cdot 10,obs}=[24.47, 25.63]
\] with observed prediction value \[
25.05335 = \hat\beta_0 + \hat\beta_1 \cdot 10.
\]
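These numbers can be reproduced with the predict() function; a sketch (interval = "confidence" gives the confidence interval for the mean response):
predict(lm.fit, data.frame(lstat = 10), interval = "confidence")
## For a prediction interval (for an individual response) instead:
predict(lm.fit, data.frame(lstat = 10), interval = "prediction")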
We will now plot medv and lstat along with the least squares regression line using the plot() and abline() functions.
plot(lstat, medv)
abline(lm.fit)
There is some evidence for non-linearity in the relationship between lstat and medv. We will explore this issue later in this lab.
The abline() function can be used to draw any line, not just the least squares regression line. To draw a line with intercept a and slope b, we type abline(a, b). Below we experiment with some additional settings for plotting lines and points. The lwd = 3 command causes the width of the regression line to be increased by a factor of 3; this works for the plot() and lines() functions also. We can also use the pch option to create different plotting symbols.
plot(lstat, medv)
abline(lm.fit, lwd = 3, col = "red")
plot(lstat, medv, col = "red")
plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+")
plot(1:20, 1:20, pch = 1:20)
Next we examine some diagnostic plots. Four diagnostic plots are automatically produced by applying the plot() function directly to the output from lm(). In general, this command will produce one plot at a time, and hitting Enter will generate the next plot. However, it is often convenient to view all four plots together. We can achieve this by using the par() and mfrow() functions, which tell R to split the display screen into separate panels so that multiple plots can be viewed simultaneously. For example, par(mfrow = c(2, 2)) divides the plotting region into a \(2 \times 2\) grid of panels.
par(mfrow = c(2, 2))
plot(lm.fit)
Alternatively, we can compute the residuals from a linear regression fit using the residuals() function. The function rstudent() will return the studentized residuals, and we can use this function to plot the residuals against the fitted values.
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
On the basis of the residual plots, there is some evidence of non-linearity.
Leverage statistics can be computed using the hatvalues() function.
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
375
375
The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.
sort(hatvalues(lm.fit), decreasing = TRUE)[1:3]
375 415 374
0.02686517 0.02495670 0.02097101
The sort() function can be used to sort and print values of a vector like hatvalues(lm.fit).
2.5.3 Multiple Linear Regression
In order to fit a multiple linear regression model using least squares, we again use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors, x1, x2, and x3. The summary() function now outputs the regression coefficients for all the predictors.
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
Call:
lm(formula = medv ~ lstat + age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.981 -3.978 -1.283 1.968 23.158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
lstat -1.03207 0.04819 -21.416 < 2e-16 ***
age 0.03454 0.01223 2.826 0.00491 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
The Boston data set contains 12 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
We can access the individual components of a summary object by name (type ?summary.lm to see what is available). Hence summary(lm.fit)$r.sq gives us the \(R^2\), and summary(lm.fit)$sigma gives us the RSE. The vif() function, part of the car package, can be used to compute variance inflation factors. Most VIF’s are low to moderate for this data. The car package is not part of the base R installation so it must be downloaded the first time you use it via the install.packages() function in R.
suppressPackageStartupMessages(library(car)) # contains the vif() function
sort(vif(lm.fit)) # computes the VIF statistics and sorts them
chas black crim ptratio rm zn lstat age
1.073995 1.348521 1.792192 1.799084 1.933744 2.298758 2.941491 3.100826
dis indus nox rad tax
3.955945 3.991596 4.393720 7.484496 9.008554
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high \(p\)-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.
lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)
2.5.4 Interaction Terms
It is easy to include interaction terms in a linear model using the lm() function. The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat * age simultaneously includes lstat, age, and the interaction term lstat\(\times\)age as predictors; it is a shorthand for lstat + age + lstat:age.
summary(lm(medv ~ lstat * age, data = Boston))
Call:
lm(formula = medv ~ lstat * age, data = Boston)
Residuals:
Min 1Q Median 3Q Max
-15.806 -4.045 -1.333 2.085 27.552
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
age -0.0007209 0.0198792 -0.036 0.9711
lstat:age 0.0041560 0.0018518 2.244 0.0252 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
2.5.5 Non-linear Transformations of the Predictors
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor \(X\), we can create a predictor \(X^2\) using I(X^2). The function I() is needed since the ^ has a special meaning in a formula object; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat^2.
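The corresponding command is not shown above; reconstructed from the Call line in the output below (Boston is still attached, so no data argument is needed), it reads:
lm.fit2 <- lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)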
Call:
lm(formula = medv ~ lstat + I(lstat^2))
Residuals:
Min 1Q Median 3Q Max
-15.2834 -3.8313 -0.5295 2.3095 25.4148
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.862007 0.872084 49.15 <2e-16 ***
lstat -2.332821 0.123803 -18.84 <2e-16 ***
I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
The near-zero \(p\)-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
lm.fit <- lm(medv ~ lstat)
anova(lm.fit, lm.fit2)
Analysis of Variance Table
Model 1: medv ~ lstat
Model 2: medv ~ lstat + I(lstat^2)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 504 19472
2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat^2. The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior.
Here the \(F\)-statistic is \(135\) and the associated \(p\)-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat.
This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat.
If we type
par(mfrow = c(2, 2))
plot(lm.fit2)
then we see that when the lstat^2 term is included in the model, there is little discernible pattern in the residuals.
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm(). For example, the following command produces a fifth-order polynomial fit:
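(A reconstruction of that command, following the ISLR lab; the object name lm.fit5 is the one used in the textbook, and Boston is again assumed to be attached.)

lm.fit5 <- lm(medv ~ poly(lstat, 5))
summary(lm.fit5)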
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant \(p\)-values in a regression fit.
By default, the poly() function orthogonalizes the predictors: this means that the features output by this function are not simply a sequence of powers of the argument. However, a linear model applied to the output of the poly() function will have the same fitted values as a linear model applied to the raw polynomials (although the coefficient estimates, standard errors, and p-values will differ). In order to obtain the raw polynomials from the poly() function, the argument raw = TRUE must be used.
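A quick check of this claim (a small sketch, not part of the original lab):

fit_orth <- lm(medv ~ poly(lstat, 2), data = Boston)
fit_raw  <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston)
all.equal(fitted(fit_orth), fitted(fit_raw)) # TRUE: identical fitted values
coef(fit_orth) # differs from coef(fit_raw), since the bases differ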
Of course, we are in no way restricted to using polynomial transformations of the predictors. Here we try a log transformation.
summary(lm(medv ~ log(rm), data = Boston))
Call:
lm(formula = medv ~ log(rm), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-19.487 -2.875 -0.104 2.837 39.816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -76.488 5.028 -15.21 <2e-16 ***
log(rm) 54.055 2.739 19.73 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared: 0.4358, Adjusted R-squared: 0.4347
F-statistic: 389.3 on 1 and 504 DF, p-value: < 2.2e-16
2.5.6 Qualitative Predictors
We will now examine the Carseats data, which is part of the ISLR2 library. We will attempt to predict Sales (child car seat sales) in \(400\) locations based on a number of predictors.
head(Carseats)
Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1 9.50 138 73 11 276 120 Bad 42 17
2 11.22 111 48 16 260 83 Good 65 10
3 10.06 113 35 10 269 80 Medium 59 12
4 7.40 117 100 4 466 97 Medium 55 14
5 4.15 141 64 3 340 128 Bad 38 13
6 10.81 124 113 13 501 72 Bad 78 16
Urban US
1 Yes Yes
2 Yes Yes
3 Yes Yes
4 Yes Yes
5 Yes No
6 No Yes
The Carseats data includes qualitative predictors such as ShelveLoc, an indicator of the quality of the shelving location (that is, the space within a store in which the car seat is displayed) at each location. The predictor ShelveLoc takes on three possible values: Bad, Medium, and Good. Given a qualitative variable such as ShelveLoc, R generates dummy variables automatically. Below we fit a multiple regression model that includes some interaction terms.
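The fit follows the ISLR lab (a reconstruction; the interaction terms Income:Advertising and Price:Age are those used in the textbook):

lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
summary(lm.fit)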
The contrasts() function returns the coding that R uses for the dummy variables.
attach(Carseats)
contrasts(ShelveLoc)
Good Medium
Bad 0 0
Good 1 0
Medium 0 1
Use ?contrasts to learn about other contrasts, and how to set them.
R has created a ShelveLocGood dummy variable that takes on a value of 1 if the shelving location is good, and 0 otherwise. It has also created a ShelveLocMedium dummy variable that equals 1 if the shelving location is medium, and 0 otherwise. A bad shelving location corresponds to a zero for each of the two dummy variables. The fact that the coefficient for ShelveLocGood in the regression output is positive indicates that a good shelving location is associated with high sales (relative to a bad location). And ShelveLocMedium has a smaller positive coefficient, indicating that a medium shelving location is associated with higher sales than a bad shelving location but lower sales than a good shelving location.
2.5.7 Writing Functions
As we have seen, R comes with many useful functions, and still more functions are available by way of R libraries. However, we will often be interested in performing an operation for which no function is available. In this setting, we may want to write our own function. For instance, below we provide a simple function that reads in the ISLR2 and MASS libraries, called LoadLibraries(). Before we have created the function, R returns an error if we try to call it.
LoadLibraries
Error in eval(expr, envir, enclos): object 'LoadLibraries' not found
LoadLibraries()
Error in LoadLibraries(): could not find function "LoadLibraries"
We now create the function.
LoadLibraries <- function() {
  library(ISLR2)
  library(MASS)
  print("The libraries have been loaded.")
}
Now if we type in LoadLibraries, R will tell us what is in the function.
LoadLibraries
function() {
library(ISLR2)
library(MASS)
print("The libraries have been loaded.")
}
If we call the function, the libraries are loaded in and the print statement is output.
LoadLibraries()
[1] "The libraries have been loaded."
2.6 Exercises
Prepare the following exercises of Chapter 3 in our course textbook ISLR:
Exercise 1
Exercise 2
Exercise 3
Exercise 8
Exercise 9
2.7 Solutions
Exercise 1
1 a) Describe the null hypotheses to which the \(p\)-values given in Table 3.4 correspond.
1 b) Explain what conclusions you can draw based on these \(p\)-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.
Answers:
1 a) In Table 3.4, the null hypothesis for TV is that in the presence of radio ads and newspaper ads, TV ads have no effect on sales. Similarly, the null hypothesis for radio is that in the presence of TV ads and newspaper ads, radio ads have no effect on sales, and the null hypothesis for newspaper is that in the presence of TV ads and radio ads, newspaper ads have no effect on sales.
1 b) The low p-values of TV and radio allow us to reject the “no effect” null hypotheses for TV and radio. Hence, we believe that
TV ads have an effect on sales in the presence of radio and newspaper ads.
radio ads have an effect on sales in the presence of TV and newspaper ads.
The high p-value of newspaper does not allow us to reject the “no effect” null hypothesis. This is an inconclusive result: it only says that any possible effect of newspaper ads is not large enough to stand out from the estimation error.
Remember
An insignificant hypothesis test result is never informative about whether the tested null hypothesis is true. We have no error control for falsely accepting the null hypothesis, i.e. for type II errors. We only have error control (via the significance level) for falsely rejecting the null hypothesis, i.e. for type I errors.
Exercise 2
Carefully explain the main difference between the KNN classifier and KNN regression methods.
Answer:
The KNN classifier and KNN regression are closely related in form: both locate the \(K\) training observations nearest to a given point \(x_0\). However, the KNN classifier outputs a qualitative class label for \(Y\) at \(x_0\) (typically by majority vote among the \(K\) nearest neighbors), whereas KNN regression outputs a quantitative estimate of \(f(x_0)\) (typically the average of the neighbors' responses).
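A minimal base-R sketch of the distinction (simulated data; knn_predict is a hypothetical helper, not a textbook or package function):

## One predictor, plus a qualitative and a quantitative response
set.seed(1)
x_train <- runif(100, min = 0, max = 1)
y_class <- factor(ifelse(x_train > 0.5, "A", "B"))      # qualitative response
y_num   <- sin(2 * pi * x_train) + rnorm(100, sd = 0.1) # quantitative response

knn_predict <- function(x0, x, y, K = 5) {
  idx <- order(abs(x - x0))[1:K]     # indices of the K nearest neighbours
  if (is.factor(y)) {
    names(which.max(table(y[idx])))  # classifier: majority vote
  } else {
    mean(y[idx])                     # regression: average of neighbours
  }
}

knn_predict(0.3, x_train, y_class) # returns a class label ("A" or "B")
knn_predict(0.3, x_train, y_num)   # returns a numeric estimate of f(0.3)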
Exercise 3
Suppose we have a data set with five predictors:
\(X_1 =GPA\)
\(X_2 = IQ\)
\(X_3 = Gender\) (\(1\) for Female and \(0\) for Male)
\(X_4 =\) Interaction between \(GPA\) and \(IQ\)
\(X_5 =\) Interaction between \(GPA\) and \(Gender\)
The response variable (in thousands of dollars) is defined as:
\(Y =\) starting salary after graduation
Suppose we use least squares to fit the model, and get (as given in ISLR) \(\hat\beta_0 = 50\), \(\hat\beta_1 = 20\), \(\hat\beta_2 = 0.07\), \(\hat\beta_3 = 35\), \(\hat\beta_4 = 0.01\), and \(\hat\beta_5 = -10\).
3 c) True or false: Since the coefficient for the GPA\(\times\)IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
Answer:
False. We must examine the \(p\)-value (or the \(t\)-statistic) of the regression coefficient to determine if the interaction term is statistically significant or not.
Exercise 8
This question involves the use of simple linear regression on the Auto data set.
8 a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results.
library("ISLR2")data("Auto")# Perform linear regressionlmObj_1 <-lm(mpg ~ horsepower, data=Auto)# Use summary function to print the resultssummary(lmObj_1)
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Comment on the output. For example:
i) Is there a relationship between the predictor and the response?
Answer:
Yes, there is. The predictor horsepower has a statistically significant (\(p<0.001\)) linear relationship with the response.
ii) How strong is the relationship between the predictor and the response?
Answer:
Statistical significance does not necessarily mean a practically strong or important relationship.
To quantify the strength of the relationship between the predictor and the response, we can look at the following quantities:
Residual Standard Error (RSE) (estimate of the standard deviation of \(\epsilon\)) in comparison to the RSE of the trivial linear regression model with only an intercept.
The \(R^2\) Statistic (the proportion of variance explained by the model)
The \(F\)-Statistic
The Residual Standard Error (RSE) of the regression model with intercept and horsepower as predictors is given by:
## RSE of lm(mpg ~ horsepower):
RSS <- sum(resid(lmObj_1)^2)
n <- length(resid(lmObj_1))
RSE <- sqrt(RSS / (n - 2))
round(RSE, 3)
This RSE value is considerably smaller than the RSE of a model with only an intercept:
lmObj_onlyIntercept <- lm(mpg ~ 1, data = Auto)
RSS_onlyIntercept <- sum(resid(lmObj_onlyIntercept)^2)
n <- length(resid(lmObj_onlyIntercept))
RSE_onlyIntercept <- sqrt(RSS_onlyIntercept / (n - 1))
round(RSE_onlyIntercept, 3)
[1] 7.805
Thus, the larger model with horsepower included explains more of the variance in the response variable mpg. Including horsepower as a predictor reduces the RSE by \((\mathrm{RSE}_{\text{onlyIntercept}} - \mathrm{RSE})/\mathrm{RSE}_{\text{onlyIntercept}}\cdot 100\%\), i.e. by about \(37.15\%\).
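In R, using the objects computed above:

round((RSE_onlyIntercept - RSE) / RSE_onlyIntercept * 100, 2) # percentage reduction in RSE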
The \(R^2\) value:
round(summary(lmObj_1)$r.squared, 2)
[1] 0.61
shows that about \(61\%\) of the variability in \(Y\) can be explained using an intercept and horsepower as predictors.
The value of the \(F\)-statistic
round(summary(lmObj_1)$fstatistic, 2)
value numdf dendf
599.72 1.00 390.00
is much larger than \(1\), which means that the linear regression model with intercept and horsepower fits the data significantly better than the trivial regression model with only an intercept.
iii) Is the relationship between the predictor and the response positive or negative?
Answer:
The relationship is negative, as we can see from the parameter estimate for horsepower
coef(lmObj_1)[2]
horsepower
-0.1578447
iv) What is the predicted mpg associated with a horsepower of \(98\)? What is the associated \(95\%\) confidence interval?
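Answer:

From the coefficients above, the point prediction is \(39.936 - 0.158\cdot 98\approx 24.47\) mpg. A sketch of how to obtain both the prediction and the intervals with predict():

predict(lmObj_1, newdata = data.frame(horsepower = 98),
        interval = "confidence") # 95% confidence interval (default level)
predict(lmObj_1, newdata = data.frame(horsepower = 98),
        interval = "prediction") # the (wider) 95% prediction interval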
8 c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
Answer:
par(mfrow = c(2, 2))
plot(lmObj_1, col = 'blue')
Looking at the smoothing line of the residuals (\(e_i=y_i−\hat{y}_i\)) vs. the fitted values (\(\hat{y}_i\)), there is a strong pattern in the residuals, indicating non-linearity. You can see evidence of this also in the scatter plot in the answer for question 8 b).
There also appears to be non-constant variance in the error terms (heteroscedasticity), but this may be corrected to some extent by a quadratic fit. If not, transformations such as \(\log(y)\) or \(\sqrt{y}\) shrink larger responses by a greater amount and can reduce this issue; see the quick check after the next point.
There are some observations with large standardized residuals and high leverage (hence, high Cook's distance) that we need to review.
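A quick check of the log-transform suggestion (a sketch, not part of the original solution):

fit_log <- lm(log(mpg) ~ horsepower, data = Auto)
par(mfrow = c(2, 2))
plot(fit_log) # compare the residual plots with those of lmObj_1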
Exercise 9
This question involves the use of multiple linear regression on the Auto data set.
9 a) Produce a scatterplot matrix which includes all of the variables in the data set.
Answer:
library("ISLR2")data("Auto")# Produce scatterplot matrixpairs(Auto)
9 b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
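Answer:

A minimal sketch (excluding the qualitative name column by name):

cor(subset(Auto, select = -name))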
9 c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output by answering the below questions 9 c i) to 9 c iii).
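The output shown below was produced by a chunk along these lines (a reconstruction; the object name fit.lm is taken from the plotting call in 9 d)):

fit.lm <- lm(mpg ~ . - name, data = Auto)
summary(fit.lm)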
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
9 c i) Is there a relationship between the predictors and the response?
Answer:
Yes, there is a relationship between the predictors and the response. Testing the null hypothesis that all regression coefficients (except the intercept) are zero (i.e. H\(_0\): \(\beta_1=\dots=\beta_7=0\)), we see that the \(F\)-statistic is large and its \(p\)-value is close to zero, providing evidence against the null hypothesis.
9 c ii) Which predictors appear to have a statistically significant relationship to the response?
Answer:
Looking at the \(p\)-values associated with each predictor’s \(t\)-statistic, we see that displacement, weight, year, and origin have a statistically significant relationship, while cylinders, horsepower, and acceleration do not.
Caution: This consideration neglects issues due to multiple testing. When testing at the significance level \(\alpha=0.05\), each single test has a type I error rate (falsely rejecting H\(_0\)) of up to \(5\%\). These error rates accumulate across the seven simultaneous hypothesis tests, so the probability of seeing at least one type I error among the seven tests is up to \(7\cdot 5\%=35\%\). It is thus quite likely that at least one type I error occurs.
Bonferroni correction for multiple testing: To declare any of the seven predictors statistically significant, the corresponding \(p\)-value must be smaller than \(\alpha/7\). With \(\alpha/7=0.05/7\approx 0.007\), only weight, year, and origin have a statistically significant relationship to the response.
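In R, Bonferroni-adjusted \(p\)-values can be obtained with p.adjust() (a sketch; fit.lm is the multiple regression object from 9 c)):

p_raw <- summary(fit.lm)$coefficients[-1, "Pr(>|t|)"] # p-values, intercept dropped
round(p.adjust(p_raw, method = "bonferroni"), 4)      # compare to alpha = 0.05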
9 c iii) What does the coefficient for the year variable suggest?
Answer:
The regression coefficient for year suggests that, on average, a one-year-newer year of construction is associated with an increase in mpg of \(0.75\), holding every other predictor constant.
9 d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
Answer:
par(mfrow = c(4, 1))
plot(fit.lm)
The “Residuals vs Fitted” plot (1st plot) shows some systematic deviations of the residuals from \(0\). The reason is that we impose a straight-line (more precisely, hyperplane) fit for the conditional mean function \(E[Y|X]=f(X)\), which appears non-linear here. This results in a systematic underestimation of the true conditional mean function for small and large fitted values \(\hat{y}=\hat\beta_0+\hat\beta_1x_1+\dots+\hat\beta_px_p\).
The “Normal Q-Q” plot (2nd plot) suggests non-normally distributed residuals; in particular, the upper tail deviates from that of a normal distribution.
The “Residuals vs Leverage” plot (4th plot) shows some potential outliers: observations whose standardized residuals lie below \(-2\) or above \(+2\). Moreover, the plot also shows potentially problematic high-leverage points, with leverage values heavily exceeding the rule-of-thumb threshold \((p+1)/n=8/392\approx 0.02\). Points with simultaneously high leverage and large absolute standardized residuals should be handled with care, since they may distort the estimation.
The “Scale-Location” plot (3rd plot) is rather inconclusive about heteroscedasticity. However, the “Residuals vs Fitted” plot (1st plot) shows some clear signs of heteroscedastic residuals.
9 e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
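Answer:

The output below corresponds to a fit of this form (reconstructed from the Call shown; the object name fit.lm2 is an assumption):

fit.lm2 <- lm(mpg ~ horsepower + cylinders + year + weight * displacement, data = Auto)
summary(fit.lm2)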
Call:
lm(formula = mpg ~ horsepower + cylinders + year + weight * displacement,
data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.7530 -1.8228 -0.0602 1.5780 12.6133
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.210e+00 3.819e+00 -0.579 0.56316
horsepower -3.396e-02 9.560e-03 -3.552 0.00043 ***
cylinders 2.072e-01 2.914e-01 0.711 0.47756
year 7.858e-01 4.555e-02 17.250 < 2e-16 ***
weight -1.084e-02 6.346e-04 -17.076 < 2e-16 ***
displacement -7.947e-02 9.905e-03 -8.023 1.26e-14 ***
weight:displacement 2.431e-05 2.141e-06 11.355 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.976 on 385 degrees of freedom
Multiple R-squared: 0.8568, Adjusted R-squared: 0.8546
F-statistic: 384.1 on 6 and 385 DF, p-value: < 2.2e-16
Note that there is a difference between using A:B and A*B when running a regression. While the first includes only the interaction term between the variables A and B, the second also includes the stand-alone main effects: A*B expands to A + B + A:B.
Generally, you should follow the hierarchical principle for interaction effects: If we include an interaction in a model, we should also include the main effects, even if the \(p\)-values associated with their coefficients are not significant.
9 f)
Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
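Answer:

The output below corresponds to the following fit (reconstructed from the Call shown; the object name fit.lm3 is taken from the plotting chunk further below):

fit.lm3 <- lm(log(mpg) ~ . - name, data = Auto)
summary(fit.lm3)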
Call:
lm(formula = log(mpg) ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-0.40955 -0.06533 0.00079 0.06785 0.33925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
displacement 6.362e-04 2.690e-04 2.365 0.01852 *
horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
acceleration -1.348e-03 3.538e-03 -0.381 0.70339
year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1191 on 384 degrees of freedom
Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
## par(mfrow = c(4, 1))
plot(fit.lm3)
This model specification is much better!
No clear signs of systematic under- or over-estimation at given fitted values.
No clear signs of heteroscedastic residuals.
The normality assumption may be violated, but this is not problematic since the dataset is large, so a central limit theorem makes the estimators asymptotically normally distributed.
One high-leverage point, which, however, has a small residual.