1  Statistical Learning

Reading: Chapter 2 of our course textbook An Introduction to Statistical Learning

R-Codes for this Chapter

1.1 What is Statistical Learning?

Suppose that we observe a quantitative response \(Y\) and \(p\) different predictors, \(X_1, X_2, \dots, X_p.\)

We assume that there is some relationship between \(Y\) and \[ X = (X_1, X_2, \dots, X_p), \] which can be written in the very general form \[ Y = f(X) + \epsilon. \]

  • \(f\) is some fixed but unknown function of \(X = (X_1, X_2, \dots, X_p),\)
  • \(\epsilon\) is a random error term fulfilling the following two assumptions:
    • \(\epsilon\) and \(X\) are independent of each other
    • \(\epsilon\) has mean zero, \(E(\epsilon)=0.\)

In this formulation, \(f\) represents the systematic information that \(X\) provides about \(Y.\)

In essence, statistical learning refers to a set of approaches for estimating \(f.\) In this chapter we outline some of the key theoretical concepts that arise in estimating \(f,\) as well as tools for evaluating the estimates obtained.

1.1.1 Why Estimate \(f\)?

There are two main reasons that we may wish to estimate \(f\):

  • prediction and
  • inference.

We discuss each in turn.

Prediction

In many situations, a set of inputs \(X\) are readily available, but the output \(Y\) cannot be easily obtained. In this setting, since the error term averages to zero, we can predict \(Y\) using \[ \hat{Y} = \hat{f}(X), \]

  • \(\hat{f}\) represents our estimate for \(f\)
  • \(\hat{Y}\) represents the resulting prediction for \(Y\)

In this setting, \(\hat{f}\) is often treated as a black box, in the sense that one is not typically concerned with the exact form of \(\hat{f},\) provided that it yields accurate predictions for \(Y.\)

Example: As an example, suppose that \((X_1, X_2, \dots, X_p)\) are characteristics of a patient’s blood sample that can be easily measured in a lab, and \(Y\) is a variable encoding the patient’s risk for a severe adverse reaction to a particular drug. It is natural to seek to predict \(Y\) using \(X,\) since we can then avoid giving the drug in question to patients who are at high risk of an adverse reaction, that is, patients for whom the estimate of \(Y\) is high.

Accuracy of a Prediction

The accuracy of \(\hat{Y}\) as a prediction for \(Y\) depends on two quantities:

  • the reducible error and
  • the irreducible error.

In general, \(\hat{f}\) will not be a perfect estimate for \(f,\) and this inaccuracy will introduce some error. This error is reducible, because we can potentially improve the accuracy of \(\hat{f}\) by using the most appropriate statistical learning technique to estimate \(f.\)

However, even if it were possible to form a perfect estimate for \(f,\) so that our estimated response took the form \[ \hat{Y} = f (X), \] our prediction would still have some error in it! This is because \(Y\) is also a function of \(\epsilon\) which, by definition, cannot be predicted using \(X.\) Therefore, variability associated with \(\epsilon\) also affects the accuracy of our predictions. This is known as the irreducible error, because no matter how well we estimate \(f,\) we cannot reduce the error introduced by \(\epsilon.\)

Consider a given estimate \(\hat{f}\) and a given set of predictors \(X,\) which yields the prediction \(\hat{Y} = \hat{f}(X).\) Assume for a moment that both \(\hat{f}\) and \(X\) are fixed, so that the only variability comes from \(\epsilon.\) Then, it is easy to show that \[\begin{align*} \overbrace{E\left[(Y - \hat{Y})^2\right]}^{\text{Mean Squared (Prediction) Error}} =\underbrace{\left(f(X) -\hat{f}(X)\right)^2}_{\text{reducible}} + \underbrace{Var\left(\epsilon\right)}_{\text{irreducible}}, \end{align*}\] where

  • \(E\left[(Y - \hat{Y})^2\right]\) represents the expected value of the squared difference between the predicted value \(\hat{Y}=\hat{f}(X)\) and the actual value of \(Y,\)
  • and \(Var(\epsilon)\) represents the variance associated with the error term \(\epsilon.\)

Derivation (for a given \(\hat{f}\) and a given \(X;\) i.e. only \(\epsilon\) is random):

\[\begin{align*} &E\left[(Y - \hat{Y})^2\right]\\ &=E\left[(f(X) + \epsilon - \hat{f}(X))^2\right] \\ &=E\left[\left(f(X) -\hat{f}(X)\right)^2 - 2\left(f(X) -\hat{f}(X)\right)\epsilon + \epsilon^2\right] \\ &=E\left[\left(f(X) -\hat{f}(X)\right)^2\right] - 2E\left[\left(f(X) -\hat{f}(X)\right)\epsilon\right] + E\left[\epsilon^2\right] \\ &=\left(f(X) -\hat{f}(X)\right)^2 - 2\left(f(X) -\hat{f}(X)\right) E\left[\epsilon\right] + E\left[\epsilon^2\right] \\ &=\left(f(X) -\hat{f}(X)\right)^2 - 2\left(f(X) -\hat{f}(X)\right) \cdot 0 + Var\left(\epsilon\right) \\ &=\underbrace{\left(f(X) -\hat{f}(X)\right)^2}_{\text{reducible}} + \underbrace{Var\left(\epsilon\right)}_{\text{irreducible}} \end{align*}\]

It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for \(Y;\) equivalently, it is a lower bound on the expected squared prediction error, \[ E\left[(Y - \hat{Y})^2\right] \geq Var\left(\epsilon\right). \] This bound is almost always unknown in practice.
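The following small R sketch illustrates this decomposition numerically. The function \(f,\) the deliberately imperfect estimate \(\hat{f},\) and the error standard deviation are assumptions made only for this illustration.

## Minimal sketch (assumed setup): f(x) = sin(2*pi*x), sd(eps) = 0.5, and a
## deliberately imperfect estimate f_hat(x) = x.
set.seed(1)
f     <- function(x) sin(2 * pi * x)  # "true" f (assumption for this sketch)
f_hat <- function(x) x                # some imperfect estimate of f
x0    <- 0.3                          # fixed predictor value
eps   <- rnorm(1e5, mean = 0, sd = 0.5)
Y     <- f(x0) + eps                  # realizations of Y at X = x0
Y_hat <- f_hat(x0)                    # fixed prediction

mean((Y - Y_hat)^2)                   # empirical mean squared prediction error
(f(x0) - f_hat(x0))^2 + 0.5^2         # reducible part + irreducible part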

Tip

The focus of this course is on techniques for estimating \(f\) with the aim of minimizing the reducible error.

(Parameter) Inference

We are often interested in understanding the association between \(Y\) and \(X_1,\dots,X_p.\) In this situation we wish to estimate \(f,\) but our goal is not necessarily to make predictions for \(Y.\) Now \(\hat{f}\) cannot be treated as a black box, because we need to know its exact form. In this setting, one may be interested in answering the following questions:

  • Which predictors are associated with the response?
  • What is the relationship between the response and each predictor?
  • Can the relationship between \(Y\) and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
Tip

In this course, we will see a number of examples that fall into the prediction setting, the inference setting, or a combination of the two.

1.2 How Do We Estimate \(f\)?

Setup: Consider the general regression model \[ Y=f(X)+\epsilon, \tag{1.1}\] where \[ X=(X_{1}, \dots,X_{p}) \] is a multivariate (\(p\)-dimensional) predictor.

Let \[ \{(X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\}, \tag{1.2}\] be a random sample from Equation 1.1, i.e.

  1. The multivariate, \(p+1\) dimensional, random vectors \[ (X_i,Y_i)\quad\text{and}\quad (X_j,Y_j) \] are independent of each other for all \(i=1,\dots,n\) and \(j=1,\dots,n\) with \(i\neq j.\)

  2. The multivariate, \(p+1\) dimensional, random vectors \((X_i,Y_i),\) \(i=1,\dots,n,\) have all the same distribution as \((X,Y),\) i.e. \[ (X_i,Y_i)\sim(X,Y) \] for all \(i=1,\dots,n.\)

The random sample Equation 1.2 is thus a set of \(n\) many independent and identically distributed (iid) multivariate random variables \[ \{(X_1,Y_1),(X_2,Y_2),\dots,(X_n,Y_n)\} \] with \[ (X_i,Y_i)\overset{\text{iid}}{\sim} (X,Y),\quad i=1,\dots,n. \]

An observed realization of the random sample will be denoted using lowercase letters \[ \{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}. \] These \(n\) observations are called training data because we will use these observations to train/learn \(\hat{f}\), i.e., to compute the estimate \(\hat{f}\) of the unknown \(f.\)

Goal of Statistical Learning

Our goal is to find (i.e. learn from training data) a function \(\hat{f}\) such that \[ Y \approx \hat{f}(X) \] for any observed realization of \((X, Y).\)

That is, the estimate \(\hat{f}\) needs to provide a good approximation \[ y_{i} \approx \hat{f}(x_{i}) \] not only for the observed training data points \[ (x_i,y_i),\quad i=1,\dots,n, \] but also for any possible new realization \((x_{new},y_{new})\) of \((X,Y)\) \[ y_{new} \approx \hat{f}(x_{new}). \]

Fitting the noise (irreducible component) in the training data will typically lead to bad approximations of new observations of a test data set.


Broadly speaking, most statistical learning methods for this task can be characterized as either

  • parametric or
  • non-parametric.

We discuss each in turn.

Parametric Methods

Parametric methods involve a two-step model-based estimation approach:

  1. First, we make an assumption about the functional form, or shape, of \(f.\) For example, a very simple, but often used assumption is that \(f\) is a linear function, i.e. \[ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p. \tag{1.3}\]

  2. After a model has been selected, we need a procedure that uses the training data to fit or train the model. For example, in the case of the linear model Equation 1.3, we need to estimate the parameters \(\beta_0,\beta_1,\dots,\beta_p\) such that \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

Most common estimation technique: (Ordinary) Least Squares (OLS)
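As a small illustration, the following sketch fits the linear model Equation 1.3 by OLS with R’s lm() function; the simulated (linear) data-generating process below is an assumption made only for this example.

## Sketch: estimating beta_0, beta_1, beta_2 by OLS on simulated training data
## (the linear data-generating process is assumed here just for illustration).
set.seed(123)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - 3 * X2 + rnorm(n)  # here the true f is indeed linear
ols_fit <- lm(Y ~ X1 + X2)            # fits the model by (ordinary) least squares
coef(ols_fit)                         # estimated parameters beta_0, beta_1, beta_2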

The parametric model-based approach reduces the problem of estimating \(f\) down to one of estimating a finite set of parameters \(\beta_0,\beta_1,\dots,\beta_p.\)

  • Pro: Simple to estimate
  • Con: Possible model misspecification (Why should we know the true shape/form of \(f\)?)

We can try to address the problem of model misspecification by choosing flexible models that can fit many different possible functional forms for \(f.\)

But: Fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors/noise too closely.

Tip

These issues (model flexibility and overfitting) are discussed throughout this course.

Non-Parametric Methods

Non-parametric methods (e.g. \(K\) nearest neighbor regression) do not make explicit assumptions about the functional form of \(f.\) Instead, they make qualitative assumptions on \(f\) such as, for instance, requiring that \(f\) is a smooth (e.g. twice continuously differentiable) function. (A small code sketch of this idea follows the list below.)

  • Pro: By avoiding the assumption of a particular parametric functional form for \(f,\) non-parametric methods have the potential to accurately fit a wider range of possible shapes for \(f.\)

  • Con: Non-parametric methods require a large number of observations to obtain an accurate estimate for \(f;\) far more observations than is typically needed for a parametric approach if the parametric model assumption is correct. (Non-parametric methods are “data-hungry”.)
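The following base-R sketch of \(K\)-nearest-neighbor regression with a single predictor is only meant to illustrate the idea of a non-parametric estimate; the data-generating function and all parameter values are assumptions.

## Hand-written K-nearest-neighbors regression for one predictor (illustration only).
knn_reg <- function(x0, x_train, y_train, K = 5){
  d   <- abs(x_train - x0)   # distances of the training points to x0
  idx <- order(d)[1:K]       # indices of the K nearest neighbors
  mean(y_train[idx])         # average their responses
}

set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)  # assumed f(x) = sin(2*pi*x)
knn_reg(0.5, x, y, K = 10)                   # non-parametric estimate of f(0.5)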

1.2.1 The Trade-Off Between Prediction Accuracy and Model Interpretability

One might reasonably ask the following question:

Why would we ever choose to use a more restrictive method instead of a very flexible approach?

If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between \(Y\) and \(X_1,\dots,X_p.\)
By contrast, very flexible approaches, such as purely non-parametric methods, can lead to such complicated estimates of \(f\) that it is difficult to understand how any individual predictor \(X_j\) is associated with the response \(Y.\)

In some settings, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately. In such settings, we might expect that it will be best to use the most flexible model available. Right?
Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods.

1.2.2 Supervised versus Unsupervised Learning

Most statistical learning problems fall into one of two categories:

  • supervised
  • unsupervised

In supervised learning problems, for each predictor value \(x_i,\) \(i=1,\dots,n,\) we also observe an associated response \(y_i.\)

In unsupervised learning problems, we only observe the predictor values \(x_i,\) \(i=1,\dots,n,\) but not the associated responses \(y_i.\)

Supervised learning methods:

  • regression analysis
  • logistic regression
  • lasso
  • ridge regression

Unsupervised learning methods:

  • cluster analysis (clustering)
  • \(K\)-means

1.2.3 Regression Versus Classification Problems

Variables can be characterized as either quantitative or qualitative (also known as categorical).

Quantitative: Quantitative variables take on numerical values. Examples include a person’s age, height, or income, the value of a house, and the price of a stock.

Qualitative/Categorical: Examples of qualitative variables include a person’s marital status (married or not), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia).

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.

However, the distinction (regression vs. classification) is not always that crisp. Least squares linear regression is used with a quantitative response, whereas logistic regression is typically used with a qualitative (two-class, or binary) response. Thus, despite its name, logistic regression is a classification method. But since it estimates class probabilities, it can be thought of as a regression method as well. Some statistical methods, such as \(K\)-nearest neighbors, can be used in the case of either quantitative or qualitative responses.

1.3 Assessing Model Accuracy

For any given set of data, it is an important task to decide which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

Let \[ \{(X_{01},Y_{01}),(X_{02},Y_{02}),\dots,(X_{0m},Y_{0m})\}, \tag{1.4}\] denote the test data random sample, where \[ (X_{0i},Y_{0i})\overset{\text{iid}}{\sim}(X,Y),\quad i=1,\dots,m. \] with \((X,Y)\) being defined by the general regression model \(Y=f(X)+\epsilon\) in Equation 1.1.

That is, the new test data random sample Equation 1.4

  1. is independent of the training data random sample Equation 1.2
  2. has the same distribution as the training data random sample Equation 1.2

The observed realization \[ \{(x_{01},y_{01}),(x_{02},y_{02}),\dots,(x_{0m},y_{0m})\}, \] of the test data random sample is used to check the accuracy of the estimate \(\hat{f}.\)

1.3.1 Measuring the Quality of Fit

In the regression setting, the most commonly-used measure is the mean squared (prediction) error (MSE).

The global training (data) MSE is given by \[\begin{align*} \operatorname{MSE}_{\text{train}}=\frac{1}{n}\sum_{i=1}^n\left(y_i - \hat{f}(x_i)\right)^2, \end{align*}\] where

  • \(\hat{f}\) is computed from the training data
  • \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) gives for the \(i\)th training data observation.

In general, however, we do not really care how well the method works on the training data. We are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen test data.

Thus, we want to choose the method that gives the lowest test MSE, as opposed to the lowest training MSE.
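A small simulated sketch of this distinction is given below; the data-generating function, the noise level, and the use of a degree-5 polynomial fit are assumptions made only for illustration.

## Sketch: training vs. test MSE for one fitted model (assumed f(x) = sin(2*pi*x)).
set.seed(42)
f       <- function(x) sin(2 * pi * x)
x_train <- runif(50); y_train <- f(x_train) + rnorm(50, sd = 0.3)
x_test  <- runif(50); y_test  <- f(x_test)  + rnorm(50, sd = 0.3)

fit <- lm(y_train ~ poly(x_train, 5))   # some estimate f_hat from the training data

mse_train <- mean((y_train - fitted(fit))^2)
mse_test  <- mean((y_test  - predict(fit, newdata = data.frame(x_train = x_test)))^2)
c(MSE_train = mse_train, MSE_test = mse_test)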

Local (i.e. Point-Wise) Test MSE

Let \(\hat{f}\) be computed from the training data \(\{(x_1,y_1),\dots,(x_n,y_n)\}.\) And let \[ \{(x_{0},y_{01}),(x_{0},y_{02}),\dots,(x_{0},y_{0m})\} \] denote the set of \(m\) test data points \(y_{01},\dots,y_{0m}\) for one given predictor value \(x_0\).

This type of \(x_0\)-specific test data is a realization of a conditional random sample given \(X=x_0,\) \[ (x_{0},Y_{0i})\overset{\text{iid}}{\sim}(X,Y)|X=x_0,\quad i=1,\dots,m. \] This test data random sample is independent of the training data random sample whose realization was used to compute \(\hat{f}.\)

Then, the point-wise test MSE at \(X=x_0\) is given by, \[\begin{align*} \operatorname{MSE}_{\text{test}}(x_0)= \frac{1}{m}\sum_{i=1}^m\left(y_{0i} - \hat{f}(x_{0})\right)^2. \end{align*}\]

Global Test MSE

Typically, however, we want a method to have a low test MSE globally, i.e. for all predictor values in the range of \(X\) (not only at a certain given value \(x_0\)). Let \[ \{(x_{01},y_{01}),(x_{02},y_{02}),\dots,(x_{0m},y_{0m})\} \] denote the set of \(m\) test data points with different predictor values \(x_{01},\dots,x_{0m}\) in the range of \(X\). This type of test data is a realization of a random sample \[ (X_{0i},Y_{0i})\overset{\text{iid}}{\sim}(X,Y),\quad i=1,\dots,m. \] This test data random sample is independent of the training data random sample whose realization was used to compute \(\hat{f}.\)

Then, the global test MSE is given by, \[\begin{align*} \operatorname{MSE}_{\text{test}}=\frac{1}{m}\sum_{i=1}^m\left(y_{0i} - \hat{f}(x_{0i})\right)^2. \end{align*}\]

Note that if \(\hat{f}\) is a really good estimate of \(f,\) i.e. if \(\hat{f}\approx f,\) then \[ \operatorname{MSE}_{\text{test}}\approx \frac{1}{m}\sum_{i=1}^m\epsilon_{0i}^2 \] estimates the variance of the error term \(Var(\epsilon)\), i.e., the irreducible error component.


Figure 2.9 shows training and test MSEs for smoothing spline (R command smooth.spline()) estimates \(\hat{f}\) in the case of

  • a moderately complex \(f\)
  • a moderate signal-to-noise ratio \(\frac{Var(f(X))}{Var(\epsilon)}\)


Figure 2.10 shows training and test MSEs for smoothing spline estimates \(\hat{f}\) in the case of

  • a very simple \(f\)
  • a moderate signal-to-noise ratio \(\frac{Var(f(X))}{Var(\epsilon)}\)


Figure 2.11 shows training and test MSEs for smoothing spline estimates \(\hat{f}\) in the case of

  • a moderately complex \(f\)
  • a very large signal-to-noise ratio \(\frac{Var(f(X))}{Var(\epsilon)}\)


Coding Challenge (Nr 1):

Generate MSE-results similar to those shown in Figure 2.9.
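A possible starting point for this challenge (not a complete solution) is sketched below; the choice of \(f,\) the noise level, and the grid of flexibility values (degrees of freedom of the smoothing spline) are assumptions.

## Possible starting point: simulate train/test data from an assumed f, fit
## smooth.spline() for several flexibility levels (df), and record both MSEs.
set.seed(1)
f       <- function(x) sin(2 * pi * x)   # assumed moderately complex f
n       <- 100
x_train <- runif(n); y_train <- f(x_train) + rnorm(n, sd = 0.5)
x_test  <- runif(n); y_test  <- f(x_test)  + rnorm(n, sd = 0.5)

df_grid <- 2:25                          # flexibility levels
mse <- t(sapply(df_grid, function(df){
  fit <- smooth.spline(x_train, y_train, df = df)
  c(train = mean((y_train - predict(fit, x_train)$y)^2),
    test  = mean((y_test  - predict(fit, x_test)$y)^2))
}))
matplot(df_grid, mse, type = "l", lty = 1, col = c("grey", "red"),
        xlab = "Flexibility (df)", ylab = "MSE")
legend("topright", legend = c("train MSE", "test MSE"),
       col = c("grey", "red"), lty = 1)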

In practice, one can usually compute the training MSE with relative ease, but estimating the test MSE is considerably more difficult because usually no test data are available.

As the three examples in Figures 2.9, 2.10, and 2.11 of our textbook illustrate, the flexibility level corresponding to the model with the minimal test MSE can vary considerably.

Throughout this book, we discuss a variety of approaches that can be used in practice to estimate the minimum point of the test MSE.

One important method is cross-validation, which is a method for estimating the test MSE using the training data.

1.3.2 The Bias-Variance Trade-Off

The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result of two competing properties of statistical learning methods.

  • Let \(\hat{f}\) be estimated from the training data random sample \[ \{(X_1,Y_1),\dots,(X_n,Y_n)\}, \] \[ (X_i,Y_i)\overset{\text{iid}}{\sim}(X,Y),\quad i=1,\dots,n. \] I.e., since \(\hat{f}\) is based on the random variables in the random sample, \(\hat{f}\) is itself a random variable. (A realized observation of the training data random sample yields a realized observation of \(\hat{f}\).)

  • Let \(x_0\) denote a given value of the predictor \(X\)

  • Let \[ \{(x_{0},Y_{01}),\dots,(x_0,Y_{0m})\} \] \[ (x_{0},Y_{0i})\overset{\text{iid}}{\sim}(X,Y)|X=x_0,\quad i=1,\dots,m. \] be the conditional test data random sample given \(X=x_0.\)

One can show that the expected test MSE at a given predictor value \(x_0\) can be decomposed as follows: \[\begin{align*} E\left[\operatorname{MSE}_{test}(x_0)\right] & =E\left[\frac{1}{m}\sum_{i=1}^m\left(Y_{0i}- \hat{f}(x_0)\right)^2\right]\\[2ex] & =\frac{1}{m}\sum_{i=1}^mE\left[\left(Y_{0i}- \hat{f}(x_0)\right)^2\right]\quad(\text{linearity of $E$})\\[2ex] & =\frac{1}{m}\,m\,E\left[\left(Y_{0}- \hat{f}(x_0)\right)^2\right]\quad(\text{since the $Y_{0i}$ are iid})\\[2ex] & =E\left[\left(Y_0- \hat{f}(x_0)\right)^2\right]\\[2ex] & =E\left[\left(f(x_0) + \epsilon_0 - \hat{f}(x_0)\right)^2\right]\quad(Y_0=f(x_0)+\epsilon_0)\\[2ex] & =E\left[\left(f(x_0)- \hat{f}(x_0)\right)^2 +2\left(f(x_0)- \hat{f}(x_0)\right)\epsilon_0 + \epsilon_0^2 \right]\\[2ex] & =E\left[\left(f(x_0)- \hat{f}(x_0)\right)^2\right]\\[2ex] &\quad+ \underbrace{2E\left[\left(f(x_0)- \hat{f}(x_0)\right)\right]\overbrace{E\left[\epsilon_0\right]}^{=0}}_{\text{using independence between training (in $\hat{f}$) and test data}}\\[2ex] &\quad+ E\left[\epsilon_0^2 \right]\\[2ex] & =\underbrace{E\left[\left(f(x_0)- \hat{f}(x_0)\right)^2\right]}_{\text{MSE of $\hat{f}(x_0)$}}+0+Var(\epsilon_0)\\[2ex] & = Var\left(\hat{f}(x_0)\right) + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2 + Var\left(\epsilon_0\right), \end{align*}\] where the last step uses the standard decomposition \(E\left[\left(f(x_0)- \hat{f}(x_0)\right)^2\right]= Var\left(\hat{f}(x_0)\right) + \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2,\) obtained by adding and subtracting \(E[\hat{f}(x_0)]\) inside the square.

The expected MSE at \(x_0,\) \(E\left[\operatorname{MSE}_{test}(x_0)\right],\) refers to the average test MSE that we would obtain if we repeatedly estimated \(f\) using many different training data sets and evaluated each resulting \(\hat{f}\) at \(x_0.\)

Note

A computed value of \(\operatorname{MSE}_{test}(x_0)\) (as done in the coding challenge) is not able to consistently approximate \(E\left[\operatorname{MSE}_{test}(x_0)\right],\) since it is based on only one realization of the training data and thus on only one realization of \(\hat{f}.\)

However, to get information about Bias and Variance of a method, we need to approximate \(E\left[\operatorname{MSE}_{test}(x_0)\right].\) This will be (among others) the topic of Chapter 4.

To minimize the expected test MSE, we need to select a statistical learning method that simultaneously achieves low variance and low bias.

Note that \[ Var\left(\hat{f}(x_0)\right)\geq 0 \] and that \[ \left[\operatorname{Bias}\left(\hat{f}(x_0)\right)\right]^2\geq 0. \] Thus, the expected test MSE can never lie below \(Var(\epsilon),\) i.e. \[ \begin{align*} E\left[\operatorname{MSE}_{test}(x_0)\right] & =E\left[\left(Y_0- \hat{f}(x_0)\right)^2\right] \geq Var\left(\epsilon\right). \end{align*} \]

➡️ The overall, i.e., global expected test MSE can be computed by averaging \(E[(Y_0- \hat{f}(x_0))^2]\) over all possible values of \(x_0\) in the test set.

Variance of \(\hat{f}\) at \(x_0\)

\[ Var(\hat{f}(x_0))=E\left[\left(\hat{f}(x_0) - E\left[\hat{f}(x_0)\right]\right)^2\right] \] Variance refers to the amount by which \(\hat{f}\) would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different \(\hat{f}.\) But ideally the estimate for \(f\) should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in \(\hat{f}.\) In general, more flexible statistical methods have higher variance.

➡️ The overall, i.e., global variance can be computed by averaging \(Var(\hat{f}(x_0))\) over all possible values of \(x_0\) in the test set.

Bias of \(\hat{f}\) at \(x_0\)

\[ \operatorname{Bias}(\hat{f}(x_0))=E\left[\hat{f}(x_0)\right] - f(x_0) \] Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. Generally, more flexible methods result in less bias. As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease—and vice versa.

➡️ The overall, i.e., global bias can be computed by averaging \(\operatorname{Bias}(\hat{f}(x_0))\) over all possible values of \(x_0\) in the test set.
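Since \(\hat{f}(x_0)\) is a random variable (it depends on the random training sample), its variance and bias can be approximated by simulation. The sketch below does this for a smoothing spline; the true \(f,\) the noise level, and the flexibility df = 5 are assumptions made for illustration.

## Sketch: approximating Var(f_hat(x0)) and Bias(f_hat(x0)) by repeatedly
## simulating training data sets (assumed f(x) = sin(2*pi*x), sd(eps) = 0.5).
set.seed(1)
f  <- function(x) sin(2 * pi * x)
x0 <- 0.3
n  <- 100
fhat_x0 <- replicate(500, {
  x <- runif(n)
  y <- f(x) + rnorm(n, sd = 0.5)
  predict(smooth.spline(x, y, df = 5), x0)$y   # f_hat(x0) from this training set
})
c(variance     = var(fhat_x0),                 # Var(f_hat(x0))
  squared_bias = (mean(fhat_x0) - f(x0))^2,    # [Bias(f_hat(x0))]^2
  irreducible  = 0.5^2)                        # Var(epsilon)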

1.3.3 The Classification Setting

Setup:

  • Observed training data \[ \{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\} \]
  • Qualitative response \(y_i\) with categorical class labels. E.g.
    • \(y_i\in\{\text{red},\text{blue}\}\)
    • \(y_i\in\{\text{positive returns},\text{negative returns}\}\)
  • The classifier \(\hat{f}\) is computed from the training data.
  • Predicted training data class labels: \[ \hat{y}_i = \hat{f}(x_i) \]

The alternative to the training MSE is here the training error rate \[ \frac{1}{n}\sum_{i=1}^nI(y_i\neq \hat{y}_i) \] which gives the relative frequency of false categorical predictions.

Here, \[ I(\cdot) \] is an indicator function with \(I(\text{true})=1\) and \(I(\text{false})=0.\)

Let \[ \{(y_{01},x_{01}), (y_{02},x_{02}),\dots, (y_{0m},x_{0m})\} \] denote \(m\) test data observations.

The alternative to the test MSE is here the test error rate \[ \frac{1}{m}\sum_{i=1}^mI(y_{0i}\neq \hat{y}_{0i}), \] where \(\hat{y}_{0i}\) is the predicted class label that results from applying the classifier \(\hat{f}\) (computed from the training data) to the test observation with predictor value \(x_{0i}.\)

A good classifier is one for which the test error rate is smallest.
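In R, an error rate is simply the mean of the indicators \(I(y_i\neq\hat{y}_i)\); the two short label vectors below are made up only to show the computation.

## Sketch: error rate as the average of the indicators I(y_i != y_hat_i).
y     <- c("red", "blue", "red", "blue", "red")   # observed class labels (made up)
y_hat <- c("red", "red",  "red", "blue", "blue")  # predicted class labels (made up)
mean(y != y_hat)                                  # error rate: 2/5 = 0.4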

The Bayes Classifier

It is possible to show (proof is outside of the scope of this course) that the test error rate is minimized, on average, by the classifier that assigns an observation to the most likely class, given its predictor value \(x_{0}.\) This classifier is called the Bayes classifier.

In other words, the Bayes classifier assigns a test observation with predictor vector \(x_{0}\) to the class \(j\) for which \[ P(Y = j | X = x_{0}) \] is largest among all possible class labels \(j\) (e.g. \(j\in\{1,2\}\)).

In a two-class problem where there are only two possible response values, say class \(1\) or class \(2,\) the Bayes classifier corresponds to predicting class \(1\) if \[ P(Y = 1| X = x_0 ) \geq 0.5, \] and class \(2\) if \[\begin{align*} & P(Y = 1| X = x_0 ) < 0.5 \\[2ex] \Leftrightarrow\; & P(Y = 2| X = x_0 ) > 0.5 \end{align*}\]

Classification threshold \(0.5\)?

If no further information is given, one usually uses a threshold of \(0.5\) in a two-class classification problem; or, more generally, a threshold of \(\frac{1}{G}\) in a \(G\)-classes classification problem.

However, in certain applications, different thresholds are used. If, for instance, a certain classification error is very costly, we want to take this into account when choosing the classification threshold in order to reduce the costs due to misclassifications.

Example: \[ y\in\{\text{Person pays back}, \text{Person does not pay back}\} \] The classification error \[ \hat{y}=\text{Person pays back} \neq y = \text{Person does not pay back} \] can be very costly for a bank. So, it makes sense to classify a person with a certain predictor value \(x_0\) to the “Person pays back”-class only if, for instance, \[ \hat{P}(Y = \text{Person pays back}|X=x_0) \geq 0.9, \] and otherwise classify this person to the “Person does not pay back”-class. This will reduce the frequency of misclassifications when classifying into the “Person pays back”-class, and thus reduce the costs.
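The following tiny sketch shows how such a stricter threshold is applied to estimated class probabilities; the probability values are made up for illustration.

## Sketch: applying a stricter classification threshold to estimated probabilities.
p_hat_payback <- c(0.95, 0.80, 0.55, 0.92, 0.30)  # assumed estimates of P(pays back | x0)
threshold     <- 0.9
ifelse(p_hat_payback >= threshold,
       "Person pays back", "Person does not pay back")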

Those values of \(x_0\) for which \[\begin{align*} P(Y = 1| X = x_0 ) = 0.5 = P(Y = 2| X = x_0 ) \end{align*}\] are called the Bayes decision boundary. An example of a Bayes decision boundary is shown as the purple dashed line in Fig. 2.13.

  • Note to Fig. 2.13: Here, a perfect classification (i.e. zero error rate) is impossible, since the Bayes decision boundary does not partition the two groups (yellow, blue) into two completely separate groups.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. The point-wise Bayes error rate at \(x_0\) is given by \[ 1 - \max_{j}P(Y = j| X = x_0 ), \] where the maximization is over all class labels \(j\) (e.g. \(j\in\{1,2\}\)).

The global overall Bayes error rate is given by \[ 1 - E\left(\max_{j}P(Y = j| X )\right), \] where the expectation averages the probability over all possible values of \(X.\)

\(K\)-Nearest Neighbors Classification

In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of \(Y\) given \(X,\) and so computing the Bayes classifier is impossible.

Many approaches attempt to estimate the conditional distribution of \(Y\) given \(X,\) and then classify a given observation to the class with highest estimated probability. One such method is the \(K\)-nearest neighbors (KNN) classifier.

Given a positive integer \(K\) and a test observation \(x_0,\) the KNN classifier first identifies the \(K\) points in the training data \[ \{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}, \] that are closest to \(x_0.\)

This set of \(K\) nearest points (near to \(x_0\)) can be represented by the \(x_0\)-specific index set \[ \mathcal{N}_0=\{i\in\{1,\dots,n\} \;|\; x_i \text{ is among the $K$ closest training points to }x_0\}. \] I.e. \(\mathcal{N}_0\) is an index set that allows us to select the \(K\) nearest neighbors in the training data.

From the definition of \(\mathcal{N}_0\) it follows that:

  • \(\mathcal{N}_0\subset\{1,2,\dots,n\}\)
  • \(|\mathcal{N}_0|=K\)

KNN estimates the conditional probability for class \(j\) as the fraction of the \(K\) points \((x_i,y_i)\) selected by the index-set \(\mathcal{N}_0\) whose response value \(y_i\) equals \(j:\) \[ \begin{align} P(Y = j | X = x_{0}) &\approx \hat{P}(Y = j | X = x_{0})\\[2ex] &= \frac{1}{K}\sum_{i\in\mathcal{N}_0}I(y_i = j), \end{align} \tag{1.5}\] where \(I(\texttt{TRUE})=1\) and \(I(\texttt{FALSE})=0.\)

Finally, KNN classifies the test observation \(x_0\) to the class \(j\) with the largest probability from Equation 1.5.
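The KNN classification rule can also be applied with ready-made functions; the short sketch below uses knn() from the class package on small simulated two-class data (the data and the class labels are assumptions made for illustration).

## Sketch of the KNN rule (1.5) using knn() from the "class" package.
library("class")
set.seed(1)
x_train <- matrix(runif(2 * 50), ncol = 2)                       # 50 training points in [0,1]^2
y_train <- factor(ifelse(x_train[, 1] > 0.5, "blue", "yellow"))  # assumed class labels
x0      <- matrix(c(0.7, 0.3), ncol = 2)                         # one test observation
knn(train = x_train, test = x0, cl = y_train, k = 3, prob = TRUE)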

Figure 2.14 provides an illustrative example of the KNN approach.

  • Two-dimensional predictor \(X=(X_1,X_2),\) where \(X_1\) is shown on the x-axis and \(X_2\) on the y-axis.
  • Two class labels \(Y\in\{\text{yellow}, \text{blue}\}.\)
  • Training data consists of twelve data points \[ \{(y_1,x_{1,1},x_{1,2}),\dots,(y_{12},x_{12,1},x_{12,2})\} \] (See the left panel of Figure 2.14.)
  • Class-label prediction (“classification”) are computed for a regular grid of predictor values \(x_{0}=(x_{01},x_{02}).\) (See the regular grid of points in the right panel of Figure 2.14.)
  • \(K=3\) nearest neighbors are used to compute the class-label predictions.


  • In the left-hand panel of Figure 2.14, a small training data set is shown consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross.
  • In the right-hand panel of Figure 2.14, we have applied the KNN approach with \(K = 3\) at all of the possible values for \(X_1\) and \(X_2,\) and have drawn in the corresponding KNN decision boundary.




Coding Challenge (Nr 2):

The following R-code generates training data for the two-class classification problem with

  • Class 1: Circle with center \((x_1,x_2)=(0.5,0.5)\) and radius \(r=0.2,\) i.e. \[ \text{Cl1} = \{(x_1,x_2)\in[0,1]^2:(x_1-0.5)^2+(x_2-0.5)^2\leq 0.2^2\} \]
  • Class 2: All data points in the square \([0,1]^2\) that do not belong to Class 1, i.e. \[ \text{Cl2} = \{(x_1,x_2)\in[0,1]^2: (x_1,x_2)\not\in\text{Cl1}\} \]
# install.packages("plotrix") # Install plotrix package
library("plotrix")   # Load plotrix package (draw.circle())

## Class 1: Circle  
radius <- 0.2
center <- c(0.5,0.5)
## Class 2: [0,1] x [0,1] \ Class 1

## function to map data points to class labels "Cl1" and "Cl2"
my_class_fun <- function(x1, x2, cent = center, rad = radius, error = 0){
  tmp    <- (x1 - cent[1])^2 + (x2 - cent[2])^2
  rad    <- rad + sample(c(-1,1), 1) * error
  ifelse(tmp <= rad^2, "Cl1", "Cl2")
}
my_class_fun <- Vectorize(my_class_fun, c("x1", "x2"))

## ##################################
## Generate training data (with error)
## ##################################
set.seed(321)

## number of training data points
n_train    <- 500

## error
error  <- 0.05

train_X1 <- runif(n_train, min = 0, max = 1)
train_X2 <- runif(n_train, min = 0, max = 1)
train_Cl <- my_class_fun(x1 = train_X1, 
                         x2 = train_X2, error = error)
train_Cl <- as.factor(train_Cl)
summary(train_Cl)
Cl1 Cl2 
 46 454 
data_train <- data.frame(
  "Cl" = train_Cl,
  "X1" = train_X1, 
  "X2" = train_X2
)

head(data_train)
   Cl        X1        X2
1 Cl2 0.9558938 0.5858416
2 Cl2 0.9372855 0.3106846
3 Cl2 0.2382205 0.7639562
4 Cl2 0.2550736 0.1329140
5 Cl2 0.3905120 0.2018027
6 Cl1 0.3411799 0.6347841
par(mfrow = c(1,2))
plot(x = 0, y = 0, 
     type = "n", 
     xlim = c(0,1),
     ylim = c(0,1), 
     xlab = expression(X[1]),
     ylab = expression(X[2]), asp=1)
abline(h = c(0,1))
draw.circle(x = center[1], y = center[2], radius = radius)
text(x = 0.5, y = 0.5, label = "Class 1")
text(x = 0.1, y = 0.9, label = "Class 2")

plot(x = 0, y = 0, 
     type = "n", 
     xlim = c(0,1),
     ylim = c(0,1), 
     xlab = expression(X[1]),
     ylab = expression(X[2]), asp=1,
     main = "Training data with errors")
abline(h = c(0,1))
draw.circle(x = center[1], y = center[2], radius = radius)

points(x   = data_train$X1, 
       y   = data_train$X2, 
       col = data_train$Cl,  pch = 19)

Challenge:

  • Write a KNN-function that uses the training data to classify a grid of test data in \([0,1]^2\) into the two classes.
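One possible starting point (certainly not the only solution, and deliberately kept simple) is a hand-written majority-vote KNN classifier applied to a regular grid, using the data_train object generated above; the grid spacing and K = 3 are arbitrary choices.

## Possible starting point: a small hand-written KNN classifier on a grid.
my_knn <- function(x0, train, K = 3){
  d   <- sqrt((train$X1 - x0[1])^2 + (train$X2 - x0[2])^2)  # Euclidean distances
  idx <- order(d)[1:K]                                      # K nearest neighbors
  names(which.max(table(train$Cl[idx])))                    # majority vote
}

grid <- expand.grid(X1 = seq(0, 1, by = 0.05),
                    X2 = seq(0, 1, by = 0.05))
grid$Cl_hat <- apply(grid, 1, function(row)
  my_knn(c(row["X1"], row["X2"]), train = data_train, K = 3))

plot(grid$X1, grid$X2, col = factor(grid$Cl_hat), pch = 19,
     xlab = expression(X[1]), ylab = expression(X[2]), asp = 1)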

1.4 R-Lab: Introduction to R

This tutorial aims to serve as an introduction to the software package R. Other very good and much more exhaustive tutorials and useful reference-cards can be found at the following links:

Some other tutorials:

Why R?

  • R is free of charge from: www.r-project.org
  • The celebrated IDE RStudio for R is also free of charge: www.rstudio.com
  • R is equipped with one of the most flexible and powerful graphics routines available anywhere. For instance, check out one of the following repositories:
  • Today, R is the de-facto standard for statistical science.

1.4.1 Short Glossary

Let's start the tutorial with a (very) short glossary:

  • Console: The thing with the > sign at the beginning.
  • Script file: An ordinary text file with suffix .R. For instance, yourFavoritFileName.R.
  • Working directory: The file-directory you are working in. Useful commands: with getwd() you get the location of your current working directory and setwd() allows you to set a new location for it.
  • Workspace: This is a hidden file (stored in the working directory), where all objects you use (e.g., data, matrices, vectors, variables, functions, etc.) are stored. Useful commands: ls() shows all elements in our current workspace and rm(list=ls()) deletes all elements in our current workspace (see the short example below).
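A short example of these glossary commands (the path in the setwd() call is only a placeholder):

getwd()                  # shows the current working directory
# setwd("C:/your_path")  # sets a new working directory (placeholder path)
x <- 1                   # create an object ...
ls()                     # ... it now appears in the workspace listing
rm(list = ls())          # deletes all objects in the workspace
ls()                     # character(0)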

1.4.2 First Steps

A good idea is to use a script file such as yourFavoritFileName.R in order to store your R commands. You can send single lines or marked regions of your R-code to the console by pressing the keys CTRL+ENTER.

To begin with baby steps, do some simple computations:

2+2 # and all the others: *,/,-,^2,^3,... 
[1] 4

Note: Everything that is written after the #-sign is ignored by R, which is very useful to comment your code.

The assignment operator <- or = will be your most often used tool. Here an example to create a scalar variable:

x <- 4 
x
[1] 4
4 -> x # possible but unusual
x
[1] 4

Note: The R community loves the <- assignment operator, which is a very unusual syntax. Alternatively, you can use the more common = operator, which is also used in languages like Python or MATLAB.

And now a more interesting object - a vector:

y <- c(2,7,4,1)
y
[1] 2 7 4 1

The command ls() shows the total content of your current workspace, and the command rm(list=ls()) deletes all elements of your current workspace:

ls()
 [1] "center"       "data_train"   "error"        "my_class_fun" "n_train"     
 [6] "radius"       "train_Cl"     "train_X1"     "train_X2"     "x"           
[11] "y"           
rm(list=ls())
ls()
character(0)

Note: RStudio’s Environment pane also lists all the elements in your current workspace. That is, the command ls() becomes a bit obsolete when working with RStudio.

Let’s see how we can compute with vectors and scalars in R.

x <- 4
y <- c(2,7,4,1)

x*y # each element in the vector, y, is multiplied by the scalar, x.
[1]  8 28 16  4
y*y # this is a term by term product of the elements in y
[1]  4 49 16  1

Performing vector multiplications as you might expect from your last math-course, e.g., an outer product: \(y\,y^\top\):

y %*% t(y)
     [,1] [,2] [,3] [,4]
[1,]    4   14    8    2
[2,]   14   49   28    7
[3,]    8   28   16    4
[4,]    2    7    4    1

Or an inner product \(y^\top y\):

t(y) %*% y
     [,1]
[1,]   70

Note: Sometimes, R’s treatment of vectors can be annoying. The product y %*% y is treated as the product t(y) %*% y.

The term-by-term execution as in the above example, y*y, is actually a central strength of R. We can conduct many operations vector-wise:

y^2
[1]  4 49 16  1
log(y)
[1] 0.6931472 1.9459101 1.3862944 0.0000000
exp(y)
[1]    7.389056 1096.633158   54.598150    2.718282
y-mean(y)
[1] -1.5  3.5  0.5 -2.5
(y-mean(y))/sd(y) # standardization 
[1] -0.5669467  1.3228757  0.1889822 -0.9449112

This is a central characteristic of so-called matrix-based languages like R (or MATLAB). Other programming languages often have to use loops instead:

N <- length(y)
1:N

y.sq <- numeric(N)
y.sq

for(i in 1:N){
  y.sq[i] <- y[i]^2
  if(i == N){
    print(y.sq)
  }
}

The for()-loop is the most common loop. But there is also a while()-loop and a repeat-loop. However, loops in R can be rather slow; therefore, try to avoid them!
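For comparison, the squared values computed by the loop above can be obtained without any explicit loop; a small sketch:

## The same computation without an explicit loop:
y <- c(2, 7, 4, 1)
y^2                          # vectorized
sapply(y, function(v) v^2)   # functional alternative (still element-wise)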

Useful commands to produce sequences of numbers:

1:10
-10:10
?seq # Help for the seq()-function
seq(from=1, to=100, by=7)

Using the sequence command 1:16, we can go for our first matrix:

?matrix
A <- matrix(data=1:16, nrow=4, ncol=4)
A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
A <- matrix(1:16, 4, 4)

Note that a matrix has always two dimensions, but a vector has only one dimension:

dim(A)    # Dimension of matrix A?
[1] 4 4
dim(y)    # dim() does not operate on vectors.
NULL
length(y) # Length of vector y?
[1] 4

Let's play a bit with the matrix A and the vector y. As we have seen in the loop above, the []-operator selects elements of vectors and matrices:

A[,1]
A[4,4]
y[c(1,4)]

This can be done on a more logical basis, too. For example, if you want to know which elements in the first column of matrix A are strictly greater than 2:

A[,1][A[,1]>2]
[1] 3 4
# Note that this give you a boolean vector:
A[,1]>2
[1] FALSE FALSE  TRUE  TRUE
# And you can use it in a non-sense relation, too:
y[A[,1]>2]
[1] 4 1

Note: Logical operations return so-called boolean objects, i.e., either a TRUE or a FALSE. For instance, if we ask R whether 1>2 we get the answer FALSE.

1.4.3 Further Data Objects

Besides classical data objects such as scalars, vectors, and matrices there are three further data objects in R:

  1. The array: As a matrix but with more dimensions. Here is an example of a \(2\times 2\times 2\)-dimensional array:
myFirst.Array <- array(c(1:8), dim=c(2,2,2)) # Take a look at it!
  2. The list: In lists you can organize different kinds of data. E.g., consider the following example:
myFirst.List <- list("Some_Numbers" = c(66, 76, 55, 12, 4, 66, 8, 99), 
                     "Animals"      = c("Rabbit", "Cat", "Elefant"),
                     "My_Series"    = c(30:1)) 

A very useful function to find specific values and entries within lists is the str()-function:

str(myFirst.List)
List of 3
 $ Some_Numbers: num [1:8] 66 76 55 12 4 66 8 99
 $ Animals     : chr [1:3] "Rabbit" "Cat" "Elefant"
 $ My_Series   : int [1:30] 30 29 28 27 26 25 24 23 22 21 ...
  3. The data frame: A data.frame is a list-object but with some more formal restrictions (e.g., equal number of rows for all columns). As indicated by its name, a data.frame-object is designed to store data:
myFirst.Dataframe <- data.frame("Credit_Default"   = c( 0, 0, 1, 0, 1, 1), 
                                "Age"              = c(35,41,55,36,44,26), 
                                "Loan_in_1000_EUR" = c(55,65,23,12,98,76)) 
# Take a look at it!

1.4.4 Simple Regression Analysis using R

Alright, let’s do some statistics with real data. You can download the data HERE. Save it on your computer, at a place where you can find it, and give the path (e.g. "C:\path\auto.data.csv"), which references the data, to the file-argument of the function read.csv():

# ATTENTION! YOU HAVE TO CHANGE "\" TO "/":
auto.data <- read.csv(file="C:/your_path/autodata.txt", header=TRUE)
head(auto.data)

If you have problems to read the data into R, go on with these commands. (For this you need a working internet connection!):

# install.packages("readr")
library("readr")
auto.data <- suppressMessages(read_csv(file = "https://cdn.rawgit.com/lidom/Teaching_Repo/bc692b56/autodata.csv",col_names = TRUE))
# head(auto.data)

You can select specific variables of the auto.data using the $-operator:

gasolin.consumption      <- auto.data$MPG.city
car.weight               <- auto.data$Weight
## Take a look at the first elements of these vectors:
head(cbind(gasolin.consumption,car.weight))
     gasolin.consumption car.weight
[1,]                  25       2705
[2,]                  18       3560
[3,]                  20       3375
[4,]                  19       3405
[5,]                  22       3640
[6,]                  22       2880

This is how you can produce your first plot:

## Plot the data:
plot(y=gasolin.consumption, x=car.weight, 
     xlab="Car-Weight (US-Pounds)", 
     ylab="Consumption (Miles/Gallon)", 
     main="Buy Light-Weight Cars!")

Figure 1.1: Scatterplot of Gasoline consumption (mpg) vs. car weight.

As a first step, we might assume a simple kind of linear relationship between the variables gasolin.consumption and car.weight. Let us assume that the data was generated by the following simple regression model: \[ y_i=\alpha+\beta_1 x_i+\varepsilon_i,\quad i=1,\dots,n \] where \(y_i\) denotes the gasoline consumption, \(x_i\) the weight of car \(i\), and \(\varepsilon_i\) is a mean zero, constant variance noise term. (This is clearly a nonsense model!)

The command lm() computes the estimates of this linear regression model. The command (in fact it’s a method) summary() computes further quantities of general interest from the object that was returned from the lm() function.

lm.result   <- lm(gasolin.consumption~car.weight)
lm.summary  <- summary(lm.result)
lm.summary

Call:
lm(formula = gasolin.consumption ~ car.weight)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7946 -1.9711  0.0249  1.1855 13.8278 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 47.048353   1.679912   28.01   <2e-16 ***
car.weight  -0.008032   0.000537  -14.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.038 on 91 degrees of freedom
Multiple R-squared:  0.7109,    Adjusted R-squared:  0.7077 
F-statistic: 223.8 on 1 and 91 DF,  p-value: < 2.2e-16

Of course, we want to have a possibility to access all the quantities computed so far, e.g., in order to plot the results. This can be done as follows:

## Accessing the computed quantities
names(lm.summary) ## Alternatively: str(lm.summary)
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
alpha <- lm.summary$coefficients[1]
beta  <- lm.summary$coefficients[2]

## Plot all:
plot(y=gasolin.consumption, x=car.weight, 
     xlab="Car-Weight (US-Pounds)", 
     ylab="Consumption (Miles/Gallon)", 
     main="Buy light-weight Cars!")
abline(a=alpha, 
       b=beta, col="red")

Scatterplot of Gasoline consumption (mpg) vs. car weight plus linear regression fit.

1.4.5 Programming in R

Let’s write, i.e., program our own R-function for estimating linear regression models. In order to be able to validate our function, we start with simulating data for which we then know all true parameters.

Simulating data is like being the “Data-God”: For instance, we generate realizations of the error term \(\varepsilon_i\), i.e., something which we never observe in real data.

Let us consider the following multiple regression model:

\[y_i=\beta_1 +\beta_2 x_{2i}+\beta_3 x_{3i}+\varepsilon_{i},\quad i=1,\dots,n,\] where \(\varepsilon_{i}\) is a heteroscedastic error term \[\varepsilon_{i}\sim N(0,\sigma_i^2),\quad \sigma_i=|x_{3i}|,\]

and where for all \(i=1,\dots,n=50\):

  • \(x_{2i}\sim N(10,1.5^2)\)
  • \(x_{3i}\) comes from a t-distribution with 5 degrees of freedom and non-centrality parameter 2
set.seed(109) # Sets the "seed" of the random number generators:
n   <- 50     # Number of observations

## Generate two explanatory variables plus an intercept-variable:
X.1 <- rep(1, n)                 # Intercept
X.2 <- rnorm(n, mean=10, sd=1.5) # Draw realizations from a normal distr.
X.3 <- rt(n, df=5, ncp=2)        # Draw realizations from a t-distr.
X   <- cbind(X.1, X.2, X.3)      # Save as an n x 3-dimensional data matrix.

OK, we have regressors, i.e., data that we also have in real data sets.

Now we define the elements of the \(\beta\)-vector. Be aware of the difference: In real data sets we do not know the true \(\beta\)-vector, but try to estimate it. However, when simulating data, we determine (as “Data-Gods”) the true \(\beta\)-vector and can compare our estimate \(\hat{\beta}\) with the true \(\beta\):

## Define the slope-coefficients
beta.vec  <- c(1,-5,5)

We still need to simulate realizations of the dependent variable \(y_i\). Remember that \(y_i=\beta_1 x_{1i}+\beta_2 x_{2i}+\beta_3 x_{3i}+\varepsilon_{i}\) with \(x_{1i}=1\) (the intercept variable). That is, we only need realizations from the error terms \(\varepsilon_i\) in order to compute the realizations from \(y_i\). This is how you can simulate realizations from the heteroscedastic error terms \(\varepsilon_i\):

## Generate realizations from the heteroscedastic error term
eps       <- rnorm(n, mean=0, sd=abs(X.3))

Take a look at the heteroscedasticity in the error term:

plot(y=eps, x=X.3, 
     main="Realizations of the \nHeteroscedastic Error Term")

Scatterplot of error term realizations (usually unknown) versus the predictor values of X.3.

With the (pseudo-random) realizations from \(\varepsilon_i\), we can finally generate realizations from the dependent variable \(y_i\):

## Dependent variable:
y   <- X %*% beta.vec + eps

Let’s take a look at the data:

mydata    <- data.frame("Y"=y, "X.1"=X.1, "X.2"=X.2, "X.3"=X.3)
pairs(mydata[,-2]) # The '-2' removes the intercept variable "X.1"

Once we have data, we can compute the OLS estimate of the true \(\beta\) vector. Remember the formula: \[\hat{\beta}=(X^\top X)^{-1}X^\top y\] In R-Code this is: \((X^\top X)^{-1}=\)solve(t(X) %*% X), i.e.:

## Computation of the beta-Vector:
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat
         [,1]
X.1 -2.609634
X.2 -4.692735
X.3  5.078342

Well done. Using the above lines of code we can easily program our own myOLSFun() function!

myOLSFun <- function(y, x, add.intercept=FALSE){
  
  ## Number of Observations:
  n         <- length(y)
  
  ## Add an intercept to x:
  if(add.intercept){
    Intercept <- rep(1, n)
    x         <- cbind(Intercept, x)
  }
  
  ## Estimation of the slope-parameters:
  beta.hat.vec <- solve(t(x) %*% x) %*% t(x) %*% y
  
  ## Return the result:
  return(beta.hat.vec)
}

## Run the function:
myOLSFun(y=y, x=X)
         [,1]
X.1 -2.609634
X.2 -4.692735
X.3  5.078342

Can you extend the function for the computation of the covariance matrix of the slope-estimates, several measures of fit (R\(^2\), adj.-R\(^2\), etc.), t-tests, …? One possible sketch of such an extension is given below.
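A possible sketch (one of many ways to do it; it assumes homoscedastic errors so that the classical covariance formula \(\hat{\sigma}^2 (X^\top X)^{-1}\) applies, and the name myOLSFun2 is made up):

myOLSFun2 <- function(y, x){
  n        <- length(y)
  p        <- ncol(x)
  beta.hat <- solve(t(x) %*% x) %*% t(x) %*% y      # OLS estimate
  resid    <- y - x %*% beta.hat                    # residuals
  sigma2   <- sum(resid^2) / (n - p)                # error variance estimate
  cov.beta <- sigma2 * solve(t(x) %*% x)            # covariance matrix of beta.hat
  t.values <- beta.hat / sqrt(diag(cov.beta))       # t-statistics (H0: beta_j = 0)
  R2       <- 1 - sum(resid^2) / sum((y - mean(y))^2)
  adj.R2   <- 1 - (1 - R2) * (n - 1) / (n - p)
  list(coef = beta.hat, cov = cov.beta, t.values = t.values,
       R.squared = R2, adj.R.squared = adj.R2)
}

## Run the extended function:
myOLSFun2(y = y, x = X)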

1.4.6 R-packages

One of the best features of R is its contributed packages. The list of all packages on CRAN is impressive! Take a look at it HERE

For instance, nice plots can be produced using the R-package ggplot2. You can find an intro to this package HERE.

# install.packages("ggplot2")
library("ggplot2")

qplot(Sepal.Length, Petal.Length, data = iris, color = Species)
Warning: `qplot()` was deprecated in ggplot2 3.4.0.

Of course, ggplot2 concerns “only” plotting, but you’ll find R-packages for almost any statistical method out there.

1.4.7 Tidyverse

The tidyverse package is a collection of packages that lets you import, manipulate, explore, visualize and model data in a harmonized and consistent way which helps you to be more productive.

Installing the tidyverse package:

install.packages("tidyverse")

To use the tidyverse package load it using the library() function:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ stringr   1.5.0
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Chick Weight Data

R comes with many datasets installed. We will use the ChickWeight dataset to learn (a little) about the tidyverse. The help system gives a basic summary of the experiment from which the data was collected:

“The body weights of the chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets.”

You can get more information, including references by typing:

help("ChickWeight")

The Data: There are 578 observations (rows) and 4 variables:

  • Chick – unique ID for each chick.
  • Diet – one of four protein diets.
  • Time – number of days since birth.
  • weight – body weight of chick in grams.

Note: weight has a lower case w (recall R is case sensitive).

Store the data locally:

ChickWeight %>%
  dplyr::select(Chick, Diet, Time, weight) %>% 
  dplyr::arrange(Chick, Diet, Time) %>% 
  write_csv("DATA/ChickWeight.csv")

First we will import the data from a file called ChickWeight.csv using the read_csv() function from the readr package (part of the tidyverse). The first thing to do, outside of R, is to open the file ChickWeight.csv to check what it contains and that it makes sense. Now we can import the data as follows:

CW <- readr::read_csv("DATA/ChickWeight.csv")
Rows: 578 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Chick, Diet, Time, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If all goes well then the data is now stored in an R object called CW. If you get the following error message then you need to change the working directory to where the data is stored:

Error: ‘ChickWeight.csv’ does not exist in current working directory …

Changing the working directory: In RStudio you can use the menu bar (“Session - Set Working Directory - Choose Directory…”). Alternatively, you can use the function setwd(). Last but not least, to avoid issues with broken paths to files and data sets, use RStudio’s “Project” tools.

Looking at the Dataset: To look at the data, just type the object (dataset) name:

CW
# A tibble: 578 × 4
   Chick  Diet  Time weight
   <dbl> <dbl> <dbl>  <dbl>
 1    18     1     0     39
 2    18     1     2     35
 3    16     1     0     41
 4    16     1     2     45
 5    16     1     4     49
 6    16     1     6     51
 7    16     1     8     57
 8    16     1    10     51
 9    16     1    12     54
10    15     1     0     41
# ℹ 568 more rows

If there are too many variables then not all of them may be printed. To overcome this issue we can use the glimpse() function which makes it possible to see every column in your dataset (called a “data frame” in R speak).

glimpse(CW)
Rows: 578
Columns: 4
$ Chick  <dbl> 18, 18, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 15, 15, 15,…
$ Diet   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Time   <dbl> 0, 2, 0, 2, 4, 6, 8, 10, 12, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4…
$ weight <dbl> 39, 35, 41, 45, 49, 51, 57, 51, 54, 41, 49, 56, 64, 68, 68, 67,…

The function View() allows for a spread-sheet type of view on the data:

View(CW)

1.4.7.1 Tidyverse: Plotting Basics

To visualize the chick weight data, we will use the ggplot2 package (part of the tidyverse). Our interest is in seeing how the weight changes over time for the chicks by diet. For the moment don’t worry too much about the details, just try to build your own understanding and logic. To learn more, try different things even if you get error messages.

Let’s plot the weight data (vertical axis) over time (horizontal axis). Generally, ggplot2 works in layers. The following code generates an empty plot:

# An empty plot
ggplot(CW, aes(Time, weight))  

Empty ggplot layer.

To the empty plot, one can add further layers:

# Adding a scatter plot 
ggplot(CW, aes(Time, weight)) + geom_point() 

Adding a scatter plot layer to the empty ggplot layer.

Add color for Diet. The graph above does not differentiate between the diets. Let’s use a different color for each diet.

# Adding colour for diet
ggplot(CW,aes(Time,weight,colour=factor(Diet))) +
  geom_point() 

Adding a further layer showing the effect of Diet.

It is difficult to conclude anything from this graph as the points are printed on top of one another (with diet 1 underneath and diet 4 at the top).

To improve the plot, it will be handy to store Diet and Time as factor variables.

Factor Variables: Before we continue, we have to make an important change to the CW dataset by making Diet and Time factor variables. This means that R will treat them as categorical variables (see the <fct> variables below) instead of continuous variables. It will simplify our coding. The next section will explain the mutate() function.

CW <- mutate(CW, Diet = factor(Diet))
CW <- mutate(CW, Time = factor(Time))
glimpse(CW)
Rows: 578
Columns: 4
$ Chick  <dbl> 18, 18, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 15, 15, 15,…
$ Diet   <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Time   <fct> 0, 2, 0, 2, 4, 6, 8, 10, 12, 0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4…
$ weight <dbl> 39, 35, 41, 45, 49, 51, 57, 51, 54, 41, 49, 56, 64, 68, 68, 67,…

The facet_wrap() function: To plot each diet separately in a grid, use facet_wrap():

# Plot each diet in its own panel
ggplot(CW, aes(Time, weight, colour=Diet)) +
  geom_point() +
  facet_wrap(~Diet) +
  theme(legend.position = "bottom")

Interpretation: Diet 4 has the least variability but we can’t really say anything about the mean effect of each diet although diet 3 seems to have the highest.

Next we will plot the mean changes over time for each diet using the stat_summary() function:

ggplot(CW, aes(Time, weight, 
               group=Diet, colour=Diet)) +
  stat_summary(fun="mean", geom="line") 

Interpretation: We can see that diet 3 has the highest mean weight gains by the end of the experiment. However, we don’t have any information about the variation (uncertainty) in the data.

To see the variation between the different diets we use geom_boxplot to plot a box-whisker plot. A note of caution: the number of chicks per diet is relatively low for producing this kind of plot.

ggplot(CW, aes(Time, weight, colour=Diet)) +
  facet_wrap(~Diet) +
  geom_boxplot() +
  theme(legend.position = "none") +
  ggtitle("Chick Weight over Time by Diet")

Interpretation: Diet 3 seems to have the highest “average” weight gain but it has more variation than diet 4 which is consistent with our findings so far.

Let’s finish with a plot that you might include in a publication.

ggplot(CW, aes(Time, weight, group=Diet, 
                             colour=Diet)) +
  facet_wrap(~Diet) +
  geom_point() +
  # geom_jitter() +
  stat_summary(fun="mean", geom="line",
               colour="black") +
  theme(legend.position = "none") +
  ggtitle("Chick Weight over Time by Diet") + 
  xlab("Time (days)") +
  ylab("Weight (grams)")

1.4.7.2 Tidyverse: Data Wrangling Basics

In this section we will learn how to wrangle (manipulate) datasets using the tidyverse package. Let’s start with the mutate(), select(), rename(), filter() and arrange() functions.

mutate(): Adds a new variable (column) or modifies an existing one. We already used this above to create factor variables.

# Added a column
CWm1 <- mutate(CW, weightKg = weight/1000)
CWm1
# A tibble: 578 × 5
  Chick Diet  Time  weight weightKg
  <dbl> <fct> <fct>  <dbl>    <dbl>
1    18 1     0         39    0.039
2    18 1     2         35    0.035
3    16 1     0         41    0.041
# ℹ 575 more rows
# Modify an existing column
CWm2 <- mutate(CW, Diet = str_c("Diet ", Diet))
CWm2
# A tibble: 578 × 4
  Chick Diet   Time  weight
  <dbl> <chr>  <fct>  <dbl>
1    18 Diet 1 0         39
2    18 Diet 1 2         35
3    16 Diet 1 0         41
# ℹ 575 more rows

select(): Keeps, drops or reorders variables.

# Drop the weight variable from CWm1 using minus
dplyr::select(CWm1, -weight)
# A tibble: 578 × 4
  Chick Diet  Time  weightKg
  <dbl> <fct> <fct>    <dbl>
1    18 1     0        0.039
2    18 1     2        0.035
3    16 1     0        0.041
# ℹ 575 more rows
# Keep the variables Chick, Time, Diet and weightKg
dplyr::select(CWm1, Chick, Time, Diet, weightKg)
# A tibble: 578 × 4
  Chick Time  Diet  weightKg
  <dbl> <fct> <fct>    <dbl>
1    18 0     1        0.039
2    18 2     1        0.035
3    16 0     1        0.041
# ℹ 575 more rows

rename(): Renames variables whilst keeping all variables.

dplyr::rename(CW, Group = Diet, Weight = weight)
# A tibble: 578 × 4
  Chick Group Time  Weight
  <dbl> <fct> <fct>  <dbl>
1    18 1     0         39
2    18 1     2         35
3    16 1     0         41
# ℹ 575 more rows

filter(): Keeps or drops observations (rows).

dplyr::filter(CW, Time==21 & weight>300)
# A tibble: 8 × 4
  Chick Diet  Time  weight
  <dbl> <fct> <fct>  <dbl>
1     7 1     21       305
2    29 2     21       309
3    21 2     21       331
# ℹ 5 more rows

For comparing values in vectors use: < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to) and != (not equal to). These can be combined logically using & (and) and | (or).
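
As a small illustration (a sketch on the CW data; the cut-off values are arbitrary), these operators can be combined inside filter():

# Day-21 chicks that are either very light or very heavy
dplyr::filter(CW, Time == 21 & (weight < 100 | weight > 300))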

arrange(): Changes the order of the observations.

dplyr::arrange(CW, Chick, Time)
# A tibble: 578 × 4
  Chick Diet  Time  weight
  <dbl> <fct> <fct>  <dbl>
1     1 1     0         42
2     1 1     2         51
3     1 1     4         59
# ℹ 575 more rows
dplyr::arrange(CW, desc(weight))
# A tibble: 578 × 4
  Chick Diet  Time  weight
  <dbl> <fct> <fct>  <dbl>
1    35 3     21       373
2    35 3     20       361
3    34 3     21       341
# ℹ 575 more rows

What does the desc() do? Try using desc(Time).
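
If you want to check your answer: desc() reverses the sort order of the variable it wraps. One possible call (a sketch) is:

# Diets in ascending order, Time within each diet in descending order
dplyr::arrange(CW, Diet, desc(Time))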

1.4.7.3 The pipe operator %>%

In reality you will end up doing multiple data wrangling steps that you want to save. The pipe operator %>% makes your code nice and readable:

CW21 <- CW %>% 
  dplyr::filter(Time %in% c(0, 21)) %>% 
  dplyr::rename(Weight = weight) %>% 
  dplyr::mutate(Group = factor(str_c("Diet ", Diet))) %>% 
  dplyr::select(Chick, Group, Time, Weight) %>% 
  dplyr::arrange(Chick, Time) 
CW21
# A tibble: 95 × 4
  Chick Group  Time  Weight
  <dbl> <fct>  <fct>  <dbl>
1     1 Diet 1 0         42
2     1 Diet 1 21       205
3     2 Diet 1 0         40
# ℹ 92 more rows

Hint: To understand the code above we should read the pipe operator %>% as “then”.

Create a new dataset (object) called CW21 using dataset CW, then keep the data for days 0 and 21, then rename the variable weight to Weight, then create a variable called Group, then keep the variables Chick, Group, Time and Weight, and then finally arrange the data by the variables Chick and Time.

This is the same code:

CW21 <- CW %>% 
  dplyr::filter(., Time %in% c(0, 21)) %>% 
  dplyr::rename(., Weight = weight) %>% 
  dplyr::mutate(., Group=factor(str_c("Diet ",Diet))) %>% 
  dplyr::select(., Chick, Group, Time, Weight) %>% 
  dplyr::arrange(., Chick, Time) 

The pipe operator, %>%, replaces the dots (.) with whatever is returned from code preceding it. For example, the dot in filter(., Time %in% c(0, 21)) is replaced by CW. The output of the filter(...) then replaces the dot in rename(., Weight = weight) and so on. Think of it as a data assembly line with each function doing its thing and passing it to the next.
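
In other words, x %>% f(y) is essentially the same call as f(x, y). A minimal sketch:

# These two lines are equivalent: the pipe inserts CW as the first argument
dplyr::filter(CW, Time == 21)
CW %>% dplyr::filter(Time == 21)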

1.4.7.4 The group_by() function

From the data visualizations above we concluded that diet 3 has the highest mean and diet 4 the least variation. In this section, we will quantify the effects of the diets using summary statistics. We start by looking at the number of observations and the mean weight by diet and time.

mnsdCW <- CW %>% 
  dplyr::group_by(Diet, Time) %>% 
  dplyr::summarise(N = n(), Mean = mean(weight)) %>% 
  dplyr::arrange(Diet, Time)
`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
mnsdCW
# A tibble: 48 × 4
# Groups:   Diet [4]
  Diet  Time      N  Mean
  <fct> <fct> <int> <dbl>
1 1     0        20  41.4
2 1     2        20  47.2
3 1     4        19  56.5
# ℹ 45 more rows

For each distinct combination of Diet and Time, the chick weight data is summarized into the number of observations (N) and the mean (Mean) of weight.

Further summaries: Let’s also calculate the standard deviation, median, minimum and maximum values but only at days 0 and 21.

sumCW <-  CW %>% 
  dplyr::filter(Time %in% c(0, 21)) %>% 
  dplyr::group_by(Diet, Time) %>% 
  dplyr::summarise(N = n(),
            Mean = mean(weight),
            SD = sd(weight),
            Median = median(weight),
            Min = min(weight),
            Max = max(weight)) %>% 
  dplyr::arrange(Diet, Time)
`summarise()` has grouped output by 'Diet'. You can override using the
`.groups` argument.
sumCW
# A tibble: 8 × 8
# Groups:   Diet [4]
  Diet  Time      N  Mean     SD Median   Min   Max
  <fct> <fct> <int> <dbl>  <dbl>  <dbl> <dbl> <dbl>
1 1     0        20  41.4  0.995   41      39    43
2 1     21       16 178.  58.7    166      96   305
3 2     0        10  40.7  1.49    40.5    39    43
# ℹ 5 more rows

Let’s make the summaries “prettier”, say, for a report or publication.

library("knitr") # to use the kable() function
prettySumCW <- sumCW %>% 
 dplyr::mutate(`Mean (SD)` = str_c(format(Mean, digits=1),
           " (", format(SD, digits=2), ")")) %>% 
 dplyr::mutate(Range = str_c(Min, " - ", Max)) %>% 
 dplyr::select(Diet, Time, N, `Mean (SD)`, Median, Range) %>%
 dplyr::arrange(Diet, Time) %>% 
 kable(format = "latex")
prettySumCW
Diet  Time   N  Mean (SD)     Median  Range
1     0     20  41 ( 0.99)      41.0  39 - 43
1     21    16  178 (58.70)    166.0  96 - 305
2     0     10  41 ( 1.5)       40.5  39 - 43
2     21    10  215 (78.1)     212.5  74 - 331
3     0     10  41 ( 1)         41.0  39 - 42
3     21    10  270 (72)       281.0  147 - 373
4     0     10  41 ( 1.1)       41.0  39 - 42
4     21     9  239 (43.3)     237.0  196 - 322

Interpretation: This summary table offers the same interpretation as before, namely that diet 3 has the highest mean and median weights at day 21 but a higher variation than group 4. However, it should be noted that by day 21, diet 1 had lost 4 of the 20 chicks that started and diet 4 had lost 1 of 10. This could be a sign of some health-related issues.
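
The chick counts behind this remark can be checked directly; the following sketch uses n_distinct() to count the chicks per diet at days 0 and 21:

# Number of distinct chicks per diet at the start (day 0) and end (day 21)
CW %>% 
  dplyr::filter(Time %in% c(0, 21)) %>% 
  dplyr::group_by(Diet, Time) %>% 
  dplyr::summarise(Chicks = dplyr::n_distinct(Chick), .groups = "drop")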

1.5 Exercises

Prepare the following exercises of Chapter 2 in our course textbook ISLR:

  • Exercise 7
  • Exercise 8
  • Exercise 9

1.5.1 Solutions

Exercise 7

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable. Suppose we wish to use this data set to make a prediction for \(Y\) when \(X_1 = X_2 = X_3 = 0\) using K-nearest neighbors.

Obs. \(X_1\) \(X_2\) \(X_3\) \(Y\)
1 0 3 0 Red
2 2 0 0 Red
3 0 1 3 Red
4 0 1 2 Green
5 −1 0 1 Green
6 1 1 1 Red

7. a) Compute the Euclidean distance between each observation and the test point, \(X_1 = X_2 = X_3 = 0\).

Answer:

# Outcome
Y    <- c("red", "red", "red", "green", "green", "red")
# Predictor values
obs1 <- c( 0, 3, 0)
obs2 <- c( 2, 0, 0)
obs3 <- c( 0, 1, 3)
obs4 <- c( 0, 1, 2)
obs5 <- c(-1, 0, 1)
obs6 <- c( 1, 1, 1)

# Test Point
obs0 <- c(0, 0, 0)

# Create a vector Dist to store the results
Dist <- numeric(length = 6)

# Compute and store the Euclidean distances
Dist[1] <- sqrt(sum((obs1-obs0)^2)) 
Dist[2] <- sqrt(sum((obs2-obs0)^2)) 
Dist[3] <- sqrt(sum((obs3-obs0)^2)) 
Dist[4] <- sqrt(sum((obs4-obs0)^2)) 
Dist[5] <- sqrt(sum((obs5-obs0)^2)) 
Dist[6] <- sqrt(sum((obs6-obs0)^2))  

# Print the results
Dist
[1] 3.000000 2.000000 3.162278 2.236068 1.414214 1.732051
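
The same distances can also be computed in one go by stacking the observations into a matrix; the following is just a sketch reusing the objects defined above:

# Stack the six observations row-wise and compute all Euclidean distances at once
X <- rbind(obs1, obs2, obs3, obs4, obs5, obs6)
sqrt(rowSums(sweep(X, 2, obs0)^2))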

7. b) What is your prediction with \(K = 1\)? Why?

Answer:

which.min(Dist)
[1] 5
Y[which.min(Dist)]
[1] "green"

The closest \(K=1\) neighbor is obs5 and thus our prediction is Green, because Green is the \(Y\) value associated with obs5.

7. c) What is your prediction with \(K = 3\)? Why?

Answer:

order(Dist)[1:3]
[1] 5 6 2
Y[order(Dist)[1:3]]
[1] "green" "red"   "red"  

Closest \(K=3\) neighbors are:

  • obs5 with \(Y=\)green
  • obs6 with \(Y=\)red
  • obs2 with \(Y=\)red

This leads to the following local (at \(X_1=X_2=X_3=0)\) relative frequencies:

  • for green: \(1/K=1/3\)
  • for red: \(2/K=2/3\)

Thus, our prediction for \(X_1=X_2=X_3=0\) is Red.
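
The same majority vote can be carried out in code; a sketch reusing Dist and Y from above:

# Classes of the K = 3 nearest neighbours and the majority vote
K  <- 3
nn <- order(Dist)[1:K]
table(Y[nn])                    # class counts among the neighbours
names(which.max(table(Y[nn])))  # predicted class: "red"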

7. d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for \(K\) to be large or small? Why?

Answer:

In the case of a highly nonlinear decision boundary, the neighborhoods of observations with similar \(Y\)-values close to the boundary are generally small. Therefore, \(K\) should also be chosen relatively small, so that the local majority vote uses only nearby points and can follow the nonlinear decision boundary; a large \(K\) would average over points from both sides of the boundary and smooth it out.

Exercise 8:

This exercise relates to the College data set, which can be found in the file College.csv (LINK-TO-DATA). It contains a number of variables for \(777\) different universities and colleges in the US. The variables are:

  • Private : Public/private indicator
  • Apps : Number of applications received
  • Accept : Number of applicants accepted
  • Enroll : Number of new students enrolled
  • Top10perc : New students from top 10% of high school class
  • Top25perc : New students from top 25% of high school class
  • F.Undergrad : Number of full-time undergraduates
  • P.Undergrad : Number of part-time undergraduates
  • Outstate : Out-of-state tuition
  • Room.Board : Room and board costs
  • Books : Estimated book costs
  • Personal : Estimated personal spending
  • PhD : Percent of faculty with Ph.D.’s
  • Terminal : Percent of faculty with terminal degree
  • S.F.Ratio : Student/faculty ratio
  • perc.alumni : Percent of alumni who donate
  • Expend : Instructional expenditure per student
  • Grad.Rate : Graduation rate

8. a) Use the read.csv() function to read the data into R. Call the loaded data college. Make sure that you have the directory set to the correct location for the data.

Answer:

# Store data into dataframe college
college <- read.csv("DATA/College.csv")

# Print first 10 rows and first 5 columns of the data
college[c(1:10),c(1:5)]
                              X Private Apps Accept Enroll
1  Abilene Christian University     Yes 1660   1232    721
2            Adelphi University     Yes 2186   1924    512
3                Adrian College     Yes 1428   1097    336
4           Agnes Scott College     Yes  417    349    137
5     Alaska Pacific University     Yes  193    146     55
6             Albertson College     Yes  587    479    158
7       Albertus Magnus College     Yes  353    340    103
8                Albion College     Yes 1899   1720    489
9              Albright College     Yes 1038    839    227
10    Alderson-Broaddus College     Yes  582    498    172

8. b) Look at the data using the fix() function.

Answer:

The fix() command allows you to view the data, and to change values in it, in a spreadsheet-like window.

We can do some additional data-wrangling:
You should notice that the first column is just the name of each university. We don’t really want R to treat this as data. However, it may be handy to have these names for later. Try the following commands:

# Store row names
rownames(college) <- college[,1]

# pops up a window for data visualization
# fix(college)

# Alternatively you can use: 
# View(college)

You should see that there is now a row.names column with the name of each university recorded. This means that R has given each row a name corresponding to the appropriate university. R will not try to perform calculations on the row names. However, we still need to eliminate the first column in the data where the names are stored. Try:

# Eliminates first column (containing the row names)
college <- college[,-1]
# fix(college)

Now you should see that the first data column is Private. Note that another column labeled row.names now appears before the Private column. However, this is not a data column but rather the name that R is giving to each row.

8. c. i) Use the summary() function to produce a numerical summary of the variables in the data set.

Answer:

summary(college[, 1:5])
   Private               Apps           Accept          Enroll    
 Length:777         Min.   :   81   Min.   :   72   Min.   :  35  
 Class :character   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242  
 Mode  :character   Median : 1558   Median : 1110   Median : 434  
                    Mean   : 3002   Mean   : 2019   Mean   : 780  
                    3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902  
                    Max.   :48094   Max.   :26330   Max.   :6392  
   Top10perc    
 Min.   : 1.00  
 1st Qu.:15.00  
 Median :23.00  
 Mean   :27.56  
 3rd Qu.:35.00  
 Max.   :96.00  

8. c. ii) Use the pairs() function to produce a scatterplot matrix of the 2nd to 10th columns (variables) of the data. Recall that you can reference the 2nd to 10th columns of a matrix A using A[,2:10].

Answer:

pairs(x = college[,2:10])

8. c. iii) Use the boxplot() function to produce side-by-side boxplots of Outstate versus Private.

Answer:

boxplot(Outstate~Private, 
        data = college, 
        xlab = "Private", 
        ylab = "Outstate")

Outstate-Variable explained: The Outstate-variable is the tuition fee charged from non-resident students. (Many US colleges charge high tuition fees from non-resident students.)

8. c. iv) Create a new qualitative variable, called Elite, by binning the Top10perc variable. We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

# Create a vector called Elite containing only "No" entries, one per row of college
Elite <- rep("No",nrow(college))

# Replacing "No" with "Yes" if the proportion of students coming from the top 10% of their HS classes exceeds 50%.
Elite[college$Top10perc > 50] <- "Yes"

# Encode a vector as a factor
Elite <- as.factor(Elite)

# Add Elite variable to our current dataset "college"
college <- data.frame(college, Elite)

Use the summary() function to see how many elite universities there are. Now use the boxplot() function to produce side-by-side boxplots of Outstate versus Elite.

Answer:

summary(college$Elite)
 No Yes 
699  78 

There are \(78\) elite universities. The boxplots of Outstate versus Elite status are generated as follows:

boxplot(Outstate ~ Elite, 
        data = college, xlab="Elite", ylab="Outstate")

8. c. v) Use the hist() function to produce some histograms with differing numbers of bins for a few of the quantitative variables. You may find the command par(mfrow=c(2,2)) useful: it will divide the print window into four regions so that four plots can be made simultaneously. Modifying the arguments to this function will divide the screen in other ways.

Answer:

par(mfrow=c(2,2))
hist(college$Apps,     breaks=50, xlim=c(0,25000), 
     xlab = "", main="Apps")
hist(college$Enroll,   breaks=25, xlab = "", main="Enroll")
hist(college$Expend,   breaks=25, xlab = "", main="Expend")
hist(college$Outstate,            xlab = "", main="Outstate")

Remember:

  • Enroll: Number of new students enrolled
  • Expend: Instructional expenditure per student
  • Outstate: Out-of-state tuition

Exercise 9:

This exercise involves the Auto data set. Make sure that the missing values have been removed from the data.

# Store data into dataframe Auto
Auto <- read.csv("DATA/Auto.csv", header=T, na.strings="?")

# Remove missing values from the data
Auto <- na.omit(Auto)

# Print first 10 rows of the data
print(Auto[c(1:10),])
   mpg cylinders displacement horsepower weight acceleration year origin
1   18         8          307        130   3504         12.0   70      1
2   15         8          350        165   3693         11.5   70      1
3   18         8          318        150   3436         11.0   70      1
4   16         8          304        150   3433         12.0   70      1
5   17         8          302        140   3449         10.5   70      1
6   15         8          429        198   4341         10.0   70      1
7   14         8          454        220   4354          9.0   70      1
8   14         8          440        215   4312          8.5   70      1
9   14         8          455        225   4425         10.0   70      1
10  15         8          390        190   3850          8.5   70      1
                        name
1  chevrolet chevelle malibu
2          buick skylark 320
3         plymouth satellite
4              amc rebel sst
5                ford torino
6           ford galaxie 500
7           chevrolet impala
8          plymouth fury iii
9           pontiac catalina
10        amc ambassador dpl
# Find more info on the variables here:
# library(ISLR2)
# ?Auto
  • mpg: miles per gallon
  • cylinders: Number of cylinders between 4 and 8
  • displacement: Engine displacement (cu. inches)
  • horsepower: Engine horsepower
  • weight: Vehicle weight (lbs.)
  • acceleration: Time to accelerate from 0 to 60 mph (sec.)
  • year: Model year (modulo 100)
  • origin: Origin of car (1. American, 2. European, 3. Japanese)
  • name: Vehicle name

9. a) Which of the predictors are quantitative, and which are qualitative?

Answer:

# Summarize dataset
summary(Auto)
      mpg          cylinders      displacement     horsepower        weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
 Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
 Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
  acceleration        year           origin          name          
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
 1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
 Median :15.50   Median :76.00   Median :1.000   Mode  :character  
 Mean   :15.54   Mean   :75.98   Mean   :1.577                     
 3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
 Max.   :24.80   Max.   :82.00   Max.   :3.000                     
  • Quantitative predictors: mpg, cylinders, displacement, horsepower, weight, acceleration, year
  • Qualitative predictors: name, origin
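
If we want R to treat origin explicitly as qualitative, it can be recoded as a factor. The sketch below works on a copy (called AutoF here, an arbitrary name) so that the remaining code in this exercise, which expects the original numeric coding, is unaffected; the labels follow the variable description above:

# Recode origin as a factor in a copy of the data
AutoF <- Auto
AutoF$origin <- factor(AutoF$origin,
                       levels = c(1, 2, 3),
                       labels = c("American", "European", "Japanese"))
summary(AutoF$origin)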

9. b) What is the range of each quantitative predictor? You can answer this using the range() function.

Answer:

# apply the range function to the first seven columns of Auto
ranges <- sapply(Auto[, 1:7], range)
# print to console
ranges
      mpg cylinders displacement horsepower weight acceleration year
[1,]  9.0         3           68         46   1613          8.0   70
[2,] 46.6         8          455        230   5140         24.8   82
# adding row names:
row.names(ranges) <- c("min", "max")
ranges
     mpg cylinders displacement horsepower weight acceleration year
min  9.0         3           68         46   1613          8.0   70
max 46.6         8          455        230   5140         24.8   82

9. c) What is the mean and standard deviation of each quantitative predictor?

Answer:

# compute the means and round to 2 decimal places 
mean <- sapply(Auto[,1:7], mean)
mean <- sapply(mean, round, 2)

# compute the standard deviations (sd) and round to 2 decimal places 
sd <- sapply(Auto[, 1:7], sd)
sd <- sapply(sd,round,2)

# print mean values 
mean
         mpg    cylinders displacement   horsepower       weight acceleration 
       23.45         5.47       194.41       104.47      2977.58        15.54 
        year 
       75.98 
# print sd values 
sd
         mpg    cylinders displacement   horsepower       weight acceleration 
        7.81         1.71       104.64        38.49       849.40         2.76 
        year 
        3.68 

9. d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

Answer:

# remove observations and store them 
newAuto = Auto[-(10:85),]

# Re-do exercises 9. b) and 9.c)
# This time, create an empty Matrix "Results" to store the results
Results <- matrix(NA, nrow = 4, ncol = 7, 
                  dimnames = list(c("Mean", "SD", "Min", "Max"),# row names 
                                  c(colnames(newAuto[,1:7])))) # column names

# Compute and store the results
Results[1,] <- sapply(newAuto[, 1:7], mean)
Results[2,] <- sapply(newAuto[, 1:7], sd)  
Results[3,] <- sapply(newAuto[, 1:7], min)
Results[4,] <- sapply(newAuto[, 1:7], max)

# Round to 0 decimal and print
round(Results, digits = 0)
     mpg cylinders displacement horsepower weight acceleration year
Mean  24         5          187        101   2936           16   77
SD     8         2          100         36    811            3    3
Min   11         3           68         46   1649            8   70
Max   47         8          455        230   4997           25   82

9. e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

Answer:

pairs(Auto[, -9]) # remove the character-variable 'name'

Findings (a numerical check follows this list):

  • higher weight is associated with lower mpg
  • higher weight is associated with higher displacement
  • higher weight is associated with higher horsepower
  • higher horsepower is associated with lower acceleration
  • later model years are associated with higher mpg
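
These visual impressions can be backed up numerically, for example with the pairwise correlations of the quantitative variables (a sketch):

# Pairwise correlations of the quantitative variables, rounded to 2 decimals
round(cor(Auto[, 1:7]), 2)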

9. f) Suppose that we wish to predict gas mileage (mpg) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting mpg? Justify your answer.

Answer:

Yes, there are multiple potentially useful predictor variables.

All of the quantitative variables show some sort of relation (either linear or non-linear) with mpg and hence, they might be useful in predicting mpg.

Moreover, the qualitative variable origin might also be useful in predicting mpg: cars originating from regions 1, 2 and 3 are associated with lower, intermediate and higher mpg values, respectively.
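
A quick graphical check of this pattern (a sketch) is a boxplot of mpg by origin:

# Side-by-side boxplots of mpg for the three origin groups
boxplot(mpg ~ origin, data = Auto,
        xlab = "Origin (1 = American, 2 = European, 3 = Japanese)",
        ylab = "mpg")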