The authors of ISLR frame Statistical Learning as a set of approaches for estimating \(f\) in the equation

\begin{equation} Y = f(X) + \epsilon \tag{1} \end{equation}

where \(X\) is a vector of input variables associated with some output \(Y\), \(f\) represents the systematic information that \(X\) provides about \(Y\), and \(\epsilon\) (epsilon) is a random error term, which is independent of \(X\) and has mean zero.
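As a small illustration of equation (1), and not something taken from ISLR, the sketch below simulates data from a made-up \(f\). The linear form of `f`, the noise scale, and the sample size are all assumptions chosen for the example; in practice \(f\) is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical "true" systematic component; unknown in a real problem.
    return 2.0 + 3.0 * x

n = 10_000
X = rng.uniform(0.0, 1.0, size=n)
eps = rng.normal(loc=0.0, scale=0.5, size=n)  # mean-zero noise, drawn independently of X
Y = f(X) + eps

# Because the noise has mean zero, Y - f(X) averages out to roughly zero.
residual_mean = float(np.mean(Y - f(X)))
```

With a large sample, `residual_mean` should sit close to zero, which is what "mean zero" means for \(\epsilon\).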

They provide a brief history of the field:

  • Early 19th century: Legendre and Gauss published papers on the method of least squares, implementing the earliest form of linear regression, used for predicting quantitative values
  • 1936: In order to predict categorical (qualitative) values, Fisher proposed linear discriminant analysis
  • 1940s: Various authors put forth an alternative approach called logistic regression
  • Early 1970s: Nelder and Wedderburn coined the term generalized linear models for an entire class of statistical learning methods that include both linear and logistic regression as special cases
  • Mid 1980s: Breiman, Friedman, Olshen and Stone introduced classification and regression trees
  • 1986: Hastie and Tibshirani coined the term generalized additive models for a class of non-linear extensions to generalized linear models

Why Estimate \(f\)?

There are two main reasons to estimate \(f\):

  • prediction, accurately predicting the response for future observations, and
  • inference, better understanding the relationship between the response and the predictors.

Prediction

In many situations a set of inputs \(X\) is readily available, but the output \(Y\) cannot be easily obtained. Since \(\epsilon\) averages to zero, we can predict \(Y\) using

\begin{equation} \hat Y = \hat f(X), \tag{2} \end{equation}

where \(\hat f\) is our estimate of \(f\) and \(\hat Y\) is the resulting prediction for \(Y\). Here we don’t care whether \(\hat f\) matches \(f\) exactly, so long as it accurately predicts \(Y\).

The difference between \(\hat Y\) and \(Y\) is composed of reducible and irreducible errors. We can reduce the error by selecting a more appropriate statistical learning technique, but even if we were to recover \(f\) perfectly there would still be some error in the prediction due to \(\epsilon\), which in this context I understand to represent those features (variables) that are unmeasured and unaccounted for in our model, but that could still influence the response. My intuition is that we can’t measure everything that influences a given phenomenon, as that would involve “measuring” the entire universe, hence \(\epsilon\) and “modeling.”

The irreducible error places an upper bound on the accuracy of predictions generated by the models we create, and this bound is almost always unknown in practice.
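To see where the two error terms come from, hold \(\hat f\) and \(X\) fixed and expand the expected squared prediction error using equations (1) and (2). This is the standard decomposition, sketched here under the assumption that \(\epsilon\) has mean zero and is independent of \(X\):

\begin{equation} E(Y - \hat Y)^2 = E[f(X) + \epsilon - \hat f(X)]^2 = \underbrace{[f(X) - \hat f(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}} \end{equation}

The cross term \(2[f(X) - \hat f(X)]E(\epsilon)\) vanishes because \(E(\epsilon) = 0\), and \(E(\epsilon^2) = \mathrm{Var}(\epsilon)\) for the same reason.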

Inference

Sometimes we just want to know how the individual components \(X_1, X_2, \ldots, X_p\) of \(X\) affect \(Y\). In this case \(\hat f\) cannot be treated as a black box; we need to consider its exact form. Inference might attempt to answer questions such as:

  • Which predictors (features) are associated with the response? Some predictors might have a negligible effect on it.
  • What is the relationship between the response and each predictor? (e.g. positive or negative, possibly depending on the values of other predictors)
  • Can the relationship between \(Y\) and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Prediction vs Inference

There seems to be a tradeoff between model accuracy and model interpretability. Linear models are simple and allow for interpretable inference, but may yield less accurate predictions. Non-linear approaches may yield more accurate predictions, but the resulting models are less interpretable and inference is more challenging than with linear models.

Types of Statistical Methods

Parametric vs Non-parametric Methods

Most statistical methods for finding the function \(\hat f\) can be characterized as parametric or non-parametric.

Parametric Methods

Parametric methods follow a two-step process:

  1. Make an assumption about the form of \(f\), e.g. a linear model
  2. After a model has been selected, use a procedure to estimate the parameters \(\beta_0, \beta_1, \ldots, \beta_p\) such that \begin{equation} Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p, \tag{3} \end{equation} or in other words, fit or train the model using the available data. The most common approach to fitting the linear model is called ordinary least squares.
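The two steps above can be sketched with NumPy. This is a minimal example on synthetic data; the coefficient values in `true_beta`, the noise scale, and the sample size are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with hypothetical coefficients beta_0 = 1, beta_1 = 2, beta_2 = -3.
n, p = 200, 2
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 2.0, -3.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Step 1: assume a linear form for f.
# Step 2: estimate the coefficients by ordinary least squares.
# A leading column of ones makes the first coefficient the intercept beta_0.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With low noise and a correctly specified linear form, `beta_hat` should land close to `true_beta`. If the true \(f\) were not linear, the same code would still run but the estimate would be systematically off.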

It is easier to estimate a set of parameters \(\beta\) than to fit an arbitrary function \(f\). However, if the chosen model is too far from the true \(f\), our estimate will be poor. We can choose a more “flexible” model (I think this means choosing, for example, a higher-order polynomial model?), but this (a) requires fitting a greater number of parameters and (b) can lead to overfitting the data, such that the model follows the error, or noise, too closely.

Non-parametric Methods

Non-parametric methods make no explicit assumption about the functional form of \(f\) and thus can potentially fit a wider range of shapes. Unlike parametric methods, they run no risk of assuming a form for \(f\) that is completely different from the true one. Because the problem is not reduced to estimating a small number of parameters, however, far more observations are typically needed to estimate \(f\) accurately. One example is the thin-plate spline, which estimates an \(f\) that is as close as possible to the observed data while remaining smooth. In this approach we must select a level of “smoothness”; too low a level leads to overfitting (and therefore poor generalization). There are methods for selecting an appropriate amount of smoothness.
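A thin-plate spline needs more machinery than fits in a short snippet, so the sketch below swaps in a different non-parametric method, \(K\)-nearest neighbors regression, to show the same idea. Here `k` plays the role of the smoothness setting. The function `knn_regress` and the sine-shaped data are assumptions invented for the example.

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Predict at each query point by averaging the k nearest training responses."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    # Pairwise absolute distances between query and training inputs (1-D inputs only).
    dists = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(2)
x_train = rng.uniform(0.0, 2.0 * np.pi, size=300)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=300)  # hypothetical f is sin

x_query = np.array([np.pi / 2, 3 * np.pi / 2])
smooth_fit = knn_regress(x_train, y_train, x_query, k=20)  # averages many neighbors
rough_fit = knn_regress(x_train, y_train, x_train, k=1)    # each point predicts itself
```

No functional form is assumed anywhere. With `k=1` the fit reproduces the training responses, noise included, which is the overfitting end of the scale; a larger `k` gives a smoother estimate.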

Supervised vs Unsupervised Learning

Everything discussed above falls under supervised learning, where each observation \(x_i\) has an associated response \(y_i\). Many classical statistical learning methods, such as linear regression and logistic regression, as well as more modern approaches such as GAMs, boosting, and support vector machines, are supervised methods.

Unsupervised learning is a more challenging domain where for every observation \(i = 1, \ldots, n\), we observe a vector of measurements \(x_i\) but don’t have an associated response \(y_i\). One tool we can apply in this situation is cluster analysis, in which we try to separate the data into discernible groups. With two-dimensional inputs we can often spot the groups by plotting the data, but for higher-dimensional inputs that we cannot visualize easily, clustering algorithms are necessary.
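As one possible illustration of cluster analysis on inputs we cannot plot directly, here is a minimal sketch of k-means (Lloyd's algorithm) on five-dimensional data. The choice of k-means, the farthest-first initialization, and the two synthetic groups are assumptions made for the example.

```python
import numpy as np

def farthest_first_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from those chosen."""
    chosen = [0]
    for _ in range(k - 1):
        diffs = points[:, None, :] - points[chosen][None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(d2.argmax()))
    return points[chosen].copy()

def kmeans(points, k, n_iter=20):
    """Alternate between assigning points to centroids and recomputing the centroids."""
    centroids = farthest_first_init(points, k)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

rng = np.random.default_rng(6)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 5))   # hypothetical group near 0
group_b = rng.normal(loc=10.0, scale=1.0, size=(100, 5))  # hypothetical group near 10
points = np.vstack([group_a, group_b])

labels, centroids = kmeans(points, k=2)
```

No response \(y_i\) is used at any point; the grouping comes from the \(x_i\) alone. These two groups are deliberately well separated, so real data will rarely be this clean.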

There are also semi-supervised learning problems where we have an associated response \(y_i\) for some but not all \(x_i\).

Regression vs Classification

We tend to refer to problems with a quantitative response variable as regression and those with a qualitative (categorical) response variable as classification. We tend to select methods based on whether the response is quantitative or qualitative. Some methods are particularly suited to a quantitative response variable (e.g. linear regression); others to a qualitative response (e.g. logistic regression); and others are suitable for both (e.g. \(K\)-nearest neighbors and boosting). Whether the inputs are quantitative or qualitative is mostly not an issue, provided that the qualitative variables are properly coded before analysis is performed.
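One common way to "properly code" a qualitative input is dummy (indicator) coding. The sketch below is a hand-rolled version for illustration; the helper `dummy_code` and the `region` variable are hypothetical names, and treating the first sorted level as the baseline is an assumption.

```python
import numpy as np

def dummy_code(values):
    """Encode a qualitative variable as 0/1 columns, using the first sorted level as baseline."""
    levels = sorted(set(values))
    columns = levels[1:]  # the baseline level is represented by a row of zeros
    encoded = np.array(
        [[1.0 if v == level else 0.0 for level in columns] for v in values]
    )
    return encoded, columns

region = ["east", "west", "south", "east"]  # hypothetical qualitative predictor
encoded, columns = dummy_code(region)
# columns -> ["south", "west"]; "east" rows are all zeros
```

The resulting numeric columns can then sit alongside quantitative inputs in a design matrix.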

Measuring Quality of Fit

In a regression setting, a common measure of the quality of fit of a statistical model is mean squared error, given by the average of the squared differences between actual values and values predicted by our model: \begin{equation} MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat f(x_i))^2 .\tag{4} \end{equation}
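The mean squared error is short to compute. A minimal sketch, where the helper name `mse` and the three example values are made up:

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Squared errors are 0, 0.25 and 1, so the result is 1.25 / 3.
example = mse([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
```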

We are particularly interested in MSE for data our model has never seen. To estimate it, we separate our dataset into training and test portions. The training portion is used to fit the model, while the test portion is used to assess the quality of the trained model.

Overfitting

Degrees of freedom is a quantity that summarizes the amount of flexibility of a curve. In model selection we should be aware that as the degrees of freedom in our model increase, the training MSE will continue to decrease, while the test MSE is likely to decrease only until it reaches the minimum of a characteristic U-shaped curve. Beyond that point we are overfitting the data, yielding a small training MSE but an increasing test MSE. This happens because our statistical learning procedure finds patterns in the training data that are caused by random chance rather than by true properties of the unknown \(f\).
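The behavior of training and test MSE can be explored with a small simulation. This sketch uses polynomial degree as a stand-in for flexibility; the sine-shaped `f`, the noise level, and the sample sizes are assumptions chosen for the example, and results on other data will differ.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(3)

def f(x):
    return np.sin(2.0 * x)  # hypothetical true f

def simulate(n):
    x = rng.uniform(-1.5, 1.5, size=n)
    return x, f(x) + rng.normal(scale=0.3, size=n)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = simulate(30)    # small training set
x_test, y_test = simulate(2000)    # large held-out test set

degrees = list(range(1, 11))
train_mse, test_mse = [], []
for d in degrees:
    coefs = P.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse.append(mse(y_train, P.polyval(x_train, coefs)))
    test_mse.append(mse(y_test, P.polyval(x_test, coefs)))
```

Because each polynomial contains the lower-degree ones as special cases, `train_mse` cannot go up as the degree grows. `test_mse` has no such guarantee, and plotting both lists against `degrees` is a quick way to look for the U shape.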

Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.

The particular data we are analyzing has great influence on the flexibility point at which test MSE is minimized. There are a variety of techniques to estimate this minimum point; an important one is cross-validation, a method for estimating test MSE using training data.
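A rough sketch of k-fold cross-validation, used here to compare polynomial degrees with training data only. The helper `kfold_mse`, the choice of five folds, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate test MSE of a polynomial fit by averaging held-out error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = P.polyfit(x[train_idx], y[train_idx], deg=degree)
        preds = P.polyval(x[val_idx], coefs)
        fold_errors.append(np.mean((y[val_idx] - preds) ** 2))
    return float(np.mean(fold_errors))

rng = np.random.default_rng(4)
x = rng.uniform(-1.5, 1.5, size=100)
y = np.sin(2.0 * x) + rng.normal(scale=0.3, size=100)  # hypothetical data

cv_scores = {d: kfold_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv_scores, key=cv_scores.get)
```

Every observation is held out exactly once, so each fold's error comes from data the fitted model has not seen. Using the same `seed` for every degree keeps the folds identical across the comparison.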

Bias-Variance Tradeoff

The expected test MSE for a given value \(x_0\) can be decomposed into the sum of three quantities

\begin{equation} E(y_0- \hat f(x_0))^2 = \mathrm{Var}(\hat f(x_0)) + [\mathrm{Bias}(\hat f(x_0))]^2 + \mathrm{Var}(\epsilon), \tag{5} \end{equation}

where \(E(y_0- \hat f(x_0))^2\) is the expected test MSE at \(x_0\), that is, the average test MSE that we would obtain if we repeatedly estimated \(f\) using a large number of training sets and tested each at \(x_0\) (from the test set). Variance is inherently non-negative, as is squared bias, so the expected test MSE can never be less than \(\mathrm{Var}(\epsilon)\).
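Equation (5) can be checked numerically by doing what the description says: refit on many training sets and test each fit at \(x_0\). In this sketch the true `f`, the noise level `sigma`, the point `x0`, and the use of a straight-line fit are all assumptions chosen for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(5)
sigma = 0.3  # assumed standard deviation of epsilon
x0 = 0.5     # point at which the fits are tested

def f(x):
    return np.sin(2.0 * x)  # hypothetical true f

def fit_and_predict(degree, n=30):
    """Draw a fresh training set, fit a polynomial, and return its prediction at x0."""
    x = rng.uniform(-1.5, 1.5, size=n)
    y = f(x) + rng.normal(scale=sigma, size=n)
    return P.polyval(x0, P.polyfit(x, y, deg=degree))

n_sets = 2000
preds = np.array([fit_and_predict(degree=1) for _ in range(n_sets)])
y0 = f(x0) + rng.normal(scale=sigma, size=n_sets)  # fresh test responses at x0

variance = preds.var()                      # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2       # squared bias at x0
expected_test_mse = np.mean((y0 - preds) ** 2)
decomposition = variance + bias_sq + sigma ** 2
```

Up to simulation noise, `expected_test_mse` and `decomposition` should agree. For a straight line fitted to a sine curve, most of the reducible part should come from `bias_sq` rather than `variance`.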

Variance

Variance refers to the amount by which \(\hat f\) would change if we estimated it using a different training data set. A more flexible model will have high variance because it follows the specific points in a given dataset closely, so swapping those points out would significantly change the fit. A less flexible model, such as a linear one, will move only slightly if we swap some training data out. Ideally \(\hat f\) should not vary too much between training sets.

  • Generally, more-flexible methods have higher variance.

Bias

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. The more non-linear the \(f\) we are approximating, the more bias a more-linear model will have – no matter how much data we feed into the model, it will be unable to produce a \(\hat f \approx f\).

  • Generally, more-flexible methods result in less bias.

The U-Shape of the Test MSE Explained

Generally, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens the test MSE increases.