Chapter 3: Predictive Model Building: Balancing Performance, Complexity, and Big Data

"""Achieving performance goals involves three factors

  • complexity of the problem
  • complexity of the algorithmic model employed
  • amount and richness of the data available

The Basic Problem: Understanding Function Approximation p.76

Intro (pp. 76-79):

  • Notation: matrix notation, X, Y
    • math indexing (starts at 1) vs. Python indexing (starts at 0)
  • Regression: MSE, MAE
  • misclassification error (Bowles p. 79): what exactly does Bowles mean by this?

function approximation problem

  • Matrix X: predictors, regressors, features, attributes, independent variables
  • Vector Y: target, label, outcome, dependent variable

feature engineering

  • Determining what attributes to use for making predictions
  • Data cleaning and feature engineering take 80 percent to 90 percent of a data scientist’s time

Working with Training Data

x_i (with a single index) will refer to the ith row of X

  • x_2 would be a row vector containing the values F, 250, 32.
  • JB: indexing starts at 0 in Python, but at 1 in mathematics (and also in Bowles's formulas); see the short sketch below
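A minimal sketch of the indexing point, using a small made-up attribute matrix (the row values F, 250, 32 follow the book's example; everything else is invented):

    # Small made-up attribute matrix; each inner list is one row of X.
    X = [["M", 180, 27],
         ["F", 250, 32],
         ["M", 210, 41]]

    # Math notation x_2 (1-based) means the second row; in Python that is X[1] (0-based).
    x_2 = X[1]
    print(x_2)   # ['F', 250, 32]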

The targets corresponding to each row in X are arranged in a column vector Y

Assessing Performance of Predictive Models p.76

"""regression problem ... describe the error as being the numeric difference between them

mean squared error (MSE)

mean absolute error (MAE)

"""classification problem

misclassification error

Factors Driving Algorithm Choices and Performance—Complexity and Data p.79

"""factors [which] affect the overall performance of a predictive algorithm

complexity of the problem

complexity of the model used

amount of training data available

Contrast Between a Simple Problem and a Complex Problem

out-of-sample error

problem complexity

complexity of the decision boundaries

mixture model

The points in Figure 3-2 are drawn from several distributions for the light points and several different ones for the dark points.
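Not the book's code, but a rough sketch of the idea behind Figure 3-2: each class is a mixture model, with its points drawn from several different distributions (all centers and scales below are invented):

    import numpy as np

    rng = np.random.default_rng(42)

    def mixture_class(centers, n_per_center):
        # Draw points from several Gaussian blobs (a simple mixture model).
        return np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_center, 2))
                          for c in centers])

    # Light points come from one set of distributions, dark points from a different set,
    # which is what makes the decision boundary between the classes complex.
    light = mixture_class(centers=[(0.0, 0.0), (3.0, 3.0), (0.0, 4.0)], n_per_center=50)
    dark = mixture_class(centers=[(3.0, 0.0), (1.5, 2.0), (4.0, 4.0)], n_per_center=50)

    X = np.vstack([light, dark])
    y = np.array([0] * len(light) + [1] * len(dark))   # 0 = light, 1 = dark
    print(X.shape, y.shape)                            # (300, 2) (300,)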

Contrast Between a Simple Model and a Complex Model

Figure 3-4: Linear model fit to simple data

Figure 3-5: Linear model fit to complex data

Figure 3-6: Ensemble model fit to complex data

Figure 3-7: Linear model fit to small sample of complex data

Figure 3-8: Ensemble model fit to small sample of complex data

Factors Driving Predictive Algorithm Performance

shape of the data

aspect ratio

In biology, genomic data sets can easily contain 10,000 to 50,000 attributes. Even with tens of thousands of individual experiments (rows of data), a genomic data set may not be enough to train a complex ensemble model. A linear model may give equivalent or better performance.

In some natural language processing problems, the attributes are words and the rows are documents. Entries in the matrix of attributes are the number of times a word appears in a document. The number of columns is the vocabulary size for a document collection. Depending on preprocessing (for example, removing common words such as "a", "and", and "of"), the vocabulary can range from a few thousand to a few tens of thousands of words. The attribute matrix for text becomes very wide when n-grams are counted alongside words.

Once again, a linear model may give equivalent or better performance than a more complicated ensemble model.
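A small sketch of the wide document-term matrix described above, using scikit-learn's CountVectorizer (the tiny corpus is invented; real collections produce thousands of columns):

    from sklearn.feature_extraction.text import CountVectorizer

    # Tiny invented corpus: rows are documents, columns are words.
    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "cats and dogs rarely share a mat"]

    # stop_words="english" drops common words such as "a", "and", and "of".
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())   # the vocabulary = column names
    print(X.toarray())                          # entries are word counts per document
    print(X.shape)                              # (number of documents, vocabulary size)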

Choosing an Algorithm: Linear or Nonlinear?

Linear models are preferable when the data set has more columns than rows or when the underlying problem is simple.

Nonlinear models are preferable for complex problems with many more rows than columns of data.

training time

  • Linear techniques train much faster than nonlinear techniques.

Measuring the Performance of Predictive Models p.88

Performance Measures for Different Types of Problems

Performance measures for regression problems

  • mean squared error (MSE)
  • mean absolute error (MAE)
  • root MSE (RMSE, which is the square root of MSE)
  • Listing 3-1: Comparison of MSE, MAE and RMSE—regressionErrorMeasures.py
  • variance (mean squared deviation from the mean)
  • standard deviation (square root of variance)
  • """For example, if the MSE of the prediction error is roughly the same as the target variance (or the RMSE is roughly the same as target standard deviation), the prediction algorithm is not performing well. You could replace the prediction algorithm with a simple calculation of the mean of the targets and perform as well.
  • """The errors in Listing 3-1 have RMSE that’s about half the standard deviation of the targets. That is fairly good performance.
  • histogram of the error
  • tail behavior (quantile or decile boundaries)
  • degree of normality
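Not the book's Listing 3-1, but a minimal sketch of the same comparison: compute MSE, MAE, and RMSE for a set of prediction errors, then compare RMSE against the standard deviation of the targets as the "predict the mean" baseline (all numbers are invented):

    import math

    # Invented targets and predictions.
    targets = [3.0, 5.0, 2.5, 7.0, 4.5]
    predictions = [2.5, 5.5, 2.0, 6.0, 5.0]

    errors = [t - p for t, p in zip(targets, predictions)]

    mse = sum(e * e for e in errors) / len(errors)       # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)      # mean absolute error
    rmse = math.sqrt(mse)                                # root mean squared error

    # Baseline: variance / standard deviation of the targets themselves.
    mean_target = sum(targets) / len(targets)
    target_var = sum((t - mean_target) ** 2 for t in targets) / len(targets)
    target_std = math.sqrt(target_var)

    print("MSE =", mse, " MAE =", mae, " RMSE =", rmse)
    # If RMSE is roughly equal to target_std, simply predicting the mean would do as well.
    print("target variance =", target_var, " target std dev =", target_std)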

Classification problems

  • misclassification error rates
  • Generally, algorithms for doing classification can present predictions in the form of a probability instead of a hard click versus not-click decision. The algorithms considered in this book all output probabilities ... the data scientist has the option to use 50 percent as a threshold
  • confusion matrix or contingency table
    • confusionMatrix() ... takes the predictions, the corresponding actual values (labels), and a threshold value as input
  • receiver operating characteristic (ROC)
    • The ROC curve plots the true positive rate (abbreviated TPR) versus the false positive rate (FPR).
  • area under the curve (AUC)
    • A perfect classifier has an AUC of 1.0
    • random guessing has an AUC of 0.5
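A minimal scikit-learn sketch of these classification measures; it is not the book's confusionMatrix() helper, and the labels and predicted probabilities below are invented:

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

    # Invented true labels and classifier output probabilities.
    labels = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    pred_probs = np.array([0.2, 0.6, 0.8, 0.4, 0.1, 0.9, 0.7, 0.3])

    # Turn probabilities into hard decisions at a 50 percent threshold.
    threshold = 0.5
    decisions = (pred_probs >= threshold).astype(int)

    # Confusion matrix / contingency table: rows are actual classes, columns are predictions.
    print(confusion_matrix(labels, decisions))

    # ROC curve: true positive rate (TPR) versus false positive rate (FPR) over all thresholds.
    fpr, tpr, thresholds = roc_curve(labels, pred_probs)

    # Area under the ROC curve: 1.0 is a perfect classifier, 0.5 is random guessing.
    print("AUC =", roc_auc_score(labels, pred_probs))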

Simulating Performance of Deployed Models

training set

test set

[validation set?]

  • used for n-fold cross-validation? (see the sketch below)
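A minimal sketch of holding out a test set and of n-fold cross-validation with scikit-learn (the data and the linear model are placeholders, not the book's examples):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold, train_test_split

    # Placeholder data: 100 rows, 5 attributes, a noisy linear target.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

    # Training set / test set: fit on one part, estimate deployed performance on the held-out part.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print("held-out MSE:", mean_squared_error(y_test, model.predict(X_test)))

    # n-fold cross-validation: every row serves as test data in exactly one fold.
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(folds.split(X)):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_mse = mean_squared_error(y[test_idx], model.predict(X[test_idx]))
        print(f"fold {fold}: MSE = {fold_mse:.4f}")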

CODE

  • Listing 3-1: Comparison of MSE, MAE and RMSE—regressionErrorMeasures.py
  • Listing 3-2: Measuring Performance for Classifier Trained on Rocks-Versus-Mines—classifierPerformance_RocksVMines.py
    • Figure 3-9: Confusion matrix example
    • Table 3-2: Dependence of Misclassification Error on Decision Threshold
    • Table 3-3: Cost of Mistakes for Different Decision Thresholds
    • Figure 3-10: In-sample ROC for rocks-versus-mines classifier
    • Figure 3-11: Out-of-sample ROC for rocks-versus-mines classifier

Achieving Harmony Between Model and Data p.101

Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size

Using Forward Stepwise Regression to Control Overfitting

algorithm for best subset selection

impose a constraint (say nCol) on the number of columns

Figure 3-13: Wine quality prediction error using forward stepwise regression
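Not Bowles's Listing 3-3, but a rough sketch of forward stepwise regression: start with no columns, repeatedly add the single attribute that most reduces held-out error, and stop at a cap on the number of columns (the cap is called n_col_cap here; the data is a placeholder, not the wine data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def forward_stepwise(X, y, n_col_cap):
        # Greedy forward selection: add one column at a time, keeping the best set found so far.
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
        chosen, errors = [], []
        while len(chosen) < n_col_cap:
            best_err, best_col = None, None
            for col in range(X.shape[1]):
                if col in chosen:
                    continue
                cols = chosen + [col]
                model = LinearRegression().fit(X_train[:, cols], y_train)
                err = mean_squared_error(y_test, model.predict(X_test[:, cols]))
                if best_err is None or err < best_err:
                    best_err, best_col = err, col
            chosen.append(best_col)
            errors.append(best_err)
        return chosen, errors

    # Placeholder data; run on the wine data, the error-versus-number-of-columns curve corresponds to Figure 3-13.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))
    y = X[:, 0] * 2.0 - X[:, 3] + rng.normal(scale=0.5, size=200)
    print(forward_stepwise(X, y, n_col_cap=4))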

Evaluating and Understanding Your Predictive Model

Several other plots are helpful in understanding the performance of a trained algorithm and can point the way to making improvements in its performance.

  • Figure 3-14: Actual taste scores versus predictions generated with forward stepwise regression
  • Figure 3-15: Histogram of wine taste prediction error with forward stepwise regression

The number of attributes to be incorporated in the solution can be called a complexity parameter. Models with larger complexity parameters have more free parameters and are more likely to overfit the data than less-complex models.

Control Overfitting by Penalizing Regression Coefficients—Ridge Regression

first introduction to penalized linear regression

coefficient penalized regression [...] making all the coefficients smaller instead of making some of them zero.

Equation 3-15: Ridge regression minimization problem

  • [ridge penalty:] the β^T·β term is the square of the Euclidean norm of β (the vector of coefficients)
  • If α = 0, the problem becomes ordinary least squares regression. When α becomes large, β (the vector of coefficients) approaches zero, and only the constant term β_0 is available to predict the labels y_i
  • see also "Equation 4-7: Penalty applied to coefficients (betas)" (p. 128); a worked form of the minimization problem is sketched below
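A standard way to write the ridge minimization problem, reconstructed here from the bullets above (the book's Equation 3-15 may scale the sum-of-squares term differently, for example by 1/m):

    β_0*, β* = argmin over (β_0, β) of  Σ_i (y_i − (β_0 + x_i·β))² + α·β^T·β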

Listing 3-5: Predicting Wine Taste with Ridge Regression—ridgeWine.py

  • alphaList = [0.1**i for i in [0, 1, 2, 3, 4, 5, 6]]
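Not the book's ridgeWine.py, but a minimal scikit-learn sketch of sweeping the same alphaList and tracking held-out RMSE (placeholder data stands in for the wine attributes and taste scores):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the wine attributes and taste scores.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 10))
    y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=300)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Same sweep as Listing 3-5: alpha = 1, 0.1, 0.01, ..., 1e-6.
    alphaList = [0.1 ** i for i in range(7)]
    for alpha in alphaList:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        # Plotting rmse against -log10(alpha) gives a curve like Figure 3-16.
        print(f"alpha = {alpha:g}   -log10(alpha) = {-np.log10(alpha):g}   RMSE = {rmse:.4f}")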

Figure 3-16: Wine quality prediction error using ridge regression

  • x-axis: -log(alpha)
  • y-axis: RMS error

Figure 3-17: Actual taste scores versus predictions generated with ridge regression

Figure 3-18: Histogram of wine taste prediction error with ridge regression

CODE

  • Listing 3-3: Forward Stepwise Regression: Wine Quality Data—fwdStepwiseWine.py
    • Figure 3-13: Wine quality prediction error using forward stepwise regression
    • Listing 3-4: Forward Stepwise Regression Output—fwdStepwiseWineOutput.txt
    • Figure 3-14: Actual taste scores versus predictions generated with forward stepwise regression
    • Figure 3-15: Histogram of wine taste prediction error with forward stepwise regression
  • Listing 3-5: Predicting Wine Taste with Ridge Regression—ridgeWine.py
    • Figure 3-16: Wine quality prediction error using ridge regression
    • Listing 3-6: Ridge Regression Output—ridgeWineOutput.txt
    • Figure 3-17: Actual taste scores versus predictions generated with ridge regression
    • Figure 3-18: Histogram of wine taste prediction error with ridge regression
  • Listing 3-7: Rocks Versus Mines Using Ridge Regression—classifierRidgeRocksVMines.py
    • Listing 3-8: Output from Classification Model for Rocks Versus Mines Using Ridge Regression—classifierRidgeRocksVMinesOutput.txt
    • Figure 3-19: AUC for the rocks-versus-mines classifier using ridge regression
    • Figure 3-20: Plot of actual versus prediction for the rocks-versus-mines classifier using ridge regression