Chapter 3: Predictive Model Building: Balancing Performance, Complexity, and Big Data
"""Achieving performance goals involves three factors
- complexity of the problem
- complexity of the algorithmic model employed
- amount and richness of the data available
The Basic Problem: Understanding Function Approximation p.76
Intro (pp. 76-79):
- Notation: matrix notation, X, Y
- math indexing (starts at 1) vs. Python indexing (starts at 0)
- Regression: MSE, MAE
- misclassification error (Bowles p. 79): What exactly does Bowles mean by this?
- Homework: work it out and state it as a formula in terms of the concepts in https://en.wikipedia.org/wiki/Confusion_matrix (a candidate formula is sketched below)
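A hedged candidate answer to the homework item, in standard confusion-matrix terms (whether this matches Bowles' exact usage on p. 79 still needs to be checked against the book):

\text{misclassification error} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{accuracy}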
function approximation problem
- Matrix X: predictors, regressors, features, attributes, independent variables
- Vector Y: target, label, outcome, dependent variable
feature engineering
- Determining what attributes to use for making predictions
- Data cleaning and feature engineering take 80 percent to 90 percent of a data scientist’s time
Working with Training Data
x_i (with a single index) will refer to the ith row of X
- x_2 would be a row vector containing the values F, 250, 32.
- JB: indexing starts at 0 in Python, but at 1 in mathematics (and in Bowles' formulas); see the sketch after this list
The targets corresponding to each row in X are arranged in a column vector Y
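A minimal indexing sketch in Python, using the example row quoted above (the column meanings are only a guess at Bowles' attribute table; the other rows and the variable names are mine):

# each inner list is one row x_i of the attribute matrix X
X = [["M", 180, 45],
     ["F", 250, 32],
     ["M", 160, 28]]

x_2 = X[1]    # the book's 1-based x_2 is X[1] under Python's 0-based indexing
print(x_2)    # ['F', 250, 32]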
Assessing Performance of Predictive Models p.76
"""regression problem ... describe the error as being the numeric difference between them
mean squared error (MSE)
mean absolute error (MAE); formulas for both are sketched below
"""classification problem
misclassification error
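Standard textbook forms of the two regression measures (m = number of rows, \hat{y}_i = the prediction for row i); not copied verbatim from Bowles:

\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2
\qquad
\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m} \lvert y_i - \hat{y}_i \rvert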
Factors Driving Algorithm Choices and Performance—Complexity and Data p.79
"""factors [which] affect the overall performance of a predictive algorithm
complexity of the problem
complexity of the model used
amount of training data available
Contrast Between a Simple Problem and a Complex Problem
out-of-sample error
problem complexity
complexity of the decision boundaries
mixture model
the points in Figure 3-2 are drawn from several distributions for the light points and from several different distributions for the dark points.
Contrast Between a Simple Model and a Complex Model
Figure 3-4: Linear model fit to simple data
Figure 3-5: Linear model fit to complex data
Figure 3-6: Ensemble model fit to complex data
Figure 3-7: Linear model fit to small sample of complex data
Figure 3-8: Ensemble model fit to small sample of complex data
Factors Driving Predictive Algorithm Performance
shape of the data
aspect ratio
In biology, genomic data sets can easily contain 10,000 to 50,000 attributes. Even with tens of thousands of individual experiments (rows of data), a genomic data set may not be enough to train a complex ensemble model. A linear model may give equivalent or better performance.
In some natural language processing problems, the attributes are words, and rows are documents. Entries in the matrix of attributes are the number of times a word appears in a document. The number of columns is the vocabulary size for a document collection. Depending on preprocessing (for example, removing common words like a, and, and of), the vocabulary can be from a few thousand to a few tens of thousands. The attribute matrix for text becomes very wide when n-grams are counted alongside words.
Once again, a linear model may give equivalent or better performance than a more complicated ensemble model.
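A minimal sketch of the document-by-word count matrix described above; CountVectorizer from scikit-learn is my choice of tool here, not something the chapter prescribes:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# rows = documents, columns = vocabulary words, entries = word counts per document
vec = CountVectorizer(stop_words="english")     # drop common words like "the", "on"
X = vec.fit_transform(docs)                     # sparse matrix, shape (n_docs, vocab_size)

# counting 2-grams alongside single words widens the matrix considerably
vec2 = CountVectorizer(ngram_range=(1, 2))
X2 = vec2.fit_transform(docs)
print(X.shape, X2.shape)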
Choosing an Algorithm: Linear or Nonlinear?
Linear models are preferable when the data set has more columns than rows or when the underlying problem is simple.
Nonlinear models are preferable for complex problems with many more rows than columns of data.
training time
- Fast linear techniques train much faster than nonlinear techniques.
Measuring the Performance of Predictive Models p.88
Performance Measures for Different Types of Problems
Performance measures for regression problems
- mean squared error (MSE)
- mean absolute error (MAE)
- root MSE (RMSE, which is the square root of MSE)
- Listing 3-1: Comparison of MSE, MAE and RMSE—regressionErrorMeasures.py
- variance (mean squared deviation from the mean)
- standard deviation (square root of variance)
- """For example, if the MSE of the prediction error is roughly the same as the target variance (or the RMSE is roughly the same as target standard deviation), the prediction algorithm is not performing well. You could replace the prediction algorithm with a simple calculation of the mean of the targets and perform as well.
- """The errors in Listing 3-1 have RMSE that’s about half the standard deviation of the targets. That is fairly good performance.
- histogram of the error (see the sketch after this list)
- tail behavior (quantile or decile boundaries)
- degree of normality
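A small sketch of the RMSE-versus-target-standard-deviation comparison quoted above, plus the error histogram; the numbers are made-up toy values, not the data from Listing 3-1:

import numpy as np
import matplotlib.pyplot as plt

targets     = np.array([3.0, 5.0, 4.5, 6.0, 5.5, 4.0])
predictions = np.array([3.4, 4.6, 4.9, 5.5, 5.8, 4.2])
error = targets - predictions

rmse = np.sqrt(np.mean(error**2))
mae  = np.mean(np.abs(error))

# baseline check: always predicting the mean of the targets gives RMSE equal to the
# target standard deviation, so a useful model needs RMSE well below np.std(targets)
print(rmse, mae, np.std(targets))

plt.hist(error, bins=10)    # shape of the error distribution, tails, rough normality
plt.show()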
Classification problems
- misclassification error rates
- Generally, algorithms for doing classification can present predictions in the form of a probability instead of a hard click versus not-click decision. The algorithms considered in this book all output probabilities ... the data scientist has the option to use 50 percent as a threshold
- confusion matrix or contingency table
- confusionMatrix() ... takes the predictions, the corresponding actual values (labels), and a threshold value as input (a simplified version is sketched at the end of this list)
- receiver operating characteristic (ROC)
- The ROC curve plots the true positive rate (abbreviated TPR) versus the false positive rate (FPR).
- area under the curve (AUC)
- A perfect classifier has an AUC of 1.0
- random guessing has an AUC of 0.5
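A hedged sketch of a confusionMatrix-style helper as described above; the real function is in Bowles' Listing 3-2, and this version only shows the idea of thresholding predicted probabilities and counting the four cells:

def confusion_matrix_counts(predicted, actual, threshold):
    # predicted: probabilities of the positive class; actual: 0/1 labels (1 = positive)
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        label = 1 if p >= threshold else 0
        if label == 1 and a == 1:
            tp += 1
        elif label == 1 and a == 0:
            fp += 1
        elif label == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_matrix_counts([0.2, 0.7, 0.9, 0.4], [0, 1, 1, 1], 0.5)
tpr = tp / (tp + fn)    # true positive rate, y-axis of the ROC curve
fpr = fp / (fp + tn)    # false positive rate, x-axis of the ROC curve
misclassification_rate = (fp + fn) / (tp + fp + tn + fn)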
Simulating Performance of Deployed Models
training set
test set
[validation set?]
- used for n-fold cross-validation?
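A minimal sketch of the holdout split and of n-fold cross-validation, answering the open question above; KFold and train_test_split from scikit-learn are my choice of tools, not necessarily what Bowles uses in this chapter:

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 5)
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + 0.1 * np.random.randn(100)

# holdout: fit on the training set, estimate deployed performance on the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(np.sqrt(np.mean((y_test - model.predict(X_test))**2)))    # out-of-sample RMSE

# n-fold cross-validation: every row serves as test data exactly once
fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - m.predict(X[test_idx]))**2)))
print(np.mean(fold_rmse))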
CODE
- Listing 3-1: Comparison of MSE, MAE and RMSE—regressionErrorMeasures.py
- Figure 3-9: Confusion matrix example
- Listing 3-2: Measuring Performance for Classifier Trained on Rocks-Versus-Mines—classifierPerformance_RocksVMines.py
- Table 3-2: Dependence of Misclassification Error on Decision Threshold
- Table 3-3: Cost of Mistakes for Different Decision Thresholds
- Figure 3-10: In-sample ROC for rocks-versus-mines classifier
- Figure 3-11: Out-of-sample ROC for rocks-versus-mines classifier
Achieving Harmony Between Model and Data p.101
Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size
Using Forward Stepwise Regression to Control Overfitting
algorithm for best subset selection
impose a constraint (say nCol) on the number of columns; a skeleton of the greedy selection loop is sketched below
Figure 3-13: Wine quality prediction error using forward stepwise regression
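A hedged skeleton of the forward stepwise idea (greedily add one column at a time, keeping whichever addition gives the lowest out-of-sample error); Bowles' actual implementation is Listing 3-3, and this stand-in uses scikit-learn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X_train, y_train, X_test, y_test, nCol):
    # greedy forward selection of at most nCol columns
    chosen, error_curve = [], []
    remaining = list(range(X_train.shape[1]))
    for _ in range(nCol):
        best_err, best_col = None, None
        for col in remaining:
            cols = chosen + [col]
            model = LinearRegression().fit(X_train[:, cols], y_train)
            err = np.sqrt(np.mean((y_test - model.predict(X_test[:, cols]))**2))
            if best_err is None or err < best_err:
                best_err, best_col = err, col
        chosen.append(best_col)
        remaining.remove(best_col)
        error_curve.append(best_err)    # error versus number of attributes, as in Figure 3-13
    return chosen, error_curve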
Evaluating and Understanding Your Predictive Model
Several other plots are helpful in understanding the performance of a trained algorithm and can point the way to making improvements in its performance.
- Figure 3-14: Actual taste scores versus predictions generated with forward stepwise regression
- Figure 3-15: Histogram of wine taste prediction error with forward stepwise regression
The number of attributes to be incorporated in the solution can be called a complexity parameter. Models with larger complexity parameters have more free parameters and are more likely to overfit the data than less-complex models.
Control Overfitting by Penalizing Regression Coefficients—Ridge Regression
first introduction to penalized linear regression
coefficient penalized regression [...] making all the coefficients smaller instead of making some of them zero.
Equation 3-15: Ridge regression minimization problem (a reconstruction is sketched after this list)
- [ridge penalty:] the β^T β term is the square of the Euclidean norm of β (the vector of coefficients)
- If α = 0, the problem becomes ordinary least squares regression. When α becomes large, β (the vector of coefficients) approaches zero, and only the constant term β_0 is available to predict the labels y_i
- see also "Equation 4-7: Penalty applied to coefficients (betas)" (p. 128)
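A reconstruction of the objective referred to as Equation 3-15, in the standard ridge form (the exact normalization Bowles uses should be checked against the book):

\beta_0^*, \beta^* = \underset{\beta_0,\,\beta}{\arg\min} \sum_{i=1}^{m} \bigl(y_i - (\beta_0 + x_i \beta)\bigr)^2 + \alpha\, \beta^T \beta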
Listing 3-5: Predicting Wine Taste with Ridge Regression—ridgeWine.py
- alphaList = [0.1**i for i in [0, 1, 2, 3, 4, 5, 6]]
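A minimal sketch of the alpha sweep from Listing 3-5, using scikit-learn's Ridge on stand-in data (Bowles' listing runs this on the wine data set):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 10)
y = X @ np.random.rand(10) + 0.1 * np.random.randn(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

alphaList = [0.1**i for i in [0, 1, 2, 3, 4, 5, 6]]    # same sweep as in the listing
for alpha in alphaList:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = np.sqrt(np.mean((y_test - model.predict(X_test))**2))
    print(-np.log10(alpha), rmse)    # Figure 3-16 plots RMSE against -log(alpha); log base assumed here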
Figure 3-16: Wine quality prediction error using ridge regression
- x-axis: -log(alpha)
- y-axis: RMS error
Figure 3-17: Actual taste scores versus predictions generated with ridge regression
Figure 3-18: Histogram of wine taste prediction error with ridge regression
CODE
- Listing 3-3: Forward Stepwise Regression: Wine Quality Data—fwdStepwiseWine.py
- Figure 3-13: Wine quality prediction error using forward stepwise regression
- Listing 3-4: Forward Stepwise Regression Output—fwdStepwiseWineOutput.txt
- Figure 3-14: Actual taste scores versus predictions generated with forward stepwise regression
- Figure 3-15: Histogram of wine taste prediction error with forward stepwise regression
- Listing 3-5: Predicting Wine Taste with Ridge Regression—ridgeWine.py
- Figure 3-16: Wine quality prediction error using ridge regression
- Listing 3-6: Ridge Regression Output—ridgeWineOutput.txt
- Figure 3-17: Actual taste scores versus predictions generated with ridge regression
- Figure 3-18: Histogram of wine taste prediction error with ridge regression
- Listing 3-7: Rocks Versus Mines Using Ridge Regression—classifierRidgeRocksVMines.py
- Listing 3-8: Output from Classification Model for Rocks Versus Mines Using Ridge Regression—classifierRidgeRocksVMinesOutput.txt
- Figure 3-19: AUC for the rocks-versus-mines classifier using ridge regression
- Figure 3-20: Plot of actual versus prediction for the rocks-versus-mines classifier using ridge regression