Factors Driving Algorithm Choices and Performance—Complexity and Data
factors [which] affect the overall performance of a predictive algorithm
complexity of the problem
complexity of the model used
amount of training data available
Contrast Between a Simple Problem and a Complex Problem
out-of-sample error
problem complexity
complexity of the decision boundaries
mixture model
The points in Figure 3-2 are drawn from several distributions for the light points and several different ones for the dark points.
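To make the mixture-model idea concrete, here is a minimal sketch of how such a two-class data set could be generated. It is not the book's code: the function name `sample_mixture`, the choice of Gaussian components, and the center coordinates are all assumptions made for illustration.

```python
import random


def sample_mixture(centers, n_per_center, spread=1.0, seed=0):
    """Draw 2-D points from an equal-weight mixture of Gaussians.

    Hypothetical helper: each (cx, cy) in `centers` is the mean of one
    component, and `spread` is a shared standard deviation (an assumption).
    """
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    points = []
    for cx, cy in centers:
        for _ in range(n_per_center):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


# Illustrative centers only: several components per class, interleaved so
# that no single straight line separates light from dark.
light_centers = [(0.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
dark_centers = [(4.0, 0.0), (2.0, 2.0), (2.0, 6.0)]

light = sample_mixture(light_centers, n_per_center=50, seed=1)
dark = sample_mixture(dark_centers, n_per_center=50, seed=2)
```

Because each class comes from several components, the decision boundary between them is curved and fragmented, which is what makes the problem complex in the sense used here.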
Contrast Between a Simple Model and a Complex Model
Figure 3-4: Linear model fit to simple data
Figure 3-5: Linear model fit to complex data
Figure 3-6: Ensemble model fit to complex data
Figure 3-7: Linear model fit to small sample of complex data
Figure 3-8: Ensemble model fit to small sample of complex data
Factors Driving Predictive Algorithm Performance
shape of the data
aspect ratio
In biology, genomic data sets can easily contain 10,000 to 50,000 attributes. Even with tens of thousands of individual experiments (rows of data), a genomic data set may not be enough to train a complex ensemble model. A linear model may give equivalent or better performance.
In some natural language processing problems, the attributes are words, and rows are documents. Entries in the matrix of attributes are the number of times a word appears in a document. The number of columns is the vocabulary size for a document collection. Depending on preprocessing (for example, removing common words like a, and, and of), the vocabulary can be from a few thousand to a few tens of thousands. The attribute matrix for text becomes very wide when n-grams are counted alongside words.
Once again, a linear model may give equivalent or better performance than a more complicated ensemble model.
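The text example can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a production tokenizer: the helper names (`tokenize`, `ngrams`, `count_matrix`), the whitespace splitting, and the toy documents are all made up for this note.

```python
from collections import Counter


def tokenize(text, stop_words):
    """Lowercase, split on whitespace, drop stop words (a simplification)."""
    return [w for w in text.lower().split() if w not in stop_words]


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(docs, stop_words=frozenset(), max_n=1):
    """Build a document-by-term count matrix.

    Rows are documents, columns are vocabulary terms (all n-grams up to
    `max_n`), and each entry is how often the term occurs in the document.
    """
    per_doc = []
    for doc in docs:
        tokens = tokenize(doc, stop_words)
        terms = []
        for n in range(1, max_n + 1):
            terms.extend(ngrams(tokens, n))
        per_doc.append(Counter(terms))
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix


# Toy documents, invented for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log", "a cat and a dog"]
stop_words = {"a", "and", "of", "the"}

words_only, m1 = count_matrix(docs, stop_words, max_n=1)
with_bigrams, m2 = count_matrix(docs, stop_words, max_n=2)
print(len(docs), "rows;", len(words_only), "columns with words only;",
      len(with_bigrams), "columns with bigrams added")
```

Even on three tiny documents, counting bigrams alongside words doubles the number of columns while the number of rows stays fixed, which is the widening effect described above.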
Choosing an Algorithm: Linear or Nonlinear?
Linear models are preferable when the data set has more columns than rows or when the underlying problem is simple.
Nonlinear models are preferable for complex problems with many more rows than columns of data.
training time
Fast linear techniques train much faster than nonlinear techniques.
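The rule of thumb for choosing between linear and nonlinear models can be written down as a small helper. This is only a sketch of the heuristic in these notes: the function names are hypothetical, and the `tall_ratio` threshold for "many more rows than columns" is an arbitrary assumption, not a value from the source.

```python
def aspect_ratio(n_rows, n_cols):
    """Rows per column: large values mean a tall, narrow data set."""
    return n_rows / n_cols


def suggest_model_family(n_rows, n_cols, problem_is_complex, tall_ratio=10.0):
    """Suggest a starting model family from data shape and problem complexity.

    `tall_ratio` is an assumed cutoff for "many more rows than columns";
    tune it or replace it with a cross-validated comparison.
    """
    # More columns than rows, or a simple problem: prefer a linear model.
    if n_cols > n_rows or not problem_is_complex:
        return "linear"
    # Complex problem with many more rows than columns: prefer nonlinear.
    if aspect_ratio(n_rows, n_cols) >= tall_ratio:
        return "nonlinear"
    # In between, the notes give no clear answer.
    return "try both"


# Example shapes (illustrative numbers only).
print(suggest_model_family(20_000, 30_000, problem_is_complex=True))
print(suggest_model_family(1_000_000, 50, problem_is_complex=True))
print(suggest_model_family(1_000, 500, problem_is_complex=True))
```

Since linear techniques train much faster, fitting a linear model first is a cheap baseline even when the heuristic points toward a nonlinear one.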