Factors Driving Algorithm Choices and Performance—Complexity and Data

"""factors [which] affect the overall performance of a predictive algorithm

complexity of the problem

complexity of the model used

amount of training data available

Contrast Between a Simple Problem and a Complex Problem

out-of-sample error

problem complexity

complexity of the decision boundaries

mixture model

the points in Figure 3-2 are drawn from several distributions for the light points and several different ones for dark.

Contrast Between a Simple Model and a Complex Model

Figure 3-4: Linear model fit to simple dat

Figure 3-5: Linear model fit to complex data

Figure 3-6: Ensemble model fit to complex data

Figure 3-7: Linear model fit to small sample of complex data

Figure 3-8: Ensemble model fit to small sample of complex data

Factors Driving Predictive Algorithm Performance

shape of the data

aspect ratio

In biology, genomic data sets can easily contain 10,000 to 50,000 attributes. Even with tens of thousands of individual experiments (rows of data), a genomic data set may not be enough to train a complex ensemble model. A linear model may give equivalent or better performance.

In some natural language processing problems, the attributes are words, and rows are documents. Entries in the matrix of attributes are the number of times a word appears in a document. The number of columns is the vocabulary size for a docu- ment collection. Depending on preprocessing (for example, removing common words like a, and, and of), the vocabulary can be from a few thousand to a few tens of thousands. The attribute matrix for text becomes very wide when n-grams are counted alongside words.

Once again, a linear model may give equivalent or better performance than a more complicated ensemble model.

Choosing an Algorithm: Linear or Nonlinear?

Linear models are preferable when the data set has more columns than rows or when the underlying problem is simple.

Nonlinear models are preferable for complex problems with many more rows than columns of data.

training time

Fast linear techniques train much faster than nonlinear techniques.