Chapter 2: Understand the Problem by Understanding the Data

The Anatomy of a New Problem

Different Types of Attributes and Labels Drive Modeling Choices

  • numeric (inc. ordinal)
  • categorical
    • no order relation
  • """When the labels are numeric, the problem is called a regression problem. When the labels are categorical, the problem is called a classification problem. If the categorical target takes only two values, the problem is called a binary classification problem. If it takes more than two values, the problem is called a multiclass classification problem.

Things to Notice about Your New Data Set

  • Items to Check
    • Number of rows and columns
    • Number of categorical variables and number of unique values for each
    • Number of Missing values
    • Summary statistics for attributes and labels
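The checklist above can be sketched with pandas. This is only a minimal illustration: the toy data frame and its column names ("length", "sex", "rings") are made up for the example and are not taken from the book.

```python
import pandas as pd

# Hypothetical toy data standing in for a newly loaded data set.
df = pd.DataFrame({
    "length": [0.45, 0.35, None, 0.44],
    "sex": ["M", "F", "I", "M"],
    "rings": [15, 7, 9, 10],
})

# Number of rows and columns.
n_rows, n_cols = df.shape

# Categorical (non-numeric) columns and the number of unique values in each.
categorical = df.select_dtypes(exclude="number")
unique_counts = categorical.nunique()

# Number of missing values per column.
missing_counts = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) for the numeric columns.
summary = df.describe()

print(n_rows, n_cols)
print(unique_counts)
print(missing_counts)
print(summary)
```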

Classification Problems: Detecting Unexploded Mines Using Sonar

Physical Characteristics of the Rocks Versus Mines Data Set

Estimating run times

The second important observation regarding row and column counts is that if the data set has many more columns than rows, you may be more likely to get the best prediction with penalized linear regression.

Statistical Summaries of the Rocks versus Mines Data Set

"""descriptive statistics for the numeric variables and a count of the unique categories in each categorical attribute

Visualization of Outliers Using Quantile‐Quantile Plot

"""... outliers ... the last quartile has a range of 4.6, which is 100 times larger than the range of the other quartiles.

import pylab
import scipy.stats as stats

stats.probplot(colData, dist="norm", plot=pylab)
pylab.show()
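The quartile comparison quoted in this section can be reproduced numerically before plotting anything. The sketch below assumes NumPy and uses invented values with one extreme entry; it is not the book's data.

```python
import numpy as np

# Hypothetical attribute column with a single extreme value at the end.
col = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 5.0])

# Quartile boundaries: minimum, 25th, 50th, 75th percentile, maximum.
bounds = np.percentile(col, [0, 25, 50, 75, 100])

# Width of each quartile; a much wider last quartile hints at outliers.
ranges = np.diff(bounds)

print(bounds)
print(ranges)
```

A last quartile that is many times wider than the others is the same signal the quantile-quantile plot shows as points bending away from the straight line.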

Statistical Characterization of Categorical Attributes

"""check how many categories they have and how many examples there are from each category.

"""The popular Random Forests package written by Breiman and Cutler (the inventors of the algorithm) has a cutoff of 32 categories. If an attribute has more than 32 categories, you’ll need to aggregate them.

stratified sampling
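Two ideas from this section, aggregating categories when there are more than a cutoff allows and sampling in a stratified way so that rare categories stay represented, can be sketched in plain Python. The function names `aggregate_rare` and `stratified_sample`, the "other" bucket, and the toy data are assumptions made for illustration, not the book's code.

```python
import random
from collections import Counter, defaultdict


def aggregate_rare(values, max_categories, other="other"):
    """Keep the most frequent categories and merge the rest into one bucket."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        return list(values)
    # Reserve one slot for the merged bucket.
    keep = {cat for cat, _ in counts.most_common(max_categories - 1)}
    return [v if v in keep else other for v in values]


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every category given by key(row)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # At least one example per category, so rare ones are not lost.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 80 examples labelled "M" and 20 labelled "R".
rows = [("M", i) for i in range(80)] + [("R", i) for i in range(20)]
sample = stratified_sample(rows, key=lambda row: row[0], fraction=0.25)
print(Counter(label for label, _ in sample))

merged = aggregate_rare(["a"] * 5 + ["b"] * 3 + ["c", "d", "e"], max_categories=3)
print(Counter(merged))
```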

How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set

"""You can think of a data frame as a table or matrix-like structure as in Table 2-1. The data frame is oriented with a row representing a single case (experiment, example, measurement) and columns representing particular attributes. The structure is matrix-like, but not a matrix because the elements in various columns may be of different types. Formally, a matrix is defined over a field (like the real numbers, binary numbers, complex numbers), and all the entries in a matrix are elements from that field.

The data frame structure enables access to individual elements through an index, roughly similar to addressing an entry in a NumPy array or a list of lists.

Similarly, index slicing can be used to address an entire row or column from the array.

In addition, the Pandas data frame enables addressing rows and columns by means of their names.
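The three ways of addressing a data frame described above (single element by index, slices for whole rows or columns, and access by name) look roughly like this in pandas. The small frame, its row labels "a" to "c", and the column names "V0", "V1", "label" are hypothetical.

```python
import pandas as pd

# Hypothetical miniature data frame with named rows and columns.
df = pd.DataFrame(
    {
        "V0": [0.02, 0.05, 0.03],
        "V1": [0.04, 0.08, 0.06],
        "label": ["R", "M", "R"],
    },
    index=["a", "b", "c"],
)

# Single element by integer position (row 1, column 0).
elem_by_pos = df.iloc[1, 0]

# Entire row and entire column by position, using slicing.
row_by_pos = df.iloc[1]
col_by_pos = df.iloc[:, 0]

# The same element and a column addressed by name instead.
elem_by_name = df.loc["b", "V0"]
col_by_name = df["label"]

print(elem_by_pos, elem_by_name)
print(list(col_by_name))
```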

Visualizing Properties of the Rocks versus Mines Data Set

Visualizing with Parallel Coordinates Plots

Visualizing Interrelationships between Attributes and Labels

crossplot the attributes with the labels / scatter plots

  • Figures 2-4 and 2-5 show the scatter plots for two pairs of attributes from the rocks versus mines data set
  • correlation
  • [JB: feature engineering: delta x, i.e. take the derivative?]

Pearson’s correlation coefficient

  • Equation 2-2: Average values of the entries in u
  • Equation 2-3: Subtract the average from each element in u
  • Equation 2-4: Definition of Pearson’s correlation coefficient
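The three steps listed above map onto a few lines of plain Python. The notes only record the equation titles, so the sketch assumes the standard definition: the correlation is the dot product of the two mean-centered vectors divided by the product of their lengths. The function names are chosen for illustration.

```python
from math import sqrt


def mean(u):
    """Average value of the entries in u (cf. Equation 2-2)."""
    return sum(u) / len(u)


def centered(u):
    """Subtract the average from each element in u (cf. Equation 2-3)."""
    m = mean(u)
    return [x - m for x in u]


def pearson(u, v):
    """Pearson's correlation of u and v (standard definition assumed)."""
    du, dv = centered(u), centered(v)
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


print(pearson([1, 2, 3], [2, 4, 6]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

The sketch does not guard against a constant vector, for which the denominator is zero and the correlation is undefined.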

Visualizing Attribute and Label Correlations Using a Heat Map

Perfect correlation (correlation = 1) between attributes means that you may have made a mistake and included the same thing twice.

Very high correlation between a set of attributes (pairwise correlations > 0.7) is known as multicollinearity and can lead to unstable estimates.
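A quick way to act on the 0.7 rule of thumb is to list the attribute pairs whose correlation exceeds it. The sketch below assumes pandas and made-up columns "a", "b", "c"; comparing the absolute value, so that strong negative correlations are flagged as well, is an assumption that goes slightly beyond the note above.

```python
import pandas as pd

# Hypothetical attributes: "b" is an exact multiple of "a".
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

corr = df.corr()  # pairwise Pearson correlations
threshold = 0.7
cols = list(corr.columns)

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]

for first, second, value in high_pairs:
    print(first, second, round(value, 3))
```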

Real‐Valued Predictions with Factor Variables: How Old Is Your Abalone?

"""Normalization in this case means centering and scaling each column so that a unit of attribute number 1 means the same thing as a unit of attribute number 2. """ (55)

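Centering and scaling each column, as described in the quote above, can be sketched with NumPy. The values are invented, and dividing by the population standard deviation (NumPy's default) is an assumption; the sample standard deviation would work the same way.

```python
import numpy as np

# Hypothetical attributes on very different scales (one per column).
X = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

# Center each column on zero, then scale it to unit standard deviation.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm)
```

After this step both columns hold the same values, which is what it means for a unit of one attribute to be comparable to a unit of another.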
Parallel Coordinates for Regression Problems—Visualize Variable Relationships for Abalone Problem

How to Use Correlation Heat Map for Regression—Visualize Pair‐Wise Correlations for the Abalone Problem

Real‐Valued Predictions Using Real‐Valued Attributes: Calculate How Your Wine Tastes

Multiclass Classification Problem: What Type of Glass Is That?

outlier behavior

  • One is that the problem is a classification problem. There’s not necessarily any continuity in relationship between attribute values and class membership; there is no reason to expect proximity of attribute values across classes.
  • Another unique feature of the glass data is that it is somewhat unbalanced.
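How unbalanced a set of class labels is can be checked by counting the examples per class. The labels below are invented for illustration and are not the glass data; `class_shares` is a hypothetical helper name.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of examples that falls into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: class "a" is six times as common as class "c".
labels = ["a"] * 6 + ["b"] * 3 + ["c"]
shares = class_shares(labels)

counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(imbalance_ratio)
```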

Summary