CRISP-DM phases, tasks, outputs
Figure 3: Generic tasks (bold) and outputs (italic) of the CRISP-DM reference model (CRISP-DM 1.0, p. 12)
CRISP-DM 1.0: Step-by-step data mining guide. Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR), Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler), Colin Shearer (SPSS), and Rüdiger Wirth (DaimlerChrysler). © 2000 SPSS Inc. CRISPMWP-1104
Business Understanding
Determine Business Objectives
Business Objectives
Business Success Criteria
Assess Situation
Inventory of Resources
Requirements, Assumptions, and Constraints
Risks and Contingencies
Compile a glossary of terminology relevant to the project. This should include at least two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding
(2) A glossary of data mining terminology, illustrated with examples relevant to the business problem in question
Costs and Benefits
Determine Data Mining Goals
Data Mining Goals
Data Mining Success Criteria
Data Understanding
Collect Initial Data
Initial Data Collection Report
Describe Data
Data Description Report
Volumetric analysis of data
Identify data and method of capture
Access data sources
Use statistical analyses if appropriate
Report tables and their relations
Check data volume, number of multiples, complexity
Note if the data contain free text entries
Attribute types and values
Check accessibility and availability of attributes
Check attribute types (numeric, symbolic, taxonomy, etc.)
Check attribute value ranges
Analyze attribute correlations
Understand the meaning of each attribute and attribute value in business terms
For each attribute, compute basic statistics (e.g., compute distribution, average, max, min, standard deviation, variance, mode, skewness, etc.)
Analyze basic statistics and relate the results to their meaning in business terms
Decide if the attribute is relevant for the specific data mining goal
Determine if the attribute meaning is used consistently
Interview domain experts to obtain their opinion of attribute relevance
Decide if it is necessary to balance the data (based on the modeling techniques to be used)
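The basic statistics named in the checklist above can be sketched with Python's standard library. This is a minimal illustration, not part of the CRISP-DM guide: the function name `describe_attribute` and the sample `ages` values are hypothetical, and the population (not sample) formulas are an assumption chosen for simplicity.

```python
import statistics
from collections import Counter


def describe_attribute(values):
    """Return basic statistics for one numeric attribute (a list of numbers)."""
    n = len(values)
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumption)
    # Population skewness: mean cubed deviation divided by stdev cubed.
    if stdev > 0:
        skewness = sum((v - mean) ** 3 for v in values) / n / stdev ** 3
    else:
        skewness = 0.0  # all values identical, no asymmetry to measure
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": statistics.pvariance(values),
        "stdev": stdev,
        "mode": statistics.mode(values),
        "skewness": skewness,
        "distribution": Counter(values),  # value -> frequency
    }


# Hypothetical attribute: customer ages.
ages = [23, 35, 35, 41, 52, 35, 29, 64]
summary = describe_attribute(ages)
```

The numbers alone do not complete the activity: each figure still has to be related to its business meaning, for example whether a positive skewness in age reflects a real customer segment or a data entry problem.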
Analyze key relationships
Check amount of overlaps of key attribute values across tables
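One way to check the overlap of key attribute values across two tables is to compare the sets of keys. The sketch below assumes each table's key column has already been read into a list; `key_overlap`, `customers`, and `orders` are hypothetical names used only for illustration.

```python
def key_overlap(keys_a, keys_b):
    """Compare the distinct key values of two tables."""
    a, b = set(keys_a), set(keys_b)
    shared = a & b
    return {
        "shared": len(shared),
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
        # Fraction of each table's distinct keys that also occur in the other.
        "share_of_a": len(shared) / len(a) if a else 0.0,
        "share_of_b": len(shared) / len(b) if b else 0.0,
    }


# Hypothetical key columns from a customer table and an order table.
customers = [101, 102, 103, 104]
orders = [102, 103, 103, 105]
overlap = key_overlap(customers, orders)
```

A low share in either direction suggests that a later join between the tables would drop many records, which is worth noting in the data description report.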
Explore Data
Data Exploration Report
Verify Data Quality
Identify special values and catalog their meaning
Review keys, attributes
Check coverage (e.g., whether all possible values are represented)
Check keys
Verify that the meanings of attributes and contained values fit together
Identify missing attributes and blank fields
Establish the meaning of missing data
Check for attributes with different values that have similar meanings (e.g., low fat, diet)
Check spelling and format of values (e.g., same value but sometimes beginning with a lower-case letter, sometimes with an upper-case letter)
Check for deviations, and decide whether a deviation is “noise” or may indicate an interesting phenomenon
Check for plausibility of values (e.g., all fields having the same or nearly the same values)
Good idea!
Review any attributes that give answers that conflict with common sense (e.g., teenagers with high income levels).
Use visualization plots, histograms, etc. to reveal inconsistencies in the data.
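Two of the checks listed above, counting blank fields and spotting values that differ only in letter case, can be sketched as follows. The set of missing-value markers, the function names, and the sample records are all assumptions for illustration; a real project would take the markers from its own catalog of special values.

```python
from collections import defaultdict

# Assumed markers for missing data; replace with the project's own catalog.
MISSING_MARKERS = {"", "na", "n/a", "null", "?"}


def is_missing(value):
    """Treat None and the assumed marker strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def count_missing(records, attribute):
    """Count records whose value for the attribute is missing or blank."""
    return sum(1 for record in records if is_missing(record.get(attribute)))


def case_variants(records, attribute):
    """Group spellings that differ only in letter case or surrounding spaces."""
    groups = defaultdict(set)
    for record in records:
        value = record.get(attribute)
        if isinstance(value, str) and not is_missing(value):
            groups[value.strip().lower()].add(value.strip())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hypothetical records.
records = [
    {"product": "Low fat", "region": "North"},
    {"product": "low fat", "region": ""},
    {"product": "Diet", "region": "N/A"},
    {"product": "Low fat", "region": "South"},
]
missing_regions = count_missing(records, "region")
variants = case_variants(records, "product")
```

This sketch only catches formatting differences. Values with different spellings but similar meanings, such as "low fat" and "diet", still need a manually maintained mapping or input from domain experts.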
Data quality in flat files
If data are stored in flat files, check which delimiter is used and whether it is used consistently within all attributes
If data are stored in flat files, check the number of fields in each record to see if they coincide
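The flat-file checks above can be approximated by parsing the file with one assumed delimiter and counting the fields per record. A record that uses a different delimiter then shows up as a record with too few fields. The function names and the sample text are hypothetical; the sketch uses Python's `csv` module and treats the first non-empty record as the reference.

```python
import csv
import io
from collections import Counter


def parse_records(text, delimiter=","):
    """Parse flat-file text into records, skipping empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row]


def field_count_profile(text, delimiter=","):
    """Map each observed field count to the number of records that have it."""
    return Counter(len(row) for row in parse_records(text, delimiter))


def inconsistent_records(text, delimiter=","):
    """Return 1-based positions of records whose field count differs from the first."""
    rows = parse_records(text, delimiter)
    expected = len(rows[0]) if rows else 0
    return [i for i, row in enumerate(rows, start=1) if len(row) != expected]


# Hypothetical file content: record 3 lacks a field, record 4 uses ';'.
sample = "id,name,city\n1,Ann,Oslo\n2,Bob\n3;Cy;Rome\n"
profile = field_count_profile(sample)
bad = inconsistent_records(sample)
```

A profile with more than one distinct field count is a signal to inspect the flagged records before any further processing.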
Noise and inconsistencies between sources
Check consistencies and redundancies between different sources
Plan for dealing with noise
Detect the type of noise and which attributes are affected
Good idea!
Remember that it may be necessary to exclude some data because they exhibit neither positive nor negative behavior (e.g., to check on customers’ loan behavior, exclude those who have never borrowed, those who do not finance a home mortgage, those whose mortgage is nearing maturity, etc.).
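For the consistency check between different sources listed above, one simple approach is to compare an attribute's value for every key that appears in both sources. The sketch assumes each source has been loaded as a dictionary from key to record; `conflicting_values`, `crm`, and `billing` are hypothetical names.

```python
def conflicting_values(source_a, source_b, attribute):
    """Return keys present in both sources whose attribute values disagree."""
    conflicts = {}
    for key in source_a.keys() & source_b.keys():
        value_a = source_a[key].get(attribute)
        value_b = source_b[key].get(attribute)
        if value_a != value_b:
            conflicts[key] = (value_a, value_b)
    return conflicts


# Hypothetical sources keyed by customer ID.
crm = {1: {"city": "Oslo"}, 2: {"city": "Rome"}, 3: {"city": "Bern"}}
billing = {1: {"city": "Oslo"}, 2: {"city": "Roma"}, 4: {"city": "Lyon"}}
conflicts = conflicting_values(crm, billing, "city")
```

Whether a disagreement such as "Rome" versus "Roma" is noise or a genuine difference is a judgment for the data quality report; the code only locates the candidates.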
Data Quality Report
Data Preparation
Select Data
Rationale for Inclusion/Exclusion
Clean Data
Data Cleaning Report
Construct Data
Derived Attributes
Generated Records
Integrate Data
Merged Data
Format Data
Reformatted Data
Dataset Description
Modeling
Select Modeling Techniques
Modeling Technique
Modeling Assumptions
Generate Test Design
Test Design
Build Model
Parameter Settings
Model Descriptions
Assess Model
Model Assessment
Revised Parameter Settings
Evaluation
Evaluate Results
Assessment of Data Mining Results w.r.t. Business Success Criteria
Approved Models
Review Process
Review of Process
Determine Next Steps
List of Possible Actions
Deployment
Plan Deployment
Deployment Plan
Plan Monitoring and Maintenance
Monitoring and Maintenance Plan
Produce Final Report
Final Report
Final Presentation
Review Project
Experience Documentation