CRISP-DM phases, tasks, outputs

Figure 3: Generic tasks (bold) and outputs (italic) of the CRISP-DM reference model (CRISP-DM 1.0, p. 12)

CRISP-DM 1.0: Step-by-step data mining guide. Pete Chapman (NCR), Julian Clinton (SPSS), Randy Kerber (NCR), Thomas Khabaza (SPSS), Thomas Reinartz (DaimlerChrysler), Colin Shearer (SPSS), and Rüdiger Wirth (DaimlerChrysler). © 2000 SPSS Inc. CRISPMWP-1104

https://www.the-modeling-agency.com/crisp-dm.pdf

Business Understanding

  • Determine Business Objectives

    • Background

    • Business Objectives

    • Business Success Criteria

  • Assess Situation

    • Inventory of Resources

    • Requirements, Assumptions, and Constraints

    • Risks and Contingencies

    • Terminology

      • Compile a glossary of terminology relevant to the project. This should include at least two components:

      • (1) A glossary of relevant business terminology, which forms part of the business understanding

      • (2) A glossary of data mining terminology, illustrated with examples relevant to the business problem in question

    • Costs and Benefits

  • Determine Data Mining Goals

    • Data Mining Goals

    • Data Mining Success Criteria

Data Understanding

  • Collect Initial Data

    • Initial Data Collection Report

  • Describe Data

    • Data Description Report

    • Activities

      • Volumetric analysis of data

        • Identify data and method of capture

        • Access data sources

        • Use statistical analyses if appropriate

        • Report tables and their relations

        • Check data volume, number of multiples, complexity

        • Note if the data contain free text entries

      • Attribute types and values

        • Check accessibility and availability of attributes

        • Check attribute types (numeric, symbolic, taxonomy, etc.)

        • Check attribute value ranges

        • Analyze attribute correlations

        • Understand the meaning of each attribute and attribute value in business terms

        • For each attribute, compute basic statistics (e.g., compute distribution, average, max, min, standard deviation, variance, mode, skewness, etc.)

        • Analyze basic statistics and relate the results to their meaning in business terms

        • Decide if the attribute is relevant for the specific data mining goal

        • Determine if the attribute meaning is used consistently

        • Interview domain experts to obtain their opinion of attribute relevance

        • Decide if it is necessary to balance the data (based on the modeling techniques to be used)

      • Keys

        • Analyze key relationships

        • Check amount of overlaps of key attribute values across tables
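
The Describe Data activities above (basic statistics per attribute, overlap of key values across tables) can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not part of the CRISP-DM guide. The function names (basic_stats, key_overlap) and the sample values are hypothetical, and skewness is left out because the standard library has no function for it.

```python
import statistics

def basic_stats(values):
    """Return basic descriptive statistics for one numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),        # sample standard deviation
        "variance": statistics.variance(values),  # sample variance
        "mode": statistics.mode(values),
    }

def key_overlap(keys_a, keys_b):
    """Count key values shared by two tables and those unique to each."""
    a, b = set(keys_a), set(keys_b)
    return {"both": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}

# Hypothetical data: one numeric attribute and the key columns of two tables.
ages = [23, 35, 35, 41, 58]
customer_ids = [1, 2, 3, 4]          # keys in a "customers" table
order_customer_ids = [2, 3, 3, 5]    # foreign keys in an "orders" table

age_stats = basic_stats(ages)
overlap = key_overlap(customer_ids, order_customer_ids)
print(age_stats)
print(overlap)
```

Keys that appear in only one table (here, "only_a" and "only_b") are the ones worth raising with domain experts, since they may point to missing records or inconsistent key usage.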

  • Explore Data

    • Data Exploration Report

  • Verify Data Quality

    • Activities

      • Identify special values and catalog their meaning

      • Review keys, attributes

        • Check coverage (e.g., whether all possible values are represented)

        • Check keys

        • Verify that the meanings of attributes and contained values fit together

        • Identify missing attributes and blank fields

        • Establish the meaning of missing data

        • Check for attributes with different values that have similar meanings (e.g., low fat, diet)

        • Check spelling and format of values (e.g., same value but sometimes beginning with a lower-case letter, sometimes with an upper-case letter)

        • Check for deviations, and decide whether a deviation is “noise” or may indicate an interesting phenomenon

        • Check for plausibility of values (e.g., all fields having the same or nearly the same values)

      • Good idea!

        • Review any attributes that give answers that conflict with common sense (e.g., teenagers with high income levels).

        • Use visualization plots, histograms, etc. to reveal inconsistencies in the data.

      • Data quality in flat files

        • If data are stored in flat files, check which delimiter is used and whether it is used consistently within all attributes

        • If data are stored in flat files, check the number of fields in each record to see if they coincide

      • Noise and inconsistencies between sources

        • Check consistencies and redundancies between different sources

        • Plan for dealing with noise

        • Detect the type of noise and which attributes are affected

      • Good idea!

        • Remember that it may be necessary to exclude some data because they exhibit neither positive nor negative behavior (e.g., to check on customers’ loan behavior, exclude all those who have never borrowed, those who do not finance a home mortgage, those whose mortgage is nearing maturity, etc.).

    • Data Quality Report
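
Several of the Verify Data Quality checks above lend themselves to simple automation: whether every record in a flat file has the same number of fields, whether values differ only in letter case, and how many fields are blank. The following is a rough sketch under stated assumptions, not a method prescribed by the guide. The helper names, the comma delimiter, and the sample file contents are all hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

def read_records(text, delimiter=","):
    """Parse delimited text into a list of records, skipping empty lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]

def field_count_profile(records):
    """Map number-of-fields to the number of records with that many fields."""
    return dict(Counter(len(r) for r in records))

def case_variants(values):
    """Group values that differ only by letter case or surrounding spaces."""
    groups = defaultdict(set)
    for v in values:
        groups[v.strip().lower()].add(v)
    return {k: sorted(vs) for k, vs in groups.items() if len(vs) > 1}

# Hypothetical flat file: record 3 has an extra field, record 4 a blank category.
SAMPLE = (
    "id,product,category\n"
    "1,milk,Low fat\n"
    "2,yogurt,low fat\n"
    "3,cola,diet,extra\n"
    "4,juice,\n"
)

records = read_records(SAMPLE)
profile = field_count_profile(records)                    # more than one key means inconsistent records
categories = [r[2] for r in records[1:] if len(r) > 2]    # third column, header skipped
variants = case_variants(categories)
blanks = sum(1 for v in categories if not v.strip())
print(profile, variants, blanks)
```

A sketch like this only catches mechanical inconsistencies. Deciding that two different values such as "low fat" and "diet" mean the same thing, or what a blank field means, still requires domain knowledge.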

Data Preparation

  • Select Data

    • Rationale for Inclusion/Exclusion

  • Clean Data

    • Data Cleaning Report

  • Construct Data

    • Derived Attributes

    • Generated Records

  • Integrate Data

    • Merged Data

  • Format Data

    • Reformatted Data

  • Dataset

    • Dataset Description

Modeling

  • Select Modeling Techniques

    • Modeling Technique

    • Modeling Assumptions

  • Generate Test Design

    • Test Design

  • Build Model

    • Parameter Settings

    • Models

    • Model Descriptions

  • Assess Model

    • Model Assessment

    • Revised Parameter Settings

Evaluation

  • Evaluate Results

    • Assessment of Data Mining Results w.r.t. Business Success Criteria

    • Approved Models

  • Review Process

    • Review of Process

  • Determine Next Steps

    • List of Possible Actions

    • Decision

Deployment

  • Plan Deployment

    • Deployment Plan

  • Plan Monitoring and Maintenance

    • Monitoring and Maintenance Plan

  • Produce Final Report

    • Final Report

    • Final Presentation

  • Review Project

    • Experience Documentation