6  Introduction

The CRISP-DM methodology remains one of the most widely adopted frameworks for data mining projects (Shearer 2000). While various other methodologies have emerged (Mariscal, Marban, and Fernandez 2010; Kurgan and Musilek 2006), CRISP-DM’s emphasis on business understanding has contributed to its longevity (Martinez-Plumed et al. 2021). Figure 6.1 shows an augmented flow of a data science project according to this methodology. You will see these steps reflected in the module organization of our material.

Most data science projects start with the Business Understanding phase, in which business stakeholders and the data science team work together to frame the business opportunity and explore possible data science solutions. While most of the work in this phase occurs early in the project, the phase will likely need to be revisited (often many times) as both stakeholders and the data science team better understand the problem and solution space. Most high-value data science projects are complex, with interconnected objectives and systems that demand deep knowledge of processes, data, and (often competing) business objectives.

graph TD
    B["Business Understanding"]:::phase --> D["Data Understanding"]:::phase
    D --> B
    D --> P["Data Preparation"]:::phase
    P --> D
    P --> M["Modeling"]:::phase
    M --> P
    M --> E["Evaluation"]:::phase
    E --> M
    E --> DEP["Deployment"]:::phase
    DEP --> MON["Monitoring"]:::phase

    B -.- B1["• <u>Frame problem/solution</u><br/>• Define KPIs<br/>• Set success criteria"]
    D -.- D1["• Identify & explore data<br/>• Verify quality<br/>• Identify subsets"]
    P -.- P1["• Clean & transform<br/>• <u>Feature creation</u><br/>• <u>Feature selection I</u>"]
    M -.- M1["• Build models<br/>• <u>Feature selection II</u><br/>• Evaluate models"]
    E -.- E1["• Evaluate vs KPIs"]
    DEP -.- DEP1["• Pilot & full deployment<br/>• KPI evaluation"]
    MON -.- MON1["• Monitor performance"]

    classDef phase fill:#f9f9f9,stroke:#333,stroke-width:2px
    classDef note fill:#fff,stroke:#999,stroke-width:1px
    
    class B,D,P,M,E,DEP,MON phase
    class B1,D1,P1,M1,E1,DEP1,MON1 note

Figure 6.1: Augmented CRISP-DM
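
The flow in Figure 6.1 can also be read as a small directed graph. The sketch below is one possible encoding in Python, not part of the CRISP-DM specification: the arrows are copied from the figure, while the names `CRISP_DM_FLOW` and `can_return_to` are illustrative assumptions. It makes the feedback loops explicit, for example that Modeling can step back to Data Preparation while Deployment only moves forward to Monitoring.

```python
# Phase transitions copied from the arrows in Figure 6.1.
# The dictionary layout and the helper name are illustrative assumptions.
CRISP_DM_FLOW = {
    "Business Understanding": ["Data Understanding"],
    "Data Understanding": ["Business Understanding", "Data Preparation"],
    "Data Preparation": ["Data Understanding", "Modeling"],
    "Modeling": ["Data Preparation", "Evaluation"],
    "Evaluation": ["Modeling", "Deployment"],
    "Deployment": ["Monitoring"],
    "Monitoring": [],
}


def can_return_to(phase: str) -> list[str]:
    """Return the earlier phases that `phase` has a direct arrow back to."""
    order = list(CRISP_DM_FLOW)  # dicts preserve insertion order, so this is the phase sequence
    position = order.index(phase)
    return [p for p in CRISP_DM_FLOW[phase] if order.index(p) < position]


if __name__ == "__main__":
    for phase in CRISP_DM_FLOW:
        print(f"{phase}: can return to {can_return_to(phase)}")
```

In this encoding, the only direct arrow back into Business Understanding comes from Data Understanding. Later phases reach it only by stepping back one phase at a time.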

The primary role of the Business Understanding phase is to provide a two-way discovery process between stakeholders and the data science team. The process is two-way because stakeholders need to understand the art of the possible from a data science perspective, and the data science team needs to fully understand the business opportunity, including alignment with strategic company objectives, business benefits, measurement criteria, and potential roadblocks. The success of a data science project is often determined by the effectiveness of the work done in this phase of the CRISP-DM process. Misalignment caused by ineffective Business Understanding can lead to project failure through unrealistic expectations about outcomes, failure to identify “showstopper” risks, or lack of support due to poor strategic alignment.

Figure 6.2: Business Understanding Venn Diagram

When data scientists have a shallow understanding of business processes and opportunities, they often create trivial solutions that offer little value to the business. They may focus on building solutions where data is easily accessible, or on leveraging cutting-edge algorithms, rather than on the problems that matter most. This can lead business users to question the value of the entire data science approach.

When business users don’t understand the capabilities of data science, they tend to fall back on old patterns of thinking (often business intelligence-based) that do not leverage the power of fully integrated predictive and prescriptive solutions (Debortoli, Müller, and Brocke 2014). Business intelligence applications masquerading as data science projects deliver less value, which leads the business to question the value of the entire data science investment (Provost and Fawcett 2013).