51  Understanding Data Science

51.1 Two Cultures Analysis — Data vs. Algorithmic Modeling

Objective

Apply Leo Breiman’s “Two Cultures” framework to evaluate modeling approaches for a specific business problem.

You are consulting for a subscription streaming service that wants to predict customer churn. The executive team is divided: the statisticians want to build an interpretable logistic regression model to understand why customers leave, while the ML engineers want to deploy a neural network to achieve the highest possible prediction accuracy.

Breiman (2001) contrasted statistical (data) modeling and algorithmic modeling. The former assumes that the data are generated by a stochastic data model. If we accept the data model as an abstraction of nature’s mechanism, then conclusions about nature will be based on the model. The data model becomes the lens through which we interpret the world. The latter treats the data-generating mechanism as unknown and judges a model by its predictive accuracy.

Model Comparison Framework (1 page)

Data Modeling Approach

Design/Describe a statistical model for churn prediction

  • Specify your assumptions about the data-generating process
  • List the variables you would include and why
  • Explain what business insights this approach would provide
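
As one possible starting point, and not the only valid data model, the logistic regression favored by the statisticians could be written as follows. The notation is introduced for this sketch only: Y_i = 1 if customer i churns, \pi_i is that customer's churn probability, and x_{i1}, ..., x_{ip} are whatever predictors you decide to include.

```latex
Y_i \mid \mathbf{x}_i \sim \text{Bernoulli}(\pi_i), \qquad
\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}
```

Under this model, \exp(\beta_j) is the multiplicative change in the odds of churn for a one-unit increase in x_{ij} with the other predictors held fixed. The assumptions worth stating explicitly include independence across customers and linearity of the predictors on the log-odds scale.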

Algorithmic Modeling Approach

Design/Describe an algorithmic solution

  • Choose your ML algorithm and justify the selection
  • Describe how you would find or engineer features (model inputs)
  • Explain how you would optimize for prediction accuracy

Trade-off Analysis (1 page)

Create a decision matrix comparing the approaches across:

  • Interpretability: Can business users understand the model?
  • Accuracy: Prediction performance on holdout data
  • Generalizability: How well will it work on future data?
  • Implementation: Ease of deployment and maintenance
  • Business Value: Which approach better serves organizational needs?
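
One way to make the decision matrix concrete is a weighted score per approach. The sketch below is illustrative only: the criterion weights and the 1-to-5 scores are hypothetical placeholders, not findings, and should be replaced with values your stakeholders agree on.

```python
# Weighted decision matrix sketch. All weights and scores are hypothetical.
CRITERIA_WEIGHTS = {
    "interpretability": 0.25,
    "accuracy": 0.25,
    "generalizability": 0.20,
    "implementation": 0.15,
    "business_value": 0.15,
}

# Scores on a 1 (poor) to 5 (excellent) scale; placeholders for illustration.
SCORES = {
    "logistic_regression": {
        "interpretability": 5, "accuracy": 3, "generalizability": 4,
        "implementation": 5, "business_value": 4,
    },
    "neural_network": {
        "interpretability": 2, "accuracy": 5, "generalizability": 3,
        "implementation": 2, "business_value": 4,
    },
}


def weighted_score(scores, weights):
    """Return the weight-averaged score for one approach."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


def rank_approaches(all_scores, weights):
    """Return (approach, score) pairs sorted from highest to lowest score."""
    ranked = [(name, weighted_score(s, weights)) for name, s in all_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_approaches(SCORES, CRITERIA_WEIGHTS):
        print(f"{name}: {score:.2f}")
```

The ranking is only as good as the weights, so it is worth rerunning the calculation with a few alternative weightings to see whether the preferred approach changes.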

Recommendation & Justification (1 page)

  • Choose your recommended approach and explain why
  • Address the concerns of the opposing camp
  • Design a hybrid solution that combines strengths of both approaches

51.2 Evolution of Data Science Roles

Objective

Understand the evolution of data science roles.

Investigate how data science roles have evolved in the past decade. Interview 2-3 data professionals (statistician, data scientist, data engineer, data analyst, business analyst, etc.) and compare their daily responsibilities.

Write a paper with your findings (about 1,000 words).

51.3 Self-Assessment

Objective

Assess your personal readiness with respect to technical foundation, business foundation, and human skills.

Evaluate your current competencies across:

  • Technical Foundation (math, stats, programming)
  • Business Foundation (project management, business understanding, domain knowledge)
  • Human Skills (communication, teamwork, problem-solving)

Rate your competence on a five-point Likert scale with categories Emerging, Developing, Competent, Proficient, and Mastery. Identify three areas for improvement.

51.4 Career Path Analysis

Objective

Understand current careers in data science.

Research five job postings for data science roles in your target industry. Analyze the required versus the preferred qualifications.

  • Are the postings consistent with respect to job titles and qualifications?
  • Which non-technical skills are emphasized?
  • Which technical skills are emphasized?

51.5 Technology Adaptation & Learning

Objective

Demonstrate ability to quickly learn and evaluate new data science technologies.

Technology Trend Analysis

Identify an emerging technology in data science that was not mainstream 5 years ago (e.g., LLMs, MLOps tools, new visualization frameworks). Research and create a briefing document covering:

  • Technology overview and capabilities
  • Potential applications in business contexts
  • Advantages and limitations
  • Implementation considerations
  • Future outlook

Hands-on Exploration

Spend a day learning and experimenting with this technology. Document your learning process:

  • Resources used
  • Challenges encountered
  • Key insights gained
  • Practical applications identified

Recommendation Report

Write a 2-page executive summary recommending whether your organization should invest in this technology, including:

  • Strategic fit assessment
  • Cost-benefit analysis
  • Implementation roadmap
  • Risk mitigation strategies

51.6 Project Failure Analysis

Objective

Analyze a data science project failure using the seven-stage lifecycle framework and design prevention strategies.

Choose one of these real project failures to analyze:

  1. Theranos Blood Testing: Elizabeth Holmes’ fraudulent blood testing technology
  2. IBM Watson for Oncology: Cancer treatment recommendations that were unsafe and incorrect
  3. Google Flu Trends: Search-based flu prediction that dramatically overestimated outbreaks
  4. Microsoft Tay Chatbot: AI that became racist within 24 hours
  5. Cambridge Analytica: Facebook data used for political manipulation

Data science projects involve many steps and diverse teams, and they can fail at any step. The model building phase is not where most unsuccessful projects go off the rails. Industry analyst firm Gartner estimated in 2016 that about 60% of data science projects fail and admitted two years later that this was an underestimate; the real failure rate was closer to 85%.

Lifecycle Stage Analysis (2 pages)

Map the failure to the seven-stage lifecycle:

  1. Discovery: What was the original business problem? Was it well-defined?
  2. Data Engineering: What data was used? Were there quality issues?
  3. Model Planning: Were appropriate methods chosen?
  4. Model Building: What algorithms were used? How were they validated?
  5. Communication: How were results presented to stakeholders?
  6. Deployment: How was the solution implemented?
  7. Monitoring: Was ongoing performance tracked?

For each stage, identify:

  • What went wrong (if anything)
  • Warning signs that were missed
  • Stakeholders who should have intervened

Root Cause Analysis (1 page)

Apply the “Swiss Cheese Model” to identify failure layers:

  • Individual: Personal decisions, biases, skill gaps
  • Team: Communication failures, groupthink, inadequate review
  • Organizational: Pressure, incentives, resource constraints
  • Regulatory: Oversight gaps, compliance failures

Prevention Framework (1 page)

Design specific interventions for each lifecycle stage:

  • Checkpoints: What questions should be asked at each stage?
  • Guardrails: What controls would prevent similar failures?
  • Stakeholders: Who should be involved in oversight?
  • Metrics: How would you measure project health?

51.7 Project Failure Prevention Toolkit

Objective

Create a practical toolkit for preventing common data science project failures.

You are the first data science hire at a mid-sized company. Based on industry failure rates of 60-85%, you want to create a toolkit that will help your projects succeed where others fail.

The most important reason data science projects fail is the “last-mile problem” of data science: the struggle to deploy the results of the data analysis (models) into the processes and applications where the organization uses them and where they serve the intended end user.

Project Health Checklist

Create a checklist for each lifecycle stage with specific yes/no questions:

Discovery Phase

  • [Continue with 2-3 more questions per phase…]

Data Engineering Phase

  • [Continue…]

Create similar checklists for all seven phases.

Red Flag Early Warning System

Design an early warning system with specific indicators:

Technical Red Flags

  • Model accuracy that seems too good to be true
  • Data that’s “too clean” without explanation
  • Significant difference between training and validation performance
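
The technical red flags above can be turned into an automated check. The following is a minimal sketch, assuming scores between 0 and 1 where higher is better; the function name and both thresholds (`max_gap`, `suspicious_score`) are hypothetical and would need tuning for a real project.

```python
def performance_red_flags(train_score, validation_score,
                          max_gap=0.05, suspicious_score=0.99):
    """Return a list of red-flag messages for a pair of scores in [0, 1].

    Both thresholds are illustrative defaults, not established standards.
    """
    flags = []
    if validation_score >= suspicious_score:
        flags.append("validation score looks too good to be true; "
                     "check for target leakage")
    if train_score - validation_score > max_gap:
        flags.append("training score exceeds validation score by more "
                     "than max_gap; possible overfitting")
    return flags


if __name__ == "__main__":
    for train, valid in [(0.98, 0.80), (0.90, 0.88), (1.00, 0.995)]:
        print(train, valid, performance_red_flags(train, valid))
```

A check like this only raises the flag; deciding whether the cause is leakage, overfitting, or a genuinely easy problem still requires investigation.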

Process Red Flags

  • Scope creep without timeline adjustments
  • Stakeholder disagreement on success metrics
  • IT department not involved in deployment planning

Organizational Red Flags

  • Lack of domain expert involvement
  • Unrealistic timeline expectations
  • No plan for ongoing model maintenance

Intervention Playbook

For each red flag category, provide specific intervention strategies:

When Technical Issues Arise

  • Stop and investigate rather than pushing forward
  • Bring in additional technical expertise
  • Reassess approach and methodology

When Process Issues Arise

  • [Specific actions…]

When Organizational Issues Arise

  • [Specific actions…]

Communication Templates

Create email templates for common challenging situations:

  1. Project Delay Notification: Template for informing stakeholders about timeline changes
  2. Scope Change Request: Template for requesting approval for scope modifications
  3. Resource Request: Template for requesting additional resources or expertise
  4. Risk Escalation: Template for escalating risks to executive sponsors

51.8 Team Structure Design Challenge

Objective

Design optimal data science team structures for different organizational contexts.

You are a data science consultant hired to design team structures for three different organizations:

  1. StartupTech: 50-person SaaS company, needs first data science hire
  2. RegionalBank: 5,000-person financial institution, wants to expand existing analytics
  3. GlobalRetail: 50,000-person multinational, has scattered data efforts across divisions

The question of whether to operate data science as a single centralized team, also known as a data science center of excellence (COE), or as separate decentralized teams in the units of the organization is irrelevant for small organizations with a single team or possibly a single data scientist. In larger organizations the question is relevant.

Role Definition (1 page)

For each organization, specify:

  • Required roles: Which positions are essential vs. nice-to-have?
  • Skill priorities: What expertise matters most for their context?
  • Reporting structure: Where should data science roles sit organizationally?

Use the role definitions from Chapter 4:

  • Data Scientist
  • Data Engineer
  • Data Analyst
  • Data Science Architect
  • Data Science Developer/ML Engineer
  • Product Manager
  • Project Manager

Organizational Structure Recommendation (1 page)

For each organization, choose and justify:

  • Centralized: Single data science center of excellence
  • Decentralized: Embedded teams in business units
  • Hybrid: Mixed approach with shared infrastructure

Consider:

  • Company size and maturity
  • Data governance needs
  • Business unit autonomy
  • Available talent pool
  • Budget constraints

Implementation Roadmap (1 page)

Create a hiring and development plan for each organization:

  • Phase 1 (0-6 months): Critical first hires
  • Phase 2 (6-18 months): Team expansion
  • Phase 3 (18+ months): Optimization and scaling

Include:

  • Specific job descriptions for first hires
  • Success metrics for each phase
  • Budget estimates
  • Risk mitigation strategies

51.9 Computational Thinking Case Study

Objective

Apply the five elements of computational thinking to solve a complex data science problem.

You work for a city government that wants to optimize emergency response times. The fire department, police, and paramedics currently operate independently, often resulting in delayed responses and inefficient resource allocation. Design a data-driven solution to optimize emergency response across all three services.

The five elements of computational thinking are Problem Definition, Decomposition (Factoring), Pattern Recognition, Generalization (Abstraction), and Algorithm Design. Working on a data science project combines computational thinking and quantitative thinking in the face of uncertainty.

Problem Definition

Apply the first element of computational thinking:

  • Specific Problem Statement: What exactly are you trying to optimize?
  • Success Metrics: How will you measure improvement?
  • Constraints: What limitations must your solution work within?
  • Stakeholders: Who are the key players and what do they need?

Decomposition

Break the complex problem into manageable parts:

  • Response Process Steps: Map the current emergency response workflow
  • Data Components: What types of data are involved?
  • System Components: Which departments, technologies, and processes are involved?
  • Sub-problems: What smaller problems can be solved independently?

Pattern Recognition  

Identify patterns and relationships:

  • Historical Patterns: What trends exist in emergency response data?
  • Geographic Patterns: How do location factors affect response times?
  • Temporal Patterns: How do time-of-day, day-of-week, and seasonal factors matter?
  • Resource Patterns: How are current resources allocated and utilized?

Generalization

Abstract the core logic of your solution:

  • General Rules: What universal principles apply to emergency response optimization?
  • Transferable Components: Which parts of your solution could work for other cities?
  • Core Algorithm Logic: What is the essential logic of your optimization approach?
  • Scalability Considerations: How would your solution adapt to different city sizes?

Algorithm Design

Design the step-by-step solution:

  • Data Collection Algorithm: How will you gather and process emergency data?
  • Prediction Algorithm: How will you forecast response needs?
  • Optimization Algorithm: How will you allocate resources optimally?
  • Feedback Algorithm: How will the system learn and improve over time?

Include specific steps, decision points, and implementation details.
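
To illustrate the level of detail expected, here is a minimal sketch of one possible baseline for the optimization step: a greedy rule that sends the nearest available unit of the requested service. The `Unit` class, its fields, and the use of straight-line distance as a stand-in for travel time are all assumptions made for this example; a real solution would need road networks, priorities, and coverage constraints.

```python
import math
from dataclasses import dataclass


@dataclass
class Unit:
    """A hypothetical response unit located at planar coordinates (x, y)."""
    unit_id: str
    service: str          # "fire", "police", or "paramedic"
    x: float
    y: float
    available: bool = True


def dispatch_nearest(units, service, x, y):
    """Assign the closest available unit of the given service, or None.

    Greedy baseline: it ignores future incidents, so it is not optimal.
    """
    candidates = [u for u in units if u.available and u.service == service]
    if not candidates:
        return None       # decision point: escalate or request mutual aid
    chosen = min(candidates, key=lambda u: math.hypot(u.x - x, u.y - y))
    chosen.available = False
    return chosen


if __name__ == "__main__":
    fleet = [
        Unit("F1", "fire", 0.0, 0.0),
        Unit("F2", "fire", 5.0, 5.0),
        Unit("P1", "paramedic", 1.0, 1.0),
    ]
    unit = dispatch_nearest(fleet, "fire", 4.0, 4.0)
    print(unit.unit_id if unit else "no unit available")
```

A greedy rule like this is a useful benchmark: a more sophisticated allocation algorithm should be able to beat it on your chosen success metrics.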

51.10 Quantitative Thinking & Data Intuition

Objective

Develop data intuition skills and practice quantitative thinking through analysis of tricky datasets.

Developing intuition for data is one of the most valuable skills a data scientist can acquire. It helps you flag things that are surprising, curious, worth another look, or that do not pass the smell test.

Data Intuition Scenarios

For each scenario below, identify what might be wrong and propose explanations:

E-commerce Conversion Rates

  • Desktop: 3.2% conversion rate
  • Mobile: 1.8% conversion rate
  • Tablet: 0.1% conversion rate

What’s suspicious about the tablet rate? What could explain this?

Survey Response Analysis

A satisfaction survey shows 95% of customers rate the service as “excellent” or “very good”, but customer retention is declining.

What data intuition flags should this raise?

A/B Test Results

  • Version A: 10,000 users, 8.5% conversion
  • Version B: 100 users, 15.2% conversion

Should you implement Version B? What’s your data intuition?
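
Intuition about sample size can be backed up with a quick uncertainty calculation. The sketch below computes an approximate 95% Wilson score interval for each version's conversion rate; the function name is made up for this example, and the interval is only one of several reasonable choices. Because 15.2% of 100 users is not a whole number of conversions, the function works from the reported rate rather than from a count.

```python
import math


def wilson_interval(rate, n, z=1.96):
    """Approximate Wilson score interval for a proportion observed on n users.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("n must be positive and rate must lie in [0, 1]")
    denom = 1 + z ** 2 / n
    center = (rate + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(rate * (1 - rate) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    for label, rate, n in [("A", 0.085, 10_000), ("B", 0.152, 100)]:
        low, high = wilson_interval(rate, n)
        print(f"Version {label}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

The interval for the 100-user group is far wider than the one for the 10,000-user group, which is worth weighing before acting on the headline rates.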

Regional Sales Performance

All sales regions show 15-20% growth except one region showing 45% growth.

What questions would your data intuition prompt you to ask?

Quantification Challenge

Practice quantitative thinking by estimating and justifying:

Net Promoter Score Analysis

A SaaS company’s NPS improved from 30 to 40. Explain three different ways this could have happened (specific scenarios with numbers).
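
For reference, NPS is the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 to 6); passives (7 or 8) count only toward the total. A small helper, sketched here with made-up respondent counts, makes it easy to test your scenarios.

```python
def net_promoter_score(promoters, passives, detractors):
    """Return NPS: percent promoters minus percent detractors."""
    total = promoters + passives + detractors
    if total == 0:
        raise ValueError("at least one response is required")
    return 100 * (promoters - detractors) / total


if __name__ == "__main__":
    # Hypothetical baseline: 50 promoters, 30 passives, 20 detractors.
    print(net_promoter_score(50, 30, 20))
```

Changing any one of the three counts, or the mix of respondents, moves the score, which is why the same improvement can have very different explanations.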

Happiness Index Design

Design a “Student Happiness Index” for your university. What measurable components would you include? How would you weight them? What are the assumptions in your quantification approach?

Business Metric Creation

Create a single metric to measure “data science team effectiveness” in an organization. Justify your choice and explain potential limitations.

Pattern Recognition Puzzles

Solve these data science-related pattern recognition challenges:

Time Series Pattern

Monthly revenue: Jan $100K, Feb $150K, Mar $125K, Apr $175K, May $150K, Jun $200K

What pattern do you see? Predict July and August.
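
A simple first step with any short series is to look at the changes between consecutive values. The helper below is a minimal sketch; the function name is chosen for this example.

```python
# Monthly revenue in thousands of dollars, taken from the exercise.
REVENUE_K = {"Jan": 100, "Feb": 150, "Mar": 125, "Apr": 175, "May": 150, "Jun": 200}


def month_over_month_changes(values):
    """Return the difference between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    print(month_over_month_changes(list(REVENUE_K.values())))
```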

Classification Pattern

Feature A   Feature B   Category
2           8           A
4           6           A
6           4           B
8           2           B
3           ?           A

What should Feature B be for the last row?

Correlation Insight

You find a strong positive correlation (r=0.85) between ice cream sales and drowning deaths by month. A marketing executive wants to ban ice cream sales to prevent drownings.

Use your data intuition to explain what’s really happening.
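
A small simulation can help test an explanation. The sketch below uses entirely synthetic data in which a hypothetical seasonal variable z drives both x and y while x has no effect on y, then compares the raw correlation of x and y with the partial correlation after accounting for z. All numbers, names, and noise levels are assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_months(seed=0):
    """Synthetic monthly data: z drives both x and y; x never affects y."""
    rng = random.Random(seed)
    z = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]   # made-up seasonal values
    x = [10 * value + rng.gauss(0, 20) for value in z]
    y = [0.5 * value + rng.gauss(0, 1) for value in z]
    return x, y, z


if __name__ == "__main__":
    x, y, z = simulate_months()
    r_xy = pearson_r(x, y)
    r_partial = partial_r(r_xy, pearson_r(x, z), pearson_r(y, z))
    print("r(x, y):", round(r_xy, 2))
    print("r(x, y | z):", round(r_partial, 2))
```

If the partial correlation is much smaller than the raw one, the third variable accounts for most of the association between x and y.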

51.11 Assessment Criteria

  • Solutions are implementable in real-world contexts
  • Shows understanding of business constraints and requirements
  • Demonstrates awareness of organizational dynamics
  • Applies course frameworks correctly
  • Shows systematic problem-solving approach
  • Identifies key issues and relationships
  • Ideas are expressed clearly and concisely
  • Uses appropriate business language
  • Provides actionable recommendations
  • Connects concepts from different course modules
  • Shows understanding of data science project complexities
  • Demonstrates growth in thinking skills