7 Assignments & Exercises

7.1 Two Cultures — Reading Assignment

Objective

Read Breiman’s “Two Cultures” article and the response by Brad Efron. Answer questions about the points raised in the original article and the response.

Read Breiman (2001) and the response by Brad Efron, professor of statistics at Stanford, former president of the American Statistical Association and the Institute of Mathematical Statistics, and inventor of the bootstrap technique.

Also, read Breiman’s rejoinder to Efron’s comment.

Which theory, in Efron’s view, has made statistics the “dominant interpretational methodology in dozens of fields”?
What is required to sustain that dominance and what trends are signaling a change?
One of the issues raised by Efron is that new algorithms have many “knobs to twiddle”. Breiman rejoins that random forests have only one parameter and support vector machines have only two. Check out the RandomForestClassifier and SVC classes in scikit-learn. How many of their constructor arguments are actual “knobs to twiddle” that we would classify as hyperparameters that affect the model training process?
How does Breiman respond to Efron’s rules:
Rule 1. New methods always look better than old ones.
Rule 2. Complicated methods are harder to criticize than simple ones.
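
As a starting point for the scikit-learn question, the constructor arguments can be listed programmatically. The sketch below relies only on `inspect.signature` from the standard library; the helper name `constructor_defaults` is hypothetical, and the usage part assumes scikit-learn is installed. Deciding which of the listed arguments are true hyperparameters, as opposed to settings such as `verbose` or `random_state` that control output or reproducibility, is still left to you.

```python
import inspect


def constructor_defaults(cls):
    """Return {argument name: default value} for a class constructor."""
    params = inspect.signature(cls).parameters
    return {
        name: p.default
        for name, p in params.items()
        # Skip *args / **kwargs style parameters, which carry no default.
        if p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    }


if __name__ == "__main__":
    try:
        # Assumes scikit-learn is installed.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.svm import SVC
    except ImportError:
        print("scikit-learn is not installed; nothing to list.")
    else:
        for estimator in (RandomForestClassifier, SVC):
            defaults = constructor_defaults(estimator)
            print(f"{estimator.__name__}: {len(defaults)} constructor arguments")
            for name, default in defaults.items():
                print(f"  {name} = {default!r}")
```

The exact list of arguments depends on the installed scikit-learn version, so the counts may differ from those Breiman had in mind.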

7.2 Two Cultures Analysis — Data vs. Algorithmic Modeling

Objective

Apply Leo Breiman’s “Two Cultures” framework to evaluate modeling approaches for a specific business problem.

You are consulting for a subscription streaming service that wants to predict customer churn. The executive team is divided: the statisticians want to build an interpretable logistic regression model to understand why customers leave, while the ML engineers want to deploy a neural network to achieve the highest possible prediction accuracy.

Breiman (2001) contrasted statistical (data) modeling and algorithmic modeling. The former assumes that the data are generated by a stochastic data model. If we accept the data model as an abstraction of nature’s mechanism, then conclusions about nature will be based on the model. The data model becomes the lens through which we interpret the world.

Model Comparison Framework (1 page)

Data Modeling Approach

Design/Describe a statistical model for churn prediction

Specify your assumptions about the data-generating process
List the variables you would include and why
Explain what business insights this approach would provide

Algorithmic Modeling Approach

Design/Describe an algorithmic solution

Choose your ML algorithm and justify the selection
Describe how you would find or engineer features (model inputs)
Explain how you would optimize for prediction accuracy

Trade-off Analysis (1 page)

Create a decision matrix comparing the approaches across:

Interpretability: Can business users understand the model?
Accuracy: Prediction performance on holdout data
Generalizability: How well will it work on future data?
Implementation: Ease of deployment and maintenance
Business Value: Which approach better serves organizational needs?

Recommendation & Justification (1 page)

Choose your recommended approach and explain why
Address the concerns of the opposing camp
Design a hybrid solution that combines strengths of both approaches

7.3 Evolution of Data Science Roles

Objective

Understand the evolution of data science roles.

Investigate how data science roles have evolved in the past decade. Interview 2-3 data professionals (statistician, data scientist, data engineer, data analyst, business analyst, etc.) and compare their daily responsibilities.

Write a paper with your findings (about 1,000 words).

7.4 Self-Assessment

Objective

Assess your personal readiness with respect to technical foundation, business foundation, and human skills.

Evaluate your current competencies across:

Technical Foundation (math, stats, programming)
Business Foundation (project management, business understanding, domain knowledge)
Human Skills (communication, teamwork, problem-solving)

Rate your competence on a five-point Likert scale with categories Emerging, Developing, Competent, Proficient, and Mastery. Identify three areas for improvement.

7.5 Career Path Analysis

Objective

Understand current careers in data science.

Research five job postings for data science roles in your target industry. Analyze the required versus the preferred qualifications.

Are the postings consistent with respect to job titles and qualifications?
Which non-technical skills are emphasized?
Which technical skills are emphasized?

7.6 Technology Adaptation & Learning

Objective

Demonstrate ability to quickly learn and evaluate new data science technologies.

Technology Trend Analysis

Identify an emerging technology in data science that was not mainstream 5 years ago (e.g., LLMs, MLOps tools, new visualization frameworks). Research and create a briefing document covering:

Technology overview and capabilities
Potential applications in business contexts
Advantages and limitations
Implementation considerations
Future outlook

Hands-on Exploration

Spend a day learning and experimenting with this technology. Document your learning process:

Resources used
Challenges encountered
Key insights gained
Practical applications identified

Recommendation Report

Write a 2-page executive summary recommending whether your organization should invest in this technology, including:

Strategic fit assessment
Cost-benefit analysis
Implementation roadmap
Risk mitigation strategies

7.7 Project Failure Analysis

Objective

Analyze a data science project failure using the seven-stage lifecycle framework and design prevention strategies.

Choose one of these real project failures to analyze:

Theranos Blood Testing: Elizabeth Holmes’ fraudulent blood testing technology
Key Resources:
- DOJ Sentencing Report (Official government case)
- Harvard Business School Case Study (Academic analysis)
- Pan Macmillan Timeline (Comprehensive overview)
IBM Watson for Oncology: Cancer treatment recommendations that were unsafe and incorrect
Key Resources:
- STAT Investigation (Original reporting)
- Healthcare Dive Analysis (Industry perspective)
- AI Incident Database (Systematic documentation)
Google Flu Trends: Search-based flu prediction that dramatically overestimated outbreaks
Key Resources:
- Wikipedia Overview (Comprehensive timeline)
- TIME Magazine Analysis (Big data critique)
- PLOS Research Paper (Academic study)
Microsoft Tay Chatbot: AI that became racist within 24 hours
Key Resources:
- Wikipedia Overview
Cambridge Analytica: Facebook data used for political manipulation
Key Resources:
- Wikipedia Timeline (Comprehensive overview)
- Bipartisan Policy Center Analysis (Policy perspective)
- CNBC Timeline (Business timeline)

Data science projects involve many steps and diverse teams, and they can fail at any step. The model building phase is not where most unsuccessful projects go off the rails. Industry analyst firm Gartner estimated in 2016 that about 60% of data science projects fail and admitted two years later that this was an underestimate; the real failure rate was closer to 85%.

Lifecycle Stage Analysis (2 pages)

Map the failure to the seven-stage lifecycle:

Discovery: What was the original business problem? Was it well-defined?
Data Engineering: What data was used? Were there quality issues?
Model Planning: Were appropriate methods chosen?
Model Building: What algorithms were used? How were they validated?
Communication: How were results presented to stakeholders?
Deployment: How was the solution implemented?
Monitoring: Was ongoing performance tracked?

For each stage, identify:

- What went wrong (if anything)
- Warning signs that were missed
- Stakeholders who should have intervened

Root Cause Analysis (1 page)

Apply the “Swiss Cheese Model” to identify failure layers:

Individual: Personal decisions, biases, skill gaps
Team: Communication failures, groupthink, inadequate review
Organizational: Pressure, incentives, resource constraints
Regulatory: Oversight gaps, compliance failures

In the Swiss Cheese Model, the defenses against failures are modeled as a series of slices of cheese, each having holes (eyes) at seemingly random locations. The holes represent weaknesses in a layer of defense through which a problem can pass. The defense against failures works when the holes do not align, so that a problem passing through one layer is caught by a subsequent layer. The system fails when the holes align and a problem passes through all layers without getting caught.
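
The alignment argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each layer misses a problem independently with some probability. Both the independence assumption and the numbers are hypothetical and are not part of the model's description above.

```python
import math


def prob_undetected(miss_probs):
    """Probability that a problem slips through every layer.

    Assumes layers fail independently, so the per-layer miss
    probabilities simply multiply.
    """
    return math.prod(miss_probs)


# Hypothetical miss probabilities for the four layers
# (individual, team, organizational, regulatory).
layers = [0.2, 0.1, 0.1, 0.05]
print(f"All holes align with probability {prob_undetected(layers):.4%}")

# Removing one layer of defense (here: the team review) raises the risk.
print(f"Without the team layer: {prob_undetected([0.2, 0.1, 0.05]):.4%}")
```

In real failures the layers are rarely independent (organizational pressure, for example, can widen the holes in several layers at once), which is worth keeping in mind when you apply the model to your chosen case.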

Prevention Framework (1 page)

Design specific interventions for each lifecycle stage:

Checkpoints: What questions should be asked at each stage?
Guardrails: What controls would prevent similar failures?
Stakeholders: Who should be involved in oversight?
Metrics: How would you measure project health?

7.8 Project Failure Prevention Toolkit

Objective

Create a practical toolkit for preventing common data science project failures.

You are the first data science hire at a mid-sized company. Based on industry failure rates of 60-85%, you want to create a toolkit that will help your projects succeed where others fail.

The most important reason for failure of data science projects is the “last-mile problem” of data science: the struggle to deploy the result of the data analysis (models) into processes and applications where they are used by the organization and are serving the intended end user.

Project Health Checklist

Create a checklist for each lifecycle stage with specific yes/no questions:

Discovery Phase

Is the business problem specific enough to measure success?
Have we identified all key stakeholders?
Do we have executive sponsorship?
[Continue with 2-3 more questions per phase…]

Data Engineering Phase

Have we assessed data quality and availability?
Are data sources reliable and accessible?
[Continue…]

Create similar checklists for all seven phases.

Red Flag Early Warning System

Design an early warning system with specific indicators:

Technical Red Flags

Model accuracy that seems too good to be true
Data that’s “too clean” without explanation
Significant difference between training and validation performance

Process Red Flags

Scope creep without timeline adjustments
Stakeholder disagreement on success metrics
IT department not involved in deployment planning

Organizational Red Flags

Lack of domain expert involvement
Unrealistic timeline expectations
No plan for ongoing model maintenance

Intervention Playbook

For each red flag category, provide specific intervention strategies:

When Technical Issues Arise

Stop and investigate rather than pushing forward
Bring in additional technical expertise
Reassess approach and methodology

When Process Issues Arise

[Specific actions…]

When Organizational Issues Arise

[Specific actions…]

Communication Templates

Create email templates for common challenging situations:

Project Delay Notification: Template for informing stakeholders about timeline changes
Scope Change Request: Template for requesting approval for scope modifications
Resource Request: Template for requesting additional resources or expertise
Risk Escalation: Template for escalating risks to executive sponsors

7.9 Team Structure Design Challenge

Objective

Design optimal data science team structures for different organizational contexts.

You are a data science consultant charged with designing team structures for three different organizations:

StartupTech: 50-person SaaS company, needs first data science hire
RegionalBank: 5,000-person financial institution, wants to expand existing analytics
GlobalRetail: 50,000-person multinational, has scattered data efforts across divisions

Remember that the question of whether to operate data science team(s) as a centralized team, also known as a data science center of excellence (COE), or as separate decentralized teams in the units of the organization is relevant in larger organizations.

Role Definition (1 page)

For each organization, specify:

Required roles: Which positions are essential vs. nice-to-have?
Skill priorities: What expertise matters most for their context?
Reporting structure: Where should data science roles sit organizationally?

Use the role definitions from Chapter 4:

Data Scientist
Data Engineer
Data Analyst
Data Science Architect
Data Science Developer/ML Engineer
Product Manager
Project Manager

Organizational Structure Recommendation (1 page)

For each organization, choose and justify:

Centralized: Single data science center of excellence
Decentralized: Embedded teams in business units
Hybrid: Mixed approach with shared infrastructure

Consider:

Company size and maturity
Data governance needs
Business unit autonomy
Available talent pool
Budget constraints

Implementation Roadmap (1 page)

Create a hiring and development plan for each organization:

Phase 1 (0-6 months): Critical first hires
Phase 2 (6-18 months): Team expansion
Phase 3 (18+ months): Optimization and scaling

Include:

Specific job descriptions for first hires
Success metrics for each phase
Budget estimates
Risk mitigation strategies

7.10 Computational Thinking Case Study

Objective

Apply the five elements of computational thinking to solve a complex data science problem.

You work for a city government that wants to optimize emergency response times. The fire department, police, and paramedics currently operate independently, often resulting in delayed responses and inefficient resource allocation. Design a data-driven solution to optimize emergency response across all three services.

Recall that the five elements of computational thinking are Problem Definition, Decomposition (Factoring), Pattern Recognition, Generalization (Abstraction), and Algorithm Design.

Problem Definition

Apply the first element of computational thinking:

Specific Problem Statement: What exactly are you trying to optimize?
Success Metrics: How will you measure improvement?
Constraints: What limitations must your solution work within?
Stakeholders: Who are the key players and what do they need?

Decomposition

Break the complex problem into manageable parts:

Response Process Steps: Map the current emergency response workflow
Data Components: What types of data are involved?
System Components: Which departments, technologies, and processes are involved?
Sub-problems: What smaller problems can be solved independently?

Pattern Recognition

Identify patterns and relationships:

Historical Patterns: What trends exist in emergency response data?
Geographic Patterns: How do location factors affect response times?
Temporal Patterns: How do time-of-day, day-of-week, and seasonal factors matter?
Resource Patterns: How are current resources allocated and utilized?

Generalization

Abstract the core logic of your solution:

General Rules: What universal principles apply to emergency response optimization?
Transferable Components: Which parts of your solution could work for other cities?
Core Algorithm Logic: What is the essential logic of your optimization approach?
Scalability Considerations: How would your solution adapt to different city sizes?

Algorithm Design

Design the step-by-step solution:

Data Collection Algorithm: How will you gather and process emergency data?
Prediction Algorithm: How will you forecast response needs?
Optimization Algorithm: How will you allocate resources optimally?
Feedback Algorithm: How will the system learn and improve over time?

Include specific steps, decision points, and implementation details.

7.11 Net Promoter Score

Refer to the discussion of the Net Promoter Score in Chapter 5.

What are the assumptions in the NPS calculation?
Suppose that company Foo improved its NPS from 30 to 40 over the last year. Explain how that can happen.
What does NPS tell you about a company that has many products and/or services?
What impact could cultural differences and societal norms and traditions have on NPS values around the world?
What do you think is a great net promoter score? Does it depend on the industry?
Companies are applying NPS in other contexts, not just to measure customer satisfaction. For example, the employee NPS (eNPS) uses NPS methodology and the question “How likely are you to recommend company X as a place of work?” What do you think about that?
If you plot NPS by age, what would that look like? In other words, do you expect younger or older consumers to have higher/lower NPS?
List reasons why NPS is (might be) a troubling indicator.
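
For reference when working through these questions, the score can be computed in a few lines. The sketch below uses the standard NPS definition on a 0 to 10 scale (promoters rate 9 or 10, detractors rate 0 through 6), which is assumed to match the Chapter 5 discussion; the function name and the survey numbers are hypothetical.

```python
def net_promoter_score(ratings):
    """NPS = percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Hypothetical survey: 50 promoters, 30 passives, 20 detractors.
ratings = [10] * 50 + [8] * 30 + [5] * 20
print(net_promoter_score(ratings))
```

Changing the hypothetical ratings and rerunning the function is a quick way to explore how the score reacts to shifts between the three groups.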

7.12 World Happiness Report

Read the section “Measuring and Explaining National Differences in Life Evaluations” in the 2024 World Happiness Report.

Which variables is the ranking of happiness based on?
How many citizens of each country participate in the survey?
Does WHR collect its own data or does it rely on someone else’s survey?
The data includes three indicators for well-being. Are they all used in determining the happiness rankings?
In the discussion of the methods, can you determine whether the happiness rankings involve some form of modeling, where survey responses are tied to other variables? If so, what are the variables?

7.13 Indicators: Quantifying the Economy

An indicator is a quantitative or qualitative factor or variable that offers a direct, simple, unique and reliable signal or means to measure achievements. Find at least two indicators in each of the following aspects of the economy:

International trade
Housing and construction
Consumer spending
Manufacturing
Climate
Labor markets

What are the indicators used for; that is, what do they indicate?

7.14 Data Intuition

Objective

Develop data intuition skills and practice quantitative thinking through analysis of tricky datasets.

Developing intuition for data is one of the most valuable skills a data scientist can acquire. It will help to flag things that are surprising, curious, worth another look, or that do not pass the smell test.

For each scenario below, identify what might be wrong and propose explanations:

E-commerce Conversion Rates

Desktop: 3.2% conversion rate
Mobile: 1.8% conversion rate
Tablet: 0.1% conversion rate

What is suspicious about the tablet rate? What could explain this?

Survey Response Analysis
A satisfaction survey shows 95% of customers rate the service as “excellent” or “very good”, but customer retention is declining.

What flags should this raise?

A/B Test Results

Version A: 10,000 users, 8.5% conversion
Version B: 100 users, 12% conversion

Should you implement Version B? What is your intuition?
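
One way to put numbers behind the intuition is to compare the uncertainty of the two observed rates. The sketch below computes a normal-approximation (Wald) 95% interval for each conversion rate. The choice of interval and the helper name `wald_interval` are assumptions made for illustration, and with only 100 users the approximation is itself rough.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Counts implied by the scenario: 8.5% of 10,000 and 12% of 100.
for label, successes, n in [("A", 850, 10_000), ("B", 12, 100)]:
    low, high = wald_interval(successes, n)
    print(f"Version {label}: {successes / n:.1%} "
          f"(95% interval {low:.1%} to {high:.1%})")
```

Comparing the widths of the two intervals, and whether they overlap, is a useful first check before considering a formal test.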

Regional Sales Performance
All sales regions show 15-20% growth except one region showing 45% growth.

What questions does your data intuition prompt you to ask?

7.15 Quantification Challenge

Practice quantitative thinking by estimating and justifying:

Net Promoter Score Analysis
A SaaS company’s NPS improved from 30 to 40. Explain three different ways this could have happened (give specific scenarios; with numbers if possible).

Happiness Index Design
Design a “Student Happiness Index” for your university. What measurable components would you include? How would you weight them? What are the assumptions in your quantification approach?

Business Metric Creation
Create a single metric to measure “data science team effectiveness” in an organization. Justify your choice and explain potential limitations.

Correlation Insight
You find a strong positive correlation (r=0.85) between ice cream sales and drowning deaths by month. A marketing executive wants to ban ice cream sales to prevent drownings. Explain what is (might be) really happening.

7.16 Probability Puzzles

The Loaded Coin

You have five coins. Four of them are fair coins, for which heads and tails are equally likely. The fifth coin has been altered so that it always lands on heads. You draw one of the coins at random.

If you flip the randomly chosen coin twice, what is the probability of observing two heads?
What is the probability of drawing the rigged coin?
What is the probability of having drawn the rigged coin if you flip the coin twice and both tosses land on heads?
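
Once you have worked out the answers analytically, a simulation is a convenient way to check them. The following is a minimal Monte Carlo sketch of the experiment as described; the function name, the number of trials, and the seed are arbitrary choices.

```python
import random


def simulate(trials=200_000, seed=1):
    """Estimate P(two heads) and P(rigged coin | two heads)."""
    rng = random.Random(seed)
    two_heads = 0
    rigged_and_two_heads = 0
    for _ in range(trials):
        rigged = rng.randrange(5) == 0      # one of the five coins is rigged
        p_heads = 1.0 if rigged else 0.5
        flips = [rng.random() < p_heads for _ in range(2)]
        if all(flips):
            two_heads += 1
            rigged_and_two_heads += rigged  # True counts as 1
    return two_heads / trials, rigged_and_two_heads / two_heads


p_hh, p_rigged_given_hh = simulate()
print(f"P(two heads)          is about {p_hh:.3f}")
print(f"P(rigged | two heads) is about {p_rigged_given_hh:.3f}")
```

The estimates carry sampling error, so expect agreement with your analytic answers only to two or three decimal places.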

Where Was I?

A hiker climbs up a mountain path and camps for the night when he arrives at the mountain top. The next day he begins the descent down the same trail to the bottom of the mountain when suddenly he looks at his watch and exclaims,

That is amazing! I was at this very same spot at exactly the same time of day yesterday on my way up.

What is the probability that a hiker will be at exactly the same spot on the mountain at the same time of day on his return trip, as he was on the previous day’s hike up the mountain?

Drawing Socks

Mismatched Joe is in a pitch-dark room selecting socks from his drawer. He has only six socks in his drawer, and he owns only black and white socks. If he chooses two socks at random, the chance that he draws a white pair is 2/3. What is the chance that he gets a black pair of socks?

7.17 Logic Reasoning

Wason Selection Task

This test was invented by psychologist Peter C. Wason. You are presented with four cards; each card has a letter on one side and a number on the opposite side (Figure 7.1).

Which card or cards must you flip over to verify whether the following statement is true?

If a card has a vowel on one side, then it has an even number on the other side.

Figure 7.1: Four cards for the Wason Selection Task.

The point of the selection task is to flip the smallest number of cards necessary to verify the validity of the statement.

Birth Months

Four sisters, Sara, Ophelia, Nora, and Dawn, were each born in a different one of the months September, October, November, and December.

Ophelia said one day:

This is terrible. None of us have an initial that matches the initial of her birth month.

The girl who was born in September replied:

I don’t mind at all.

That triggered a response from Nora:

That’s easy for you to say. It would at least be cool if the initial of my birth month was a vowel, but no.

In which month was each girl born?

Five Names

A family has five children. The first was born in April, her name is May. The second was born in May, her name is June. Two sons born in June and July are named Julian and August. What is the name of the last child, born in August?

7.18 Pattern Recognition Puzzles

Time Series

A company posts the following monthly revenue numbers:

Jan $100K
Feb $150K
Mar $125K
Apr $175K
May $150K
Jun $200K

What is your prediction for revenue in July and August?

Classification

The following table shows data on a two-category variable and features $X_1$ and $X_2$. The value for $X_2$ in the last row is missing.

Feature $X_1$	Feature $X_2$	Category
2	8	A
4	6	A
6	4	B
8	2	B
3	?	A

What value do you choose to impute the missing data point?