51 Understanding Data Science
51.1 Two Cultures Analysis — Data vs. Algorithmic Modeling
Apply Leo Breiman’s “Two Cultures” framework to evaluate modeling approaches for a specific business problem
You are consulting for a subscription streaming service that wants to predict customer churn. The executive team is divided: the statisticians want to build an interpretable logistic regression model to understand why customers leave, while the ML engineers want to deploy a neural network to achieve the highest possible prediction accuracy.
Breiman (2001) contrasted statistical (data) modeling and algorithmic modeling. The former assumes that the data are generated by a stochastic data model. If we accept the data model as an abstraction of nature’s mechanism, then conclusions about nature will be based on the model. The data model becomes the lens through which we interpret the world.
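As an illustration of what a stochastic data model looks like for the churn problem, the sketch below writes out a logistic model by hand. The variable names, the intercept, and the coefficient values are hypothetical and chosen only for illustration; a real model would estimate them from the service's data. The sketch shows how each coefficient translates into an odds ratio, which is the kind of interpretation the data modeling culture values.

```python
import math

# Hypothetical coefficients for illustration only; a real model would
# estimate them from the streaming service's own customer data.
INTERCEPT = -2.0
COEFFICIENTS = {
    "months_since_last_login": 0.8,
    "support_tickets": 0.3,
    "annual_plan": -1.2,
}


def churn_probability(customer):
    """Logistic data model: P(churn) = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-linear))


def odds_ratios():
    """exp(b_j): multiplicative change in churn odds per one-unit increase."""
    return {name: math.exp(beta) for name, beta in COEFFICIENTS.items()}


# A made-up customer record used only to exercise the functions.
customer = {"months_since_last_login": 2, "support_tickets": 1, "annual_plan": 0}
print(f"P(churn) = {churn_probability(customer):.3f}")
for name, ratio in odds_ratios().items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

An odds ratio below 1 (here, for the assumed `annual_plan` indicator) would be read as a factor associated with lower churn odds, provided the model's assumptions hold.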
Model Comparison Framework (1 page)
Data Modeling Approach
Design/Describe a statistical model for churn prediction
- Specify your assumptions about the data-generating process
- List the variables you would include and why
- Explain what business insights this approach would provide
Algorithmic Modeling Approach
Design/Describe an algorithmic solution
- Choose your ML algorithm and justify the selection
- Describe how you would find or engineer features (model inputs)
- Explain how you would optimize for prediction accuracy
Trade-off Analysis (1 page)
Create a decision matrix comparing the approaches across:
- Interpretability: Can business users understand the model?
- Accuracy: Prediction performance on holdout data
- Generalizability: How well will it work on future data?
- Implementation: Ease of deployment and maintenance
- Business Value: Which approach better serves organizational needs?
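One possible way to organize the decision matrix is a weighted score per approach. The sketch below assumes a 1 to 5 scoring scale and a set of weights that sum to 1; every weight and score shown is a placeholder, not a recommendation, and should be replaced with values you can justify for the streaming service.

```python
# Hypothetical weights (summing to 1) and 1-5 scores. Both are placeholders
# to be replaced with values agreed on with the stakeholders.
WEIGHTS = {
    "interpretability": 0.25,
    "accuracy": 0.25,
    "generalizability": 0.20,
    "implementation": 0.15,
    "business_value": 0.15,
}

SCORES = {
    "logistic_regression": {
        "interpretability": 5,
        "accuracy": 3,
        "generalizability": 4,
        "implementation": 4,
        "business_value": 4,
    },
    "neural_network": {
        "interpretability": 2,
        "accuracy": 5,
        "generalizability": 3,
        "implementation": 2,
        "business_value": 3,
    },
}


def weighted_score(scores, weights=WEIGHTS):
    """Sum of weight * score over all criteria in the decision matrix."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


for approach, scores in SCORES.items():
    print(f"{approach}: {weighted_score(scores):.2f}")
```

The ranking depends entirely on the chosen weights, so it is worth rerunning the comparison with the weights each camp would propose.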
Recommendation & Justification (1 page)
- Choose your recommended approach and explain why
- Address the concerns of the opposing camp
- Design a hybrid solution that combines strengths of both approaches
51.2 Evolution of Data Science Roles
Understand the evolution of data science roles.
Investigate how data science roles have evolved in the past decade. Interview 2-3 data professionals (statistician, data scientist, data engineer, data analyst, business analyst, etc.) and compare their daily responsibilities.
Write a paper with your findings (about 1,000 words).
51.3 Self-Assessment
Assess your personal readiness with respect to technical foundation, business foundation, and human skills.
Evaluate your current competencies across:
- Technical Foundation (math, stats, programming)
- Business Foundation (project management, business understanding, domain knowledge)
- Human Skills (communication, teamwork, problem-solving)
Rate your competence on a five-point Likert scale with categories Emerging, Developing, Competent, Proficient, and Mastery. Identify three areas for improvement.
51.4 Career Path Analysis
Understand current careers in data science.
Research five job postings for data science roles in your target industry. Analyze the required versus the preferred qualifications.
- Are the postings consistent with respect to job titles and qualifications?
- Which non-technical skills are emphasized?
- Which technical skills are emphasized?
51.5 Technology Adaptation & Learning
Demonstrate ability to quickly learn and evaluate new data science technologies.
Technology Trend Analysis
Identify an emerging technology in data science that was not mainstream 5 years ago (e.g., LLMs, MLOps tools, new visualization frameworks). Research and create a briefing document covering:
- Technology overview and capabilities
- Potential applications in business contexts
- Advantages and limitations
- Implementation considerations
- Future outlook
Hands-on Exploration
Spend a day learning and experimenting with this technology. Document your learning process:
- Resources used
- Challenges encountered
- Key insights gained
- Practical applications identified
Recommendation Report
Write a 2-page executive summary recommending whether your organization should invest in this technology, including:
- Strategic fit assessment
- Cost-benefit analysis
- Implementation roadmap
- Risk mitigation strategies
51.6 Project Failure Analysis
Analyze a data science project failure using the seven-stage lifecycle framework and design prevention strategies.
Choose one of these real project failures to analyze:
- Theranos Blood Testing: Elizabeth Holmes’ fraudulent blood testing technology
  Key Resources:
  - DOJ Sentencing Report (Official government case)
  - Harvard Business School Case Study (Academic analysis)
  - Pan Macmillan Timeline (Comprehensive overview)
- IBM Watson for Oncology: Cancer treatment recommendations that were unsafe and incorrect
  Key Resources:
  - STAT Investigation (Original reporting)
  - Healthcare Dive Analysis (Industry perspective)
  - AI Incident Database (Systematic documentation)
- Google Flu Trends: Search-based flu prediction that dramatically overestimated outbreaks
  Key Resources:
  - Wikipedia Overview (Comprehensive timeline)
  - TIME Magazine Analysis (Big data critique)
  - PLOS Research Paper (Academic study)
- Microsoft Tay Chatbot: AI that became racist within 24 hours
- Cambridge Analytica: Facebook data used for political manipulation
  Key Resources:
  - Wikipedia Timeline (Comprehensive overview)
  - Bipartisan Policy Center Analysis (Policy perspective)
  - CNBC Timeline (Business timeline)
Data science projects involve many steps and diverse teams, and they can fail at any step. Most unsuccessful projects do not go off the rails in the model building phase. The industry analyst firm Gartner estimated in 2016 that about 60% of data science projects fail and acknowledged two years later that this was an underestimate; the real failure rate was closer to 85%.
Lifecycle Stage Analysis (2 pages)
Map the failure to the seven-stage lifecycle:
- Discovery: What was the original business problem? Was it well-defined?
- Data Engineering: What data was used? Were there quality issues?
- Model Planning: Were appropriate methods chosen?
- Model Building: What algorithms were used? How were they validated?
- Communication: How were results presented to stakeholders?
- Deployment: How was the solution implemented?
- Monitoring: Was ongoing performance tracked?
For each stage, identify:
- What went wrong (if anything)
- Warning signs that were missed
- Stakeholders who should have intervened
Root Cause Analysis (1 page)
Apply the “Swiss Cheese Model” to identify failure layers:
- Individual: Personal decisions, biases, skill gaps
- Team: Communication failures, groupthink, inadequate review
- Organizational: Pressure, incentives, resource constraints
- Regulatory: Oversight gaps, compliance failures
Prevention Framework (1 page)
Design specific interventions for each lifecycle stage:
- Checkpoints: What questions should be asked at each stage?
- Guardrails: What controls would prevent similar failures?
- Stakeholders: Who should be involved in oversight?
- Metrics: How would you measure project health?
51.7 Project Failure Prevention Toolkit
Create a practical toolkit for preventing common data science project failures.
You are the first data science hire at a mid-sized company. Based on industry failure rates of 60-85%, you want to create a toolkit that will help your projects succeed where others fail.
The most important reason data science projects fail is the “last-mile problem”: the struggle to deploy the results of the data analysis (models) into the processes and applications where the organization uses them and where they serve the intended end user.
Project Health Checklist
Create a checklist for each lifecycle stage with specific yes/no questions:
Discovery Phase
- [Continue with 2-3 more questions per phase…]
Data Engineering Phase
- [Continue…]
Create similar checklists for all seven phases.
Red Flag Early Warning System
Design an early warning system with specific indicators:
Technical Red Flags
- Model accuracy that seems too good to be true
- Data that’s “too clean” without explanation
- Significant difference between training and validation performance
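The third technical red flag can be turned into an automated check. The following is a minimal sketch: the function name and the default threshold of 0.05 are assumptions made for illustration, and a sensible threshold depends on the metric and the problem.

```python
def performance_gap_flag(train_score, validation_score, max_gap=0.05):
    """Return True when training beats validation by more than max_gap.

    max_gap is a hypothetical default; choose it per metric and project.
    """
    return (train_score - validation_score) > max_gap


# Made-up scores: a large gap is flagged, a small one is not.
print(performance_gap_flag(0.99, 0.81))
print(performance_gap_flag(0.86, 0.84))
```

A check like this only raises the question. Whether the gap reflects overfitting, data leakage, or a shift between the two data sets still needs investigation.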
Process Red Flags
- Scope creep without timeline adjustments
- Stakeholder disagreement on success metrics
- IT department not involved in deployment planning
Organizational Red Flags
- Lack of domain expert involvement
- Unrealistic timeline expectations
- No plan for ongoing model maintenance
Intervention Playbook
For each red flag category, provide specific intervention strategies:
When Technical Issues Arise
- Stop and investigate rather than pushing forward
- Bring in additional technical expertise
- Reassess approach and methodology
When Process Issues Arise
- [Specific actions…]
When Organizational Issues Arise
- [Specific actions…]
Communication Templates
Create email templates for common challenging situations:
- Project Delay Notification: Template for informing stakeholders about timeline changes
- Scope Change Request: Template for requesting approval for scope modifications
- Resource Request: Template for requesting additional resources or expertise
- Risk Escalation: Template for escalating risks to executive sponsors
51.8 Team Structure Design Challenge
Design optimal data science team structures for different organizational contexts.
You are a data science consultant hired to design team structures for three different organizations:
- StartupTech: 50-person SaaS company, needs first data science hire
- RegionalBank: 5,000-person financial institution, wants to expand existing analytics
- GlobalRetail: 50,000-person multinational, has scattered data efforts across divisions
Whether to run data science as a centralized team, also known as a data science center of excellence (COE), or as separate decentralized teams embedded in the units of the organization is irrelevant for small organizations with a single team or possibly a single data scientist. In larger organizations the question matters.
Role Definition (1 page)
For each organization, specify:
- Required roles: Which positions are essential vs. nice-to-have?
- Skill priorities: What expertise matters most for their context?
- Reporting structure: Where should data science roles sit organizationally?
Use the role definitions from Chapter 4:
- Data Scientist
- Data Engineer
- Data Analyst
- Data Science Architect
- Data Science Developer/ML Engineer
- Product Manager
- Project Manager
Organizational Structure Recommendation (1 page)
For each organization, choose and justify:
- Centralized: Single data science center of excellence
- Decentralized: Embedded teams in business units
- Hybrid: Mixed approach with shared infrastructure
Consider:
- Company size and maturity
- Data governance needs
- Business unit autonomy
- Available talent pool
- Budget constraints
Implementation Roadmap (1 page)
Create a hiring and development plan for each organization:
- Phase 1 (0-6 months): Critical first hires
- Phase 2 (6-18 months): Team expansion
- Phase 3 (18+ months): Optimization and scaling
Include:
- Specific job descriptions for first hires
- Success metrics for each phase
- Budget estimates
- Risk mitigation strategies
51.9 Computational Thinking Case Study
Apply the five elements of computational thinking to solve a complex data science problem.
You work for a city government that wants to optimize emergency response times. The fire department, police, and paramedics currently operate independently, often resulting in delayed responses and inefficient resource allocation. Design a data-driven solution to optimize emergency response across all three services.
The five elements of computational thinking are Problem Definition, Decomposition (Factoring), Pattern Recognition, Generalization (Abstraction), and Algorithm Design. Working on a data science project combines computational thinking and quantitative thinking in the face of uncertainty.
Problem Definition
Apply the first element of computational thinking:
- Specific Problem Statement: What exactly are you trying to optimize?
- Success Metrics: How will you measure improvement?
- Constraints: What limitations must your solution work within?
- Stakeholders: Who are the key players and what do they need?
Decomposition
Break the complex problem into manageable parts:
- Response Process Steps: Map the current emergency response workflow
- Data Components: What types of data are involved?
- System Components: Which departments, technologies, and processes are involved?
- Sub-problems: What smaller problems can be solved independently?
Pattern Recognition
Identify patterns and relationships:
- Historical Patterns: What trends exist in emergency response data?
- Geographic Patterns: How do location factors affect response times?
- Temporal Patterns: How do time-of-day, day-of-week, and seasonal factors matter?
- Resource Patterns: How are current resources allocated and utilized?
Generalization
Abstract the core logic of your solution:
- General Rules: What universal principles apply to emergency response optimization?
- Transferable Components: Which parts of your solution could work for other cities?
- Core Algorithm Logic: What is the essential logic of your optimization approach?
- Scalability Considerations: How would your solution adapt to different city sizes?
Algorithm Design
Design the step-by-step solution:
- Data Collection Algorithm: How will you gather and process emergency data?
- Prediction Algorithm: How will you forecast response needs?
- Optimization Algorithm: How will you allocate resources optimally?
- Feedback Algorithm: How will the system learn and improve over time?
Include specific steps, decision points, and implementation details.
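As a starting point for the level of detail expected, the sketch below shows a naive baseline for the allocation step: send the closest available unit of the requested service. The unit records, field names, and the use of straight-line distance on a flat grid are all assumptions made for illustration; a realistic design would use travel times and consider future demand.

```python
import math

# Hypothetical unit records; locations are (x, y) positions in kilometres
# on a flat grid, which ignores roads, traffic, and one-way streets.
UNITS = [
    {"id": "medic-1", "service": "paramedics", "location": (0.0, 0.0), "available": True},
    {"id": "medic-2", "service": "paramedics", "location": (2.0, 1.0), "available": True},
    {"id": "medic-3", "service": "paramedics", "location": (4.5, 4.0), "available": False},
    {"id": "fire-1", "service": "fire", "location": (5.0, 5.0), "available": True},
]


def nearest_available_unit(incident, units):
    """Pick the closest available unit of the requested service, or None."""
    candidates = [
        unit for unit in units
        if unit["available"] and unit["service"] == incident["service"]
    ]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda unit: math.dist(unit["location"], incident["location"]),
    )


# A made-up incident: medic-3 is closest but unavailable.
incident = {"service": "paramedics", "location": (4.0, 4.0)}
chosen = nearest_available_unit(incident, UNITS)
print(chosen["id"] if chosen else "no unit available")
```

The decision points are visible in the code: filtering by availability and service, handling the case where no unit qualifies, and the distance rule used to rank candidates. Each of these is a place where your own design may differ.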
51.10 Quantitative Thinking & Data Intuition
Develop data intuition skills and practice quantitative thinking through analysis of tricky datasets.
Developing intuition for data is one of the most valuable skills a data scientist can acquire. It helps you flag things that are surprising, curious, or worth another look, or that do not pass the smell test.
Data Intuition Scenarios
For each scenario below, identify what might be wrong and propose explanations:
E-commerce Conversion Rates
- Desktop: 3.2% conversion rate
- Mobile: 1.8% conversion rate
- Tablet: 0.1% conversion rate
What’s suspicious about the tablet rate? What could explain this?
Survey Response Analysis
A satisfaction survey shows 95% of customers rate the service as “excellent” or “very good”, but customer retention is declining.
What data intuition flags should this raise?
A/B Test Results
- Version A: 10,000 users, 8.5% conversion
- Version B: 100 users, 15.2% conversion
Should you implement Version B? What’s your data intuition?
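One way to back intuition with a number is to ask how precisely each conversion rate is estimated. The sketch below uses the normal approximation to the binomial distribution, which is an assumption and can be rough for small samples; the function names are illustrative.

```python
import math


def standard_error(rate, n):
    """Normal-approximation standard error of a proportion from n users."""
    return math.sqrt(rate * (1.0 - rate) / n)


def approx_interval(rate, n, z=1.96):
    """Approximate 95% interval for the conversion rate (z = 1.96)."""
    margin = z * standard_error(rate, n)
    return rate - margin, rate + margin


for label, rate, n in [("A", 0.085, 10_000), ("B", 0.152, 100)]:
    low, high = approx_interval(rate, n)
    print(f"Version {label}: {rate:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

Comparing the widths of the two intervals shows how much less is known about the version tested on the smaller group.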
Regional Sales Performance
All sales regions show 15-20% growth except one region showing 45% growth.
What questions would your data intuition prompt you to ask?
Quantification Challenge
Practice quantitative thinking by estimating and justifying:
Net Promoter Score Analysis
A SaaS company’s NPS improved from 30 to 40. Explain three different ways this could have happened (specific scenarios with numbers).
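When building scenarios, it helps to compute the score from response counts. The sketch below uses the standard NPS definition: respondents rating 9 or 10 are promoters, 7 or 8 are passives, and 0 through 6 are detractors, and the score is the percentage of promoters minus the percentage of detractors. The counts in the example are made up.

```python
def net_promoter_score(promoters, passives, detractors):
    """NPS = 100 * (promoters - detractors) / all respondents."""
    total = promoters + passives + detractors
    if total == 0:
        raise ValueError("NPS is undefined without any responses")
    return 100.0 * (promoters - detractors) / total


# Made-up baseline: 50 promoters, 30 passives, 20 detractors.
print(net_promoter_score(50, 30, 20))  # 30.0
```

Because passives count toward the total but not toward the difference, the same score can arise from quite different mixes of respondents.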
Happiness Index Design
Design a “Student Happiness Index” for your university. What measurable components would you include? How would you weight them? What are the assumptions in your quantification approach?
Business Metric Creation
Create a single metric to measure “data science team effectiveness” in an organization. Justify your choice and explain potential limitations.
Pattern Recognition Puzzles
Solve these data science-related pattern recognition challenges:
Time Series Pattern
Monthly revenue: Jan $100K, Feb $150K, Mar $125K, Apr $175K, May $150K, Jun $200K
What pattern do you see? Predict July and August.
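A common first step with a short series is to look at the month-over-month changes rather than the levels. A minimal sketch, using the revenue figures from the exercise (in thousands of dollars):

```python
# Monthly revenue from the exercise, in thousands of dollars.
revenue = {"Jan": 100, "Feb": 150, "Mar": 125, "Apr": 175, "May": 150, "Jun": 200}

months = list(revenue)
# Difference between each month and the month before it.
changes = [revenue[later] - revenue[earlier] for earlier, later in zip(months, months[1:])]

for month, change in zip(months[1:], changes):
    print(f"{month}: {change:+d}K")
```

With only six observations, any pattern in the changes is a hypothesis to state along with your forecast, not an established regularity.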
Classification Pattern
| Feature A | Feature B | Category |
|---|---|---|
| 2 | 8 | A |
| 4 | 6 | A |
| 6 | 4 | B |
| 8 | 2 | B |
| 3 | ? | A |
What should Feature B be for the last row?
Correlation Insight
You find a strong positive correlation (r=0.85) between ice cream sales and drowning deaths by month. A marketing executive wants to ban ice cream sales to prevent drownings.
Use your data intuition to explain what’s really happening.
51.11 Assessment Criteria
- Solutions are implementable in real-world contexts
- Shows understanding of business constraints and requirements
- Demonstrates awareness of organizational dynamics
- Applies course frameworks correctly
- Shows systematic problem-solving approach
- Identifies key issues and relationships
- Ideas are expressed clearly and concisely
- Uses appropriate business language
- Provides actionable recommendations
- Connects concepts from different course modules
- Shows understanding of data science project complexities
- Demonstrates growth in thinking skills