1 Introduction

1.1 How to Use This Material

This material is intended for a regular 3-credit-hour course at the graduate level. At Virginia Tech, it is a first-semester core course in the Master of Science in Data Science program.¹

If you are self-studying this material, concentrate on the areas where you have gaps. The chapters do not have to be read sequentially. If the terms in the opening chapters are unfamiliar to you, you may need to work through Module VIII. Review Topics first. If your background is in data engineering, Module III. Data Engineering will be very familiar; spend your time elsewhere.

The material is organized in modules. Module I discusses the history of data science as a cross-disciplinary activity that draws on mathematics, statistics, and computer science and applies the “mixture” to solve real-world problems in a specific domain.

Solving such problems is viewed as an end-to-end process that starts with a business, research, or policy question and ends with the implementation of a solution that uses data. The data science project life cycle, as we call it, describes the iterative process of working through these projects as a team sport.

The first module also introduces you to some case studies in data science, to data science teams, and to thinking like a data scientist. You might enjoy some of the quantitative puzzles at the end of the module.

The modules that follow cover specific steps in the data science project life cycle more deeply.

Module II introduces you to the discovery phase of the cycle where you develop business understanding and translate it into data science activities.

Module III introduces working with data. We discuss various data sources and formats and how to access them. Not everything comes as a CSV file. Then we go on a “first date” with the data using profiling tools that can highlight problems with data quality. Summarizing and visualizing data and analysis results are major activities that follow once the data is properly pre-processed. We would be remiss not to cover the integration of data sources and SQL, the Structured Query Language that is the backbone of many database management systems. Paraphrasing an employer:

There is much we do not agree on in this field, but we agree that at some point you need to access data in a database and you need to use version control: learn SQL and learn Git.

Module IV. Modeling Data takes up much space since this is a key activity of data scientists. After discussing some general concepts such as types of learning, types of models, the bias-variance tradeoff, correlation and causation, and others, you get an overview of the various models you build and deploy in data science projects. Our model classification is not just based on the typical division into supervised and unsupervised methods of learning. It is based on primary questions you might encounter, such as “What natural groupings exist in this data?” or “What values should we expect?”

The goal is to introduce these topics and to develop the vernacular of data science and your thinking as a data scientist. Since the material is foundational, we cannot cover models and modeling to the extent necessary for a data science degree. This is left to additional material, for example, the Statistical Learning methods courses.

Module V. Communication presents fundamental concepts and ideas about communicating as a member of a data science project team. You learn aspects of the theory of communication, the importance of storytelling with data, how to present well, and how to explain data science content to a non-technical audience.

In Module VI. Operationalization you learn what it takes to move a data solution from the playpen to production. Many data science projects fail during the last mile: successfully deploying a model within the IT infrastructure of the organization. We discuss offline and online deployments, enterprise machine learning architectures, how to write REST APIs to communicate between services, and the orchestration of the services that make up a complete data science solution.

Module VII. Applied Ethics in Data Science covers an increasingly important topic: working with data in a manner that avoids perpetuating stereotypes, avoids harm, and honors privacy concerns. We cover many case studies and examples where things have gone wrong because of unintended consequences, a focus on the wrong performance metrics, bad data, or other reasons.

How algorithms can introduce bias and cause harm is the subject of the following chapter. Separate chapters are dedicated to concerns around personal information and data privacy and to ethical considerations in generative artificial intelligence.

The final module of the material, Module VIII. Review Topics, contains material that is considered prerequisite for the Foundation material.

¹ The course is followed by two 3-credit courses on statistical learning, a 3-credit course on communications in team-based data science, six credits in computer science, a 3-credit capstone experience, and 12 elective credits from various disciplines.