Foundations of Data Science

Beyond the Numbers

Author

Oliver Schabenberger

Published

August 6, 2025

Preface

Foundations of Data Science

Since data science appeared on the scene some decades ago, it has been subject to constant change. In the beginning many were not quite sure what to make of data science. How was it a scientific discipline, and how was it different from statistics or data analytics? Was the scientific method involved? Was data scientist not just another name for business analyst?

The dust has settled and we have a much better handle on what data scientists do, the skills they need, and how the role differs from that of other data professionals. At the same time the data profession in general has evolved and keeps expanding. At first, a data scientist was the all-round unicorn who was an expert in engineering data, modeling data, writing code, and implementing data solutions. Those individuals were difficult to find and difficult to train.

In smaller organizations the workload of the data scientist continues to be fluid, because the number of folks who can perform data-related tasks is limited. In larger organizations with dedicated data teams you might find data engineers, data analysts, business analysts, machine learning engineers, AI engineers, statistical programmers, and so on. Where does a data scientist fit into this picture and what are the skills a data scientist should have to be competitive in this landscape? How does the rise of AI change all of that?

My personal journey to data science led from forestry to forest biometry to statistics to analytical software development. Like me, many will come to data science from a non-statistical and non-computational path. As educators we need to think about how to ensure the foundations of the discipline are being taught. Assuming that everyone who strives for a career in data science has a solid background in statistics, applied mathematics, and computer science is not helpful. We have to expect varied backgrounds and strengths. Data science, like no other discipline I have encountered over three decades of working with data, combines technical and non-technical skills, combines hard and “soft” (human) skills, and has a focus on solving real-world problems.

As educators we also need to be aware that statistics + mathematics + computer science is a necessary skill set, but not a sufficient skill set, in data science.

Two Foundations

Being great at mathematics, statistics, and computer science—the foundation disciplines—does not make you a great data scientist. What makes you a great data scientist is to apply that knowledge in the context of a domain to solve real-world problems with data and to add value to your organization.

The second foundation of data science consists of knowing how data projects come into existence and how they are managed and executed, and of being able to operate productively in the different phases of a project. Your role changes depending on whether the project is in the discovery, data engineering, modeling, communication, or implementation phase. The data scientist is a key player throughout this project lifecycle, calling on multiple and different skills along the way.

Acquiring that breadth of knowledge and skills takes time. No one enters a graduate program checking all the boxes. What serves you well is a genuine interest in all aspects that make up data science, the curiosity and drive to fill in gaps, a growth mindset, and a desire for lifelong learning.

If you hate math and are not willing to learn the linear algebra concepts necessary to manipulate data science models—forget it. If you have not programmed before and think that you can pass that off to some software engineer—forget it. If you believe your work does not start until you are presented with a curated, flawless dataset—good luck. If you are afraid of public speaking and are not willing to work on improving communication skills—forget it.

On the other hand …

If you are excited about learning principles of data engineering and statistical learning, about communicating data concepts to colleagues, customers and executives, if you are aching to write better software, and love being part of complex projects characterized by ambiguity and uncertainty, then welcome! Data science has been waiting for you.

The origin of this material lies in recognizing how data science as a discipline has evolved and continues to change. The foundational disciplines remain the same, but the demands on data scientists today are much more varied and complex than some courses and certifications would lead you to believe. The material is not intended to teach the foundation disciplines. There is math and stats and computing throughout, but that is not the focus. For those who need to refresh some prerequisite knowledge from the foundation disciplines, there is a module with review material.

I have witnessed the transition from creating data/analytic teams in organizations (because that is what everyone was doing) to measuring the success and justification of those teams in terms of value created for the organizations. Many data science projects are faltering; study after study reveals that the majority of data science projects do not succeed. There are many reasons for this sobering reality, and how we educate and train for data science careers definitely plays a part.

This course covers the basics of the phases of data science projects, from developing a business understanding to deploying a data science solution. It exposes you to thinking as a data scientist and to data science as a complex team sport.

Differentiation

Almost every type of organization today engages with data and practices some form of data science: an academic department at a university, a consulting company, a non-profit, a local, state, or federal government agency, a university IT department, and so on.

The majority of data science jobs are found in commercial settings, sometimes lumped together as industry. Data science solves real-world problems, and the problems companies are interested in solving are those that support the business model. Data for Good, data science in pursuit of making a better world, is a noble goal. For most companies it is a discretionary goal that ranks behind the primary goals of growing the company value and satisfying shareholders. The reality of most jobs is Data for Money.

It is not about slinging code and developing a cool model. As someone said,

a data science model in a Jupyter notebook has $0.0 value.

Understanding and applying classical and modern methods of statistical modeling and machine learning is a given. But your random forest is not going to be any better than your colleague’s random forest. You are probably using the same software and are tuning the same hyperparameters. You can implement LeNet-5 or AlexNet for image classification with a few lines of R or Python code—if you are not sure how, check Stack Exchange or ask an LLM (large language model) for help.

What are the differentiating factors in this competitive field?

Understanding how to take a business problem or a research or policy question and turn it into a data project, executing the data project as a team, and translating the result back into a business, research, or policy solution.
Being able to communicate with business, information technology, marketing, sales, finance, and legal to define, develop and implement a data-driven solution.
Being at times evangelist, programmer, project manager, instructor, presenter, and implementer.
Understanding the ethical, security, and privacy issues associated with data and data-driven decisions.
Cultivating a desire to learn as new tools and methods affect what can be done with data—GPT anyone?

What about AI?

Since November 2022, when ChatGPT was released on top of the GPT-3.5 large language model, we have seen a sea change in artificial intelligence capabilities. Other chatbots followed quickly, initially compiling (and hallucinating) responses based on historical training data. That was impressive. But how much things have changed since then!

AI can now write Python or R code and SQL queries. It can clean datasets and produce data quality reports, produce visualizations, generate data, and so on. The AI tools are now actively searching for information that goes beyond the training data and can incorporate large amounts of new information with ease.

What AI cannot do is ask the right business question, operate within a corporate structure, balance efficiency, ethics, and data privacy, communicate trade-offs to non-technical audiences, map statistical metrics to business metrics, and so on.

Will data science be replaced by AI? As with other professions, AI is more likely to replace the execution of tasks than critical thinking and problem solving. Whatever the job will be called in the future, what matters is whether you add value to your organization. If you do, no AI will come for your job. As Analyst Uttam puts it:

The world runs on data. Always has, always will.
But the people who succeed in 2025 and beyond?
[…]
You’re not here to write perfect Python.
You’re here to help your company make better decisions.

If in doubt, in a world where AI is omnipresent, check what skills companies are looking for in their position announcements. This is where the value lies. Uttam lists the following:

SQL Fluency: not sexy, but powerful
Business Understanding: can you tie together data, metrics, and business goals?
Communication Skills: can you tell a story with data about data?
Operationalization: can you ship (deploy) models and not just train them?
Data Engineering Literacy: do you understand data quality, data integration, data pipelines, or can you partner with those who do?

Lifelong Learning

Lifelong learning is a mindset that has worked in my favor for decades. It is a growth mindset that builds on curiosity to push your boundaries further out. Faced with a formidable task you do not know how to accomplish, you say to yourself:

I don’t know how to do this. Not yet!

Contrast this with a fixed mindset that views your abilities as limited and confines you to a comfort zone, not pushing against the boundaries. With a fixed mindset, “Not yet!” turns into “Not now” or “Not ever”.

A growth mindset and a joy of learning serve you well in data science. Each of the technology revolutions of the last decades has had a profound effect on the way we collect, store, analyze, and use data: the internet, social media, cloud computing, artificial intelligence. You cannot just sit out any of these trends, hoping that they will blow over. You have to engage, figure out what they are about, what works and what does not, and how they can help your organization be better at working with data. It takes an ongoing commitment to be curious, to be critical, and to learn.

We might not know what the next disruption looks like, but we know it is coming. A few years ago, large language models were an interesting novelty; today they have changed our view of what is possible with data. At the same time, principal component analysis (PCA), one of the oldest statistical methods, continues to be one of the most important tools for summarization, visualization, and dimension reduction in statistics and machine learning. This tension between old and new is what makes data science such an exciting field.

Pitter patter, let’s get at ’er.

This material was written in Quarto, because it handles multiple programming languages in the same document, works in RStudio like RMarkdown on steroids, incorporates $\LaTeX$ beautifully, and creates great-looking and highly functional documents. To learn more about Quarto books visit https://quarto.org/docs/books.