50 Introduction
Why should data science be concerned with ethics, the moral principles that govern our behavior? Because our discipline, what we do, has an impact on people’s lives. And so it behooves us to make sure this impact does not cause harm, for example by perpetuating stereotypes or by misallocating or withholding resources.
In their book “Ethics and Data Science”, Loukides, Mason, and Patil (2018) note that,
The hard thing about being an ethical data scientist isn’t understanding ethics. It’s the junction between ethical ideas and practice. It’s doing good data science.
There is no shortage of ethics guidelines and principles for data professionals. The Royal Statistical Society lists five principles:
- Seek to enhance the value of data science for society
- Avoid harm
- Apply and maintain professional competence
- Seek to preserve or increase trustworthiness
- Maintain accountability and oversight
The American Statistical Association lists eight principles for all those who engage in statistical practice, regardless of job title, profession, or degree. In other words, if you work with data, you should observe these principles:
- Professional integrity and accountability
- Integrity of data and methods
- Responsibility to stakeholders
- Responsibilities to research and data subjects
- Responsibilities to members of multidisciplinary teams
- Responsibilities to fellow statistical practitioners and the profession
- Responsibilities of leaders, supervisors, and mentors
- Responsibilities regarding potential misconduct
Each of these principles describes behaviors and expectations for data scientists, their organizations, supervisors, and team members.
Most people want to be fair, most people do not want to do harm, and most people want to behave ethically. Ethical data science is not just about avoiding harm. It is about actively working to create benefit and improve lives through the responsible use of data technology. The challenge, then, is not to follow this or that set of principles and to argue about the legalese; that is the domain of lawyers. The challenge is how to do good data science: how to conduct ourselves ethically and how to perform our work in a way that leads to ethical outcomes. It is one thing to follow the principle of obtaining consent from data subjects before collecting or using their data (the ethical principle). It is another thing to do this at scale in a web application, with transparency that explains to data subjects exactly how their data will be handled and used (ethics in practice).
In the words of Loukides, Mason, and Patil (2018),
After the ethical principle, we have to think about the implementation of the ethical principle.
This is not solved by placing professional ethicists on data science teams. It is not solved by hiring people whose job is ethics, but by behaving ethically ourselves and living ethical values. It is not solved by talking about how to use data ethically, but by using data ethically.
50.1 Why Now?
The conversations about ethics, bias, and fairness in data science have intensified over the past decade. Yet, data analysis and decisions based on data are not a new phenomenon. What changed? Why is data ethics a hot button issue today?
- More data are collected, and data owners want to know how their data are used.
- More decisions are based on algorithms derived from data rather than on expert knowledge or business rules.
- Automation enables us to apply algorithms at greater scale.
- More data fall under the definition of personal data, and it is easier to infer personal data by combining data sources.
- Artificial intelligence and machine learning have penetrated domains where they replace sensing (reading, seeing, listening) rather than logic.
- Artificial intelligence is now capable of generating content that can easily be mistaken for human-generated material, and it can fake and impersonate.
- Data-driven models are used in situations where much is riding on the decisions: sentencing guidelines, credit approvals, medical diagnosis, identification, …
- Greater complexity of data-based models leads to greater opaqueness; many models are inscrutable black boxes that are difficult to explain.
- The consequences of bias or unfairness are severe. Beyond legal repercussions when organizations break the law, missteps are damaging to reputation and destructive to the business.
- More unintended consequences arise from using data. A well-intended application can have negative side effects that cause harm and jeopardize the business. We will see several examples.
- Internet-based data collection (web scraping, online questionnaires, social media feeds) can lead to large databases of questionable quality and representativeness. This raises questions about inherent bias in the data, as well as about ownership and the right to use these data.
Example: Baker Institute: More Automation means More Inequality
In a 2020 study by the Baker Institute for Public Policy at Rice University, the authors conclude that increased automation does not lead to a loss of jobs so much as it leads to an increase in inequality. They separate workers into three coarse groups: skilled workers who represent the top 10% of wages, medium-skilled workers who represent the next 40%, and low-skilled workers who represent the lower half of the wage distribution. The macroeconomic model used in the study is fairly simple, assuming that medium- and low-skilled workers can be replaced by automation and robotics, but highly skilled workers cannot. In that scenario, the wage share of skilled workers increases, and that of the other workers, in particular the medium-skilled, decreases.
Since 2020, things have changed. With the rise of generative AI, the skills targeted for possible automation have reached a different, much higher level.
50.2 Ethical Challenges in Data Science
The core ethical challenges in data science today are:
Privacy and Consent
Collecting, storing, and using personal data raises questions about informed consent, data minimization, anonymization, and the right to be forgotten. The aggregation of seemingly innocuous data can reveal sensitive information people never intended to share. Data points that are not revelatory by themselves can be used in combination to discover sensitive information.
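As a toy illustration of such a linkage attack, the sketch below joins a hypothetical “de-identified” health table with a public record that shares the quasi-identifiers zip code, birth date, and sex. All names, columns, and values are made up.

```python
import pandas as pd

# A "de-identified" health table: no names, only quasi-identifiers and a diagnosis.
health = pd.DataFrame({
    "zip":       ["24060", "24060", "90210"],
    "birthdate": ["1985-03-02", "1992-07-15", "1985-03-02"],
    "sex":       ["F", "M", "F"],
    "diagnosis": ["diabetes", "asthma", "hypertension"],
})

# A public record that does contain names (think of a voter roll or a public profile).
public = pd.DataFrame({
    "name":      ["A. Jones", "B. Smith"],
    "zip":       ["24060", "24060"],
    "birthdate": ["1985-03-02", "1992-07-15"],
    "sex":       ["F", "M"],
})

# Joining on the quasi-identifiers re-attaches names to diagnoses,
# even though neither table looks revelatory on its own.
linked = public.merge(health, on=["zip", "birthdate", "sex"], how="inner")
print(linked[["name", "diagnosis"]])
```

The health table is sensitive but nameless, and the public record is named but innocuous; combined, they yield named medical diagnoses. This is why anonymization has to consider quasi-identifiers, not just obvious identifiers such as names.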
Example: The Cambridge Analytica scandal, where personal data from millions of Facebook profiles was harvested without consent and used for political advertising, highlights the potential for privacy breaches in data science.
Bias and Fairness
Data reflects historical inequities, and models trained on biased data perpetuate or amplify discrimination. This shows up in hiring algorithms, credit scoring, criminal justice risk assessments, and more. Defining “fairness” itself is not trivial—should we optimize for equal outcomes, equal treatment, or equal opportunity?
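To see the tension concretely, the sketch below computes two common criteria, demographic parity (equal rates of positive predictions across groups) and equal opportunity (equal true positive rates across groups), for two hypothetical groups. The group labels, outcomes, and predictions are made up.

```python
import numpy as np

# Hypothetical data: protected group, true outcome, and model prediction.
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0])

for g in ["A", "B"]:
    mask = group == g
    # Demographic parity compares the rate of positive predictions per group.
    positive_rate = y_pred[mask].mean()
    # Equal opportunity compares the true positive rate per group.
    tpr = y_pred[mask & (y_true == 1)].mean()
    print(f"group {g}: positive rate = {positive_rate:.2f}, TPR = {tpr:.2f}")
```

In this toy example the groups differ on both criteria, and in general a model can satisfy one criterion while violating the other. “Fair” therefore has to be defined, and the definition defended, before it can be measured or optimized.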
Example: Amazon’s AI recruiting tool showed bias against women because it was trained on resumes submitted over 10 years, which were predominantly from men due to male dominance in the tech industry.
Transparency and Explainability
Complex models can be black boxes, making it difficult to understand how decisions are reached. This creates accountability problems, especially in high-stakes domains like healthcare or lending. Steps we take to make models perform better often reduce transparency and interpretability, creating a tension (trade-off) between performance and transparency.
Example: In healthcare, data science models are used to diagnose diseases and recommend treatments. Doctors and patients need to understand how these recommendations are made to trust them and use them effectively.
Power Asymmetries
Organizations with data and computational resources have significant power over individuals and smaller entities. Data subjects often have little control or even awareness of how their data is used.
Example: Cambridge Analytica, a political consulting firm, harvested personal data from approximately 87 million Facebook users without their explicit consent. The data was obtained through a personality quiz app that only about 270,000 people installed, but which also scraped data from all their Facebook friends. Facebook controls a massive platform with data on billions of users; it sets the terms of service and privacy policies. Users have little practical choice but to accept the terms if they want to participate in the platform. Users had virtually no knowledge that their data was collected and used this way, and no way to prevent it, audit it, or seek redress.
Data Quality and Integrity
Poor data quality—lack of accuracy, validity, uniqueness, timeliness, or completeness—can lead to incorrect conclusions and harmful decisions.
Example: During the COVID-19 pandemic, inconsistencies in data reporting across regions led to challenges in understanding the spread of the virus and making public health decisions.
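Several of these quality dimensions can be checked programmatically before an analysis begins. The sketch below uses pandas on a made-up table of case reports; the column names, values, and thresholds are hypothetical.

```python
import pandas as pd

# Hypothetical case reports exhibiting some of the problems named above.
cases = pd.DataFrame({
    "case_id":     [101, 102, 102, 104],                              # duplicate id  -> uniqueness
    "report_date": ["2021-01-05", "2021-01-06", "2021-01-06", None],  # missing date  -> completeness
    "age":         [34, -2, 57, 41],                                  # negative age  -> validity
})
cases["report_date"] = pd.to_datetime(cases["report_date"])

print("duplicate ids :", cases["case_id"].duplicated().sum())
print("missing dates :", cases["report_date"].isna().sum())
print("invalid ages  :", (cases["age"] < 0).sum())

# Timeliness: how many reports are more than a year old?
age_in_days = (pd.Timestamp.today() - cases["report_date"]).dt.days
print("stale records :", (age_in_days > 365).sum())
```

Checks like these do not guarantee that the data are correct, but they surface duplicates, gaps, impossible values, and stale records before those problems distort downstream conclusions.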
In the era of (generative) AI, additional ethical challenges arise.
Dual Use and Misuse
AI capabilities can be weaponized or used for surveillance, manipulation, and control. The same technology can have beneficial and harmful applications.
Intellectual Property
Generative AI systems challenge the traditional understanding of fair use in copyright law. Models are trained on vast amounts of data, often without the consent of copyright holders. Who owns AI-generated content, and can such content undermine the livelihood of artists, writers, and other creators?
Automation and Labor Displacement
AI’s impact on employment raises questions about economic justice, the social contract, and how we distribute the benefits and harms of automation.
Environmental Costs
Training large models requires enormous computational resources with significant carbon footprints.
Long-term Existential Considerations
As AI systems become more capable, questions arise about alignment with human values, autonomous decision-making, and maintaining meaningful human agency.