48 Data Science Tools
48.1 Introduction
Because the “T-shaped” role of the data scientist touches many disciplines and systems, it can take many tools to do the work: project management tools like Jira or Asana, machine learning frameworks like TensorFlow, CNTK, PyTorch, or Keras, multiple programming languages (R, Python, Scala, Java, JavaScript, C++, Go), multiple databases, visualization tools, cloud providers, ETL and ELT tools, DevOps and MLOps tools, dashboard builders. The list goes on and on. The abundance of software tools and frameworks available to data scientists can be overwhelming, and it grows every day.
A few comments and recommendations:
- You do not need to be an expert at every tool in order to be an expert data scientist.
- You can and should develop your own preferred tech stack.
- Every tool has pros and cons, and everyone has preferences.
- Look at job postings to learn about the tech stack of employers. Some companies have published or described their tech stack online. This will give you an idea of the tools you will encounter at those employers and how companies make choices about their data stack.
- Check Reddit, Stack Overflow, Stack Exchange, GitHub, etc. The degree of activity on these sites is a good indicator for the relevance of a tool.
- In an organization with a small data science team, you will have a different set of tasks and tools than you would as part of a larger data science team with specialists in data engineering, visualization, etc.
- In research settings the tech stack tends to be smaller, since the range of data science tasks is narrower.
- You will find tools built in-house in larger companies. Many organizations use a blend of internal and 3rd-party tools. Some companies have their own forks of open-source software projects.
- Be ready to adapt and to learn new tools, technologies, and languages.
48.2 Tech Stacks
Employers will have preferences and standards you must comply with. Expertise with one tool makes it easier to switch to another. If your employer is an AWS shop, you will not convince them to switch to Google BigQuery. If you have basic SQL skills, you will be able to move from BigQuery to Amazon Redshift. If you are familiar with business intelligence tools like Tableau or Alteryx, switching to Power BI is not a problem.
You will also find that organizations have invested heavily in systems in the past and are slow to move on, even though more modern and better-performing options are available. For example, many companies have built data lakes and machine learning environments based on the Hadoop ecosystem. Although Hadoop has largely been superseded by cloud-based object storage, migrating off a Hadoop cluster is costly and time consuming. Hadoop-based tools such as Hive, Impala, Kudu, Pig, Mahout, Sqoop, and Zookeeper will be around for a while longer. Be ready to work with tools that might not be on your list of favorites.
In the video below, from 2023, Chris Wiggins, Chief Data Scientist at the New York Times, discusses the evolution of the company's data science tech stack:
The main takeaways:
- Moving away from Hadoop
- Not afraid of changing cloud providers (from AWS S3 to Google Cloud Platform and BigQuery)
- Reading from and writing back to the same data platform: minimizing data movement, single format, SQL access
- At every stage the tech stack was fairly modern, and the company is not afraid to change; these are good attributes.
Like many other organizations, the New York Times keeps changing its data science tech stack, iterating to find what works best. They moved from writing their own MapReduce jobs against S3 buckets, to running jobs in Hive and Pig, to operating their own on-prem Hadoop instance, to going all in on BigQuery and the GCP tech stack: code is written in Python leveraging scikit-learn (and in Go when necessary), data are read from BigQuery, model output is pushed back to BigQuery (sometimes served through an API), and jobs are containerized and scheduled with an Airflow instance on GCP.
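To make the read-model-write pattern concrete, here is a minimal sketch of a job that reads from BigQuery, fits a model with scikit-learn, and writes predictions back to BigQuery. The project, dataset, table, and column names are hypothetical, and the model is a placeholder; this is not the New York Times' actual code.

```python
# Minimal sketch: read from BigQuery, model with scikit-learn, write back to BigQuery.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery
from sklearn.linear_model import LogisticRegression

client = bigquery.Client()  # assumes GCP credentials are configured

# Pull training data with SQL; the result comes back as a pandas DataFrame
df = client.query("""
    SELECT user_id, feature_1, feature_2, label
    FROM `my-project.analytics.training_data`
""").to_dataframe()

# Fit a simple model
features = df[["feature_1", "feature_2"]]
model = LogisticRegression().fit(features, df["label"])

# Score and push the output back to the same data platform
df["score"] = model.predict_proba(features)[:, 1]
client.load_table_from_dataframe(
    df[["user_id", "score"]],
    "my-project.analytics.model_scores",
).result()  # wait for the load job to complete
```

In production, a job like this would be containerized and scheduled from the Airflow instance on GCP, as described above.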
Here are some examples of tech stacks at companies. These include frontend and backend tools as well as data tools. We dive a bit deeper into the data engineering stack at Meta (formerly Facebook) in the next section.
Here are the tech stacks of some other companies:
Google: Python, Java, AngularJS, Golang, C++, Dart, Preact, K8s, Android Studio, Bazel
Facebook: React, PHP, GraphQL, Cassandra, Memcached, Presto, Flux, Tornado, RocksDB, Jenkins, Chef, Phabricator, Datadog, Confluence
Netflix: Python, Node.js, React, Java, MySQL, PostgreSQL, Flask, AWS (S3, EC2, RDS, DynamoDB, EMR, CloudTrail), Cassandra, Oracle, Hadoop, Presto, Pig, Atlas-DB, GitHub, Jenkins, Gradle, Sumo Logic.
Uber: Python, jQuery, Node.js, React, Java, MySQL, NGINX, PostgreSQL, MongoDB, Redis, Amazon EC2, Kafka, Golang, Cassandra, Apache Spark, Hadoop, AresDB, Terraform, Grafana, Prometheus, Zookeeper. Also see this article on data science at Uber.
Shopify: Python, React, MySQL, NGINX, Redis, GraphQL, Kafka, Golang, Memcached, Apache Spark, Hadoop, dbt, Apache Beam, ElasticSearch, GitHub, Docker, K8s, Datadog, Chef, Zookeeper
Udemy: Python, jQuery, Node.js, React, MySQL, NGINX, CloudFlare, AngularJS, Redis, Django, Spring Boot, Kafka, Kotlin, Memcached, ElasticSearch, GitHub, Docker, Jenkins, K8s, PyCharm, Ansible, Terraform, Sentry, Datadog
The Meta Data Engineering Stack
We picked the stack for data engineering at Meta (fka Facebook) for a deeper examination because their team provided a detailed article that discusses some of the characteristics of data engineering at large, modern companies (Meta 2023):
- Very large data warehouses
- Complex data pipelines
- A mix of commercial and open-source tools
- A mix of in-house and 3rd-party tools
The main data warehouse for analytics consists of a collection of millions of Hive tables stored in ORC (Optimized Row Columnar) format (see Section 13.1.4). Meta maintains its own fork of ORC, which suggests that they optimized the file format for their use cases.
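As a small, generic illustration of the ORC format (not of Meta's internal fork), here is how a columnar table can be written to and read back from an ORC file with pyarrow; the column names are made up.

```python
# Write and read an ORC file with pyarrow; column names are made up
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "BR"],
    "clicks":  [10, 4, 7],
})

orc.write_table(table, "events.orc")      # columnar, compressed storage on disk
roundtrip = orc.read_table("events.orc")  # read it back as an Arrow table
print(roundtrip.schema)
```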
The data warehouse is so large that it cannot be stored in one data center. The data are partitioned geographically and logically into namespaces—groups of tables that are likely used together. Tables in the same namespace are located together in the same data center location to facilitate merges and joins without sending data across geographies. If data needs to be accessed across namespaces, the data are replicated to another namespace so that they can be processed at the same location.
You really have a lot of data if the analytic data needs to be spread across multiple data centers in multiple geographies. The total size of the Meta data warehouse is measured in exabytes (millions of terabytes).
Meta has a strict data retention policy: table partitions older than the table’s retention time are deleted, or archived after the data have been anonymized.
To find data in such a massive data warehouse, Meta developed its own tool, iData, to search for data by keyword. The iData search engine returns tables ranked by relevance, taking into account data freshness, how often a table is used, and how often it is mentioned in posts.
To query the data warehouse, Meta uses Presto and Spark. Presto is an open-source SQL query engine originally developed at Meta. Since open-sourcing Presto, Meta has maintained its own internal fork. SQL (Presto SQL or Spark SQL) is key for querying the data at Meta. Presto is used for most day-to-day queries; a light query at Meta’s scale scans through a few billion rows of data. Spark is used for the heavy workloads.
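Meta's internal tables are of course not accessible, but the SQL-first querying style carries over directly to open-source Spark. Here is a minimal Spark SQL sketch against a made-up table:

```python
# Minimal Spark SQL sketch; the data and table name are made up
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-query").getOrCreate()

# Register a tiny DataFrame as a temporary view so it can be queried with SQL
events = spark.createDataFrame(
    [("2024-01-01", "click", 3), ("2024-01-01", "view", 10), ("2024-01-02", "click", 5)],
    ["ds", "event_type", "cnt"],
)
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT ds, event_type, SUM(cnt) AS total
    FROM events
    GROUP BY ds, event_type
    ORDER BY ds
""")
daily.show()
```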
Data exploration and analysis are based on internal tools. Daiquery is the internal tool for querying and visualizing any data source; Bento is an internal implementation of Jupyter notebooks for Python and R code.
Dashboards are created with another internal tool, Unidash.
Data pipelines are written in SQL, wrapped in Python, and orchestrated with Dataswarm, a predecessor of Airflow.
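Dataswarm itself is internal to Meta, but the pattern of SQL wrapped in Python and orchestrated as a DAG looks much like an Airflow pipeline. The sketch below assumes Airflow 2.4 or later; the table names are made up, and the print call stands in for whatever client would actually submit the rendered SQL to the warehouse.

```python
# A daily "SQL wrapped in Python" task, sketched as an Airflow DAG (Airflow 2.4+).
# Table names are made up; print() stands in for submitting the SQL to the warehouse.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

INSERT_SQL = """
INSERT INTO analytics.daily_clicks PARTITION (ds = '{ds}')
SELECT user_id, COUNT(*) AS clicks
FROM raw.click_events
WHERE ds = '{ds}'
GROUP BY user_id
"""

def run_daily_insert(ds, **_):
    sql = INSERT_SQL.format(ds=ds)  # render the SQL for today's partition
    print(sql)                      # placeholder for the call that runs the query

with DAG(
    dag_id="daily_clicks",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="insert_daily_clicks", python_callable=run_daily_insert)
```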
VSCode is the IDE of choice for developing data pipelines and has been enhanced with custom plugins developed internally. For example, a custom linter checks SQL statements. On save, the internal VSCode extension generates the directed acyclic graph (DAG) for the pipeline. The data engineer can then schedule a test run of the pipeline using real data, writing the results to a temporary table.
48.3 Beware the Hype
There is a lot of hype and excitement about data science, machine learning, data engineering, artificial intelligence, etc. And there are many vendors pushing tools that promise to take away complexity, save you money, simplify your tool chain, handle all your use cases, and improve performance, all at the same time. The situation is particularly bad for data engineering and machine learning tools. It would be wonderful if there were a single easy button that produces high-quality, integrated data. It would be wonderful if there were an easy button that develops, trains, validates, deploys, and monitors models.
Saifi (2025b) writes:
If you’ve been shopping for data engineering tools lately, you’ve heard these promises. The vendor demos are slick, the free tiers are generous, and the sales engineers have an answer for every objection. Everything looks perfect — until you’re six months into production and your monthly bill just hit five figures, your “simple” solution requires a team of specialists to maintain, and you’re discovering limitations that somehow never came up during the sales process.
Welcome to the uncomfortable reality of data engineering tool selection, where the marketing promises collide head-first with the physics of production workloads. While vendors are busy painting rosy pictures of seamless data utopias, real engineering teams are dealing with hidden costs, unexpected limitations, and vendor lock-in strategies that would make telecom companies blush.
Data Gravity
Data gravity describes the notion that data increases its gravitational pull as it accumulates, pulling in applications and more data. Data gravity also describes the reality that when data comes to rest, it seems to obey a higher gravitational constant—it becomes heavier and more difficult to move. Here are some examples of data gravity:
- Data warehouses and data lakes grow in complexity and scale as they accumulate more data.
- On social media platforms, user data attracts advertisers, which generates more content and user engagement, further increasing the data volume.
- An increasing number of edge devices generate large amounts of data, attracting the development of IoT (Internet of Things) applications and services that process data locally.
- High frequency trading firms co-locate their servers with stock exchange data centers to reduce latency.
Vendors take advantage of data gravity, locking you into their systems once they have a hold of the data:
- Moving data into a cloud provider is free; taking data out incurs egress charges.
- You start with Databricks for machine learning. The data lake or lakehouse fills up with Delta Lake files, models are trained using MLflow, and notebooks are in Databricks format. Security policies are configured for the Databricks platform.
Vendor lock-in is a tried-and-true strategy. When selecting data science tools and frameworks, ask yourself how difficult the technology would be to replace.
Simple or Complex
Saifi (2025a) talks about the great data engineering oversimplification, in which vendors promise:
- a single platform for all your needs
- zero ETL
- no-code data pipelines that anyone can build
- drag-and-drop and AI does the rest
- one-click data integration
- one platform to rule them all
- and so forth
Saifi writes:
While the marketing teams are busy promising silver bullets, real data engineers are drowning in a sea of supposedly “simple” tools that break in spectacular ways when they encounter actual business problems. The result? A generation of data professionals who’ve been sold on the myth that complex problems should have simple solutions, only to discover that enterprise data engineering is messier, weirder, and more intricate than any no-code platform wants to admit.
Solutions that claim to handle all use cases and eliminate complexity are built around certain use cases, pre-defined templates, and best practices. The tools might work great when your case matches the template and can fail spectacularly when it does not.
Saifi (2025a) argues that data engineering is inherently complex and that the complexity cannot be abstracted away. Data integration is inherently difficult; no drag-and-drop interface can handle its nuances. No push-button operation can move an offline deployment to an online deployment. Everything becomes more difficult as scale increases. For complex problems, organizations should invest in complex solutions. Simple tools are good for simple problems.
How do you decide whether to pursue complexity over simplicity? Saifi (2025a) offers the following framework:
Business Requirements
- Complexity increases with data volumes.
- Complexity increases when data needs to flow faster; real-time processing is the most difficult situation.
- Complexity increases with more data sources and formats to integrate.
- Complexity increases in highly regulated industries.
Team Capabilities
- Complexity without expertise leads to failure. Do you have the skills on the team to manage complex tools?
- Complex tools have a higher cost of ownership. Can you afford the training and operational overhead?
- Do you have the time to implement complex solutions?
Long-term Costs
- Will the solution become more expensive with growing requirements?
- Will the solution scale with the business?
- Does the solution have the flexibility to adapt to changing requirements?
The flip side of these arguments for complexity is that simpler solutions often do the trick. Perhaps you built a near real-time streaming data solution using Kafka, only to find that results are needed weekly. Or you developed a regression model to predict customer age, only to find that the customers’ dates of birth are already in the database.
We have a tendency to over-engineer solutions, leading to systems that Shenoy (2025) calls
beautiful in complexity, brutal in maintenance
The root causes of over-engineered solutions, according to Shenoy, are:
- Tool obsession: just because you know all the tools does not mean they need to be used.
- Future-proofing fantasy: designing for future requirements that do not materialize, e.g., scale.
- Too many cooks: different stakeholders have different goals, trying to accommodate everyone does not serve the overall project goal.
- Fear of simple: misunderstanding simple as junior and complex as smart.
48.4 Sampling of Tools
EDA and BI
Business Intelligence (BI) is the practice of processing organizational data and presenting it in reports and on dashboards. The goal is to support an organization’s operations with relevant data. Key functions are monitoring, reporting on, and analyzing business operations. BI overlaps with Exploratory Data Analysis (EDA) in that it is highly descriptive, relying on visualization and summarization to inform about what is and has been happening.
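Before reaching for a BI platform, much of this descriptive work can be done directly in pandas. A minimal EDA sketch, with a made-up CSV file and column names:

```python
# Minimal descriptive/EDA sketch in pandas; file and column names are made up
import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

print(orders.describe())                # numeric summaries
print(orders["region"].value_counts())  # categorical breakdown

# Monthly revenue: the kind of summary a BI dashboard would display
monthly = orders.set_index("order_date")["revenue"].resample("MS").sum()
monthly.plot(kind="bar", title="Monthly revenue")
plt.show()
```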
Here is a list of some BI tools you will encounter in practice (in no particular order):
- Microsoft Power BI
- Tableau (acquired by Salesforce)
- Heap Analytics
- Metabase
- Mode (acquired by ThoughtSpot)
- ThoughtSpot
- Qlik
- Sisense
- SAP BusinessObjects
- Oracle BI
- TIBCO Spotfire
- Amazon QuickSight
- Looker (Looker Studio)
- DOMO
- IBM Cognos Analytics
- MicroStrategy
- Yellowfin
- SAS Augmented Analytics & BI
- JMP
Data Engineering
Data engineers build pipelines that help to collect, merge, cleanse, prepare, and transform data for subsequent analytics. Data engineering defines, creates, and maintains the infrastructure that enables modern data analytics. Key steps in the data engineering workflow are pipelining, data replication, change-data-capture (CDC), ETL (Extract-Transform-Load) and/or ELT (Extract-Load-Transform).
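As a toy illustration of the extract-transform-load pattern, the sketch below pulls a CSV file, cleans it with pandas, and loads it into a local SQLite table. The file, column, and table names are made up; in practice the source and target would be production systems.

```python
# Toy ETL sketch: extract from CSV, transform with pandas, load into SQLite.
# File, column, and table names are made up.
import sqlite3
import pandas as pd

# Extract
raw = pd.read_csv("raw_orders.csv")

# Transform: drop incomplete rows, standardize types, de-duplicate
clean = (
    raw.dropna(subset=["order_id", "amount"])
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"]),
           amount=lambda d: d["amount"].astype(float),
       )
       .drop_duplicates(subset=["order_id"])
)

# Load into the target system (here a local SQLite "warehouse")
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders_clean", conn, if_exists="replace", index=False)
```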
Here is a list of some common tools used in data engineering.
- dbt (dbt Labs). The “t” in dbt is the “T” in ELT. dbt is a SQL-based data engineering tool that assumes the data are already loaded into the target system; it transforms data where it lives.
- Fivetran. An ELT and data-movement platform with extensive data replication and change-data-capture capabilities.
- CData. Data connectivity, data movement, data sync (CDC).
- Spark. An engine for distributed big-data analytics with interfaces for Python (PySpark), SQL (Spark SQL), Scala, Java, and R (SparkR).
- Dask. A parallel-processing framework for Python
- Apache Airflow. A Python-based tool to create and manage workflows. Often used to pipeline data.
- Prefect. Workflow orchestration for data engineers and ML engineers.
- Apache Kafka. Open-source distributed event-streaming platform that is frequently used to move data through streaming pipelines; a minimal producer sketch follows this list.
- Matillion. Build and orchestrate data pipelines.
- Databases (see Section 13.2)
- ElasticSearch. Distributed search and analytics engine.
- Presto. An open-source, distributed SQL engine for analytic queries
- Redis. An open-source, in-memory data store. Often used as a memory cache.
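To give a flavor of the streaming piece, here is the minimal Kafka producer sketch referenced above. It uses the kafka-python client; the broker address, topic name, and event fields are made up.

```python
# Minimal Kafka producer sketch using the kafka-python client.
# Broker address, topic name, and event fields are made up.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T12:00:00Z"}
producer.send("clickstream", value=event)  # asynchronous send
producer.flush()                           # block until delivery
```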
Data Visualization
- Python-based
- pandas, matplotlib, seaborn, plotly, Vega-Altair, plotnine (a ggplot2 port for Python); see the short example after this list
- R-based
- tidyverse (dplyr, tidyr, ggplot2) and shiny
- Many of the tools listed in Section 48.4.1.
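As a short example of the Python route mentioned above, here is a grouped bar chart built with pandas, seaborn, and matplotlib on a made-up data frame.

```python
# Small visualization sketch with pandas, seaborn, and matplotlib; data are made up
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "region":  ["East", "East", "East", "West", "West", "West"],
    "revenue": [120, 135, 150, 90, 110, 95],
})

sns.barplot(data=df, x="month", y="revenue", hue="region")
plt.title("Revenue by month and region")
plt.tight_layout()
plt.show()
```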
Data Analytics and Machine Learning
Cloud Service Providers
Languages & Packages
- Python-based
- numpy, scipy, pandas, polars, statsmodels, scikit-learn
- PySpark
- Scala-based
- Spark
- R-based
- Basic modeling capabilities are built into the language
- dplyr, tidyr, caret, gam, glmnet, nnet, kernlab, e1071, randomForest, tree, gbm, xgboost, lme4, boot, …. See the CRAN Task View on Machine Learning and Statistical Learning.
- SparkR
- Java: mostly used to put models into production and to build applications rather than for model building.
- Deeplearning4j: open-source toolkit for Java to deploy deep neural nets.
- ND4J: n-dimensional array objects for scientific computing
- Golang: Go is used mainly as a language for managing and orchestrating backend architecture but is finding more applications in data orchestration.
Commercial Offerings
- Alteryx
- KNIME
- Domino Data Lab
- DataRobot
- Dataiku
- H2O.ai
- RapidMiner (acquired by Altair)
- MindsDB
- Databricks
- SAS Viya
- JMP Pro
- MATLAB
Deep Learning
- TensorFlow
- Keras
- Torch
- PyTorch
- Microsoft Cognitive Toolkit (CNTK)
- OpenAI
- OpenCV
- Viso Suite from viso.ai
- DeepLearningKit for Apple tvOS, iOS, OS X
- H2O.ai
- Caffe from Berkeley AI Research
IDEs and Developer Productivity
- IPython: a command shell for interactive computing
- JupyterLab and Jupyter Notebook
- Spyder: IDE for Python
- VSCode: a code editor with IDE-like plugins
- Visual Studio (an IDE)
- DataSpell from JetBrains
- PyCharm: a commercial IDE for Python from JetBrains
- Google Colab(oratory)
- Git, GitLab, GitHub
- GitHub Copilot
- RStudio (free open-source edition)
Cloud Computing & DevOps
- AWS (Amazon Web Services)
- Microsoft Azure
- GCP (Google Cloud Platform)
- Docker
- Kubernetes (K8s)
- Fly.io
- Cloudflare
- Jenkins (CI/CD)
- Ansible
Web Development
- Svelte
- Vue.js
- Angular
- React
- D3.js (data visualization)
- Laravel (PHP based)
- GraphQL (Graphene, Apollo etc.)
- NodeJS
- Flask
- Django
- Heroku
- Vercel, Next.js
- Netlify
- AWS Amplify
- AWS Lambda
- MongoDB Realm
- Firebase
- DigitalOcean
Programming Languages
- R
- Python
- SQL
- Scala
- Julia
- PHP
- HTML
- CSS
- C/C++
- Go
- Rust
- Java
- JavaScript
- TypeScript