26  Data Science Tools

26.1 Tech Stacks

Because the “T-shaped” role of the data scientist touches many disciplines and systems, it can take many tools to do the work: project management tools like JIRA or Asana; machine learning frameworks like TensorFlow, CNTK, PyTorch, or Keras; multiple programming languages (R, Python, Scala, Java, JavaScript, C++, Go); multiple databases; visualization tools; cloud providers; ETL and ELT tools; DevOps tools; ModelOps tools; dashboard builders. The list goes on and on. The abundance of software tools and frameworks available to data scientists can be overwhelming, and it is growing every day.

A few comments and recommendations:

  • You do not need to be an expert at every tool in order to be an expert data scientist.

  • You can and should develop your own preferred tech stack.

  • Every tool has pros and cons, and everyone has preferences.

  • Look at job postings to learn about the tech stack of employers. Some companies have published or described their tech stack online. This will give you an idea of the tools you will encounter at those employers and how companies make choices about their data stack.

  • Check Reddit, Stack Overflow, Stack Exchange, GitHub, etc. The level of activity on these sites is a good indicator of a tool’s relevance.

  • In an organization with a small data science team, you will have a different set of tasks and tools than as part of a larger data science team with specialists in data engineering, visualization, etc.

  • In research settings the tech stack tends to be smaller because the range of data science tasks is narrower.

  • You will find tools built in-house in larger companies. Many organizations use a blend of internal and 3rd-party tools. Some companies have their own forks of open-source software projects.

  • Be ready to adapt and learn new tools, technologies, languages.

Employers will have preferences and standards you must comply with. Expertise with one tool makes switching to another tool easier. If your employer is an AWS shop you will not convince them to switch to Google BigQuery. If you have basic SQL skills, you will be able to move from BigQuery to Amazon Redshift. If you are familiar with business intelligence tools like Tableau or Alteryx, adding Power BI to your toolbox is not a problem.

You will also find that organizations have invested heavily in systems in the past and are slow to move on, even when more modern and better-performing options are available. For example, many companies have built data lakes and machine learning environments on the Hadoop ecosystem. Although Hadoop has largely been superseded by cloud-based object storage, migrating off a Hadoop cluster is costly and time consuming, so Hadoop-based tools such as Hive, Impala, Kudu, Pig, Mahout, Sqoop, and Zookeeper will be around for a while longer. Be ready to work with tools that might not be on your list of favorites.

Example Tech Stacks

Here are some examples of tech stacks at companies. These include frontend and backend tools as well as data tools. We dive a bit deeper into the data engineering stack at Meta (formerly Facebook) in the next section.

In the video below from 2023, Chris Wiggins, Chief Data Scientist of the New York Times, discusses the evolution of the company’s data science tech stack:

Like many other organizations, their data science tech stack keeps changing as the company iterates to find what works best. They moved from writing their own MapReduce jobs against S3 buckets, to jobs in Hive and Pig, to running their own Hadoop instance, to going all in on BigQuery and the GCP tech stack: code is written in Python leveraging scikit-learn (and in Go when necessary), data are read from BigQuery, model output is pushed back to BigQuery (sometimes behind a hosted API), and the containerized jobs are scheduled with an Airflow instance on GCP.
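
The read-from-BigQuery, model-in-scikit-learn, write-back-to-BigQuery pattern is easy to picture in code. Below is a minimal sketch, not the New York Times’ actual code: the project, dataset, table, and column names are hypothetical, and it assumes the pandas-gbq package. An Airflow task could call a function like this on a schedule.

```python
# Minimal sketch of a BigQuery-in / BigQuery-out batch job
# (hypothetical project, dataset, table, and column names)
import pandas_gbq
from sklearn.linear_model import LogisticRegression

PROJECT = "my-project"  # hypothetical GCP project id

# Read features and labels directly from BigQuery into a DataFrame
df = pandas_gbq.read_gbq(
    "SELECT user_id, f1, f2, label FROM analytics.training_data",
    project_id=PROJECT,
)

# Fit a simple scikit-learn model
model = LogisticRegression(max_iter=1000).fit(df[["f1", "f2"]], df["label"])

# Score and push the output back to BigQuery, so downstream consumers
# use the same platform, format, and SQL access as the inputs
df["score"] = model.predict_proba(df[["f1", "f2"]])[:, 1]
pandas_gbq.to_gbq(
    df[["user_id", "score"]],
    destination_table="analytics.model_scores",
    project_id=PROJECT,
    if_exists="replace",
)
```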

What are the takeaways?

  • Moving away from Hadoop
  • Not afraid of changing cloud providers (from AWS S3 to Google Cloud Platform and BigQuery)
  • Reading from and writing back to the same data platform: minimizing data movement, single format, SQL access
  • At every stage the tech stack was fairly modern, and the company is not afraid to change. Both are good attributes.

Here are the tech stacks of some other companies:

  • Google: Python, Java, AngularJS, Golang, C++, Dart, Preact, K8s, Android Studio, Bazel

  • Facebook: React, PHP, GraphQL, Cassandra, Memcached, Presto, Flux, Tornado, RocksDB, Jenkins, Chef, Phabricator, Datadog, Confluence

  • Netflix: Python, Node.js, React, Java, MySQL, PostgreSQL, Flask, AWS (S3, EC2, RDS, DynamoDB, EMR, CloudTrail), Cassandra, Oracle, Hadoop, Presto, Pig, Atlas-DB, GitHub, Jenkins, Gradle, Sumo Logic.

  • Uber: Python, jQuery, Node.js, React, Java, MySQL, NGINX, PostgreSQL, MongoDB, Redis, Amazon EC2, Kafka, Golang, Cassandra, Apache Spark, Hadoop, AresDB, Terraform, Grafana, Prometheus, Zookeeper. Also see this article on data science at Uber.

  • Shopify: Python, React, MySQL, NGINX, Redis, GraphQL, Kafka, Golang, Memcached, Apache Spark, Hadoop, dbt, Apache Beam, ElasticSearch, GitHub, Docker, K8s, Datadog, Chef, Zookeeper

  • Udemy: Python, jQuery, Node.js, React, MySQL, NGINX, CloudFlare, AngularJS, Redis, Django, Spring Boot, Kafka, Kotlin, Memcached, ElasticSearch, GitHub, Docker, Jenkins, K8s, PyCharm, Ansible, Terraform, Sentry, Datadog

The Meta Data Engineering Stack

We picked the stack for data engineering at Meta (formerly Facebook) for a deeper examination because their team published a detailed article that discusses some of the characteristics of data engineering at large, modern companies (Meta 2023):

  • Very large data warehouses
  • Complex data pipelines
  • A mix of commercial and open-source tools
  • A mix of in-house and 3rd-party tools

The main data warehouse for analytics consists of a collection of millions of Hive tables stored in ORC (Optimized Row Columnar) format (see Section 10.1.4). Meta maintains its own fork of ORC, which suggests that they optimized the file format for their use cases.
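
ORC is a columnar file format that can also be read and written outside of Hive. A small sketch using pyarrow follows; the file and column names are made up, and it assumes a recent pyarrow build with ORC support.

```python
# Write and read an ORC file with pyarrow (hypothetical file and column names)
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 4, 7]})
orc.write_table(table, "events.orc")   # columnar, compressed on disk

events = orc.read_table("events.orc")  # back into an Arrow table
print(events.to_pandas())
```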

The data warehouse is so large that it cannot be stored in one data center. The data are partitioned geographically and logically into namespaces—groups of tables that are likely used together. Tables in the same namespace are located together in the same data center location to facilitate merges and joins without sending data across geographies. If data needs to be accessed across namespaces, the data are replicated to another namespace so that they can be processed at the same location.

You really have a lot of data if the analytic data needs to be spread across multiple data centers in multiple geographies. The total size of the Meta data warehouse is measured in exabytes (millions of terabytes).

Meta has a strict data retention policy: table partitions older than the table’s retention time are deleted or archived after the data have been anonymized.

To find data in such a massive data warehouse, Meta developed its own tool, iData, to search for data by keyword. The iData search engine returns tables ranked by relevance, considering data freshness, the number of uses, and the number of times the table is mentioned in posts.

To query the data warehouse, Meta uses Presto and Spark. Presto is an open-source SQL query engine originally developed at Meta; after open-sourcing it, Meta continues to maintain its own internal fork. SQL (Presto SQL or Spark SQL) is key for querying the data at Meta. Presto is used for most day-to-day queries; even a light query at Meta’s scale scans through a few billion rows of data. Spark is used for the heavy workloads.
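
To illustrate the SQL-first workflow, here is a generic PySpark sketch (not Meta’s internal setup; the warehouse path and column names are hypothetical) showing how the same kind of analytic SQL can be handed to Spark for a heavier batch job:

```python
# Run an analytic SQL query with Spark SQL (generic sketch;
# the parquet path and column names are hypothetical)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heavy-aggregation").getOrCreate()

# Register a table backed by files in the warehouse
spark.read.parquet("/warehouse/events").createOrReplaceTempView("events")

# The same style of SQL you would send to Presto for lighter queries
daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n_events
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")

daily.write.mode("overwrite").parquet("/warehouse/daily_event_counts")
```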

Data exploration and analysis are based on internal tools: Daiquery is the internal tool for querying and visualizing any data source, and Bento is an internal implementation of Jupyter notebooks for Python and R code.

Dashboards are created with another internal tool, Unidash.

Data pipelines are written in SQL, wrapped in Python, and orchestrated with Dataswarm, a predecessor of Airflow.
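
Dataswarm itself is not public, but the pattern of SQL wrapped in Python and scheduled as a DAG carries over directly to its open-source descendant, Airflow. A minimal Airflow 2-style sketch, with hypothetical table names and a local SQLite file standing in for the warehouse connection:

```python
# A SQL-wrapped-in-Python pipeline expressed as an Airflow DAG
# (hypothetical tables and schedule; SQLite stands in for the warehouse)
from datetime import datetime
import sqlite3

from airflow import DAG
from airflow.operators.python import PythonOperator

DAILY_ROLLUP_SQL = """
    INSERT INTO daily_rollup
    SELECT event_date, COUNT(*) FROM events GROUP BY event_date
"""

def run_rollup():
    # A real pipeline would use a warehouse connection/hook instead
    with sqlite3.connect("/tmp/warehouse.db") as conn:
        conn.execute(DAILY_ROLLUP_SQL)

with DAG(
    dag_id="daily_rollup",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="build_daily_rollup", python_callable=run_rollup)
```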

VSCode is the IDE of choice for developing data pipelines and has been enhanced with custom plugins developed internally. For example, a custom linter checks SQL statements, and on save the internal VSCode extension generates the directed acyclic graph (DAG) for the pipeline. The data engineer can then schedule a test run of the pipeline on real data, writing the results to a temporary result table.

26.2 Exploratory Data Analysis and Business Intelligence

Business Intelligence (BI) is the practice of processing organizational data and presenting it in reports and dashboards. The goal is to support an organization’s operations with relevant data; key functions are monitoring, reporting on, and analyzing business operations. BI overlaps with Exploratory Data Analysis (EDA) in that it is highly descriptive, relying on visualizations and summarizations to show what is and has been happening.
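
Whether it happens in a BI tool or in a notebook, the descriptive core is the same: summarize and visualize. A tiny EDA sketch in Python, using the penguins demo dataset that ships with seaborn:

```python
# Quick descriptive look at a dataset: summaries plus a simple visualization
import seaborn as sns
import matplotlib.pyplot as plt

penguins = sns.load_dataset("penguins")    # small demo dataset bundled with seaborn

print(penguins.describe())                 # numeric summaries
print(penguins["species"].value_counts())  # categorical breakdown

sns.scatterplot(data=penguins, x="flipper_length_mm", y="body_mass_g", hue="species")
plt.title("Flipper length vs. body mass")
plt.show()
```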

Here is a list of some BI tools you will encounter in practice (in no particular order):

26.3 Data Engineering

Data engineers build pipelines that help to collect, merge, cleanse, prepare, and transform data for subsequent analytics. Data engineering defines, creates, and maintains the infrastructure that enables modern data analytics. Key steps in the data engineering workflow are pipelining, data replication, change-data-capture (CDC), ETL (Extract-Transform-Load) and/or ELT (Extract-Load-Transform).
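
A toy end-to-end example may help fix the terms: extract raw records, transform them, and load them into a target system. The file, column, and table names below are made up, and a local SQLite database stands in for the warehouse. In an ELT pattern the raw extract would be loaded first and the transformation would run inside the warehouse, which is the niche that tools like dbt fill.

```python
# Minimal ETL sketch: extract from CSV, transform with pandas, load into SQLite
# (hypothetical file, column, and table names)
import sqlite3
import pandas as pd

# Extract
orders = pd.read_csv("raw_orders.csv", parse_dates=["order_date"])

# Transform: clean and aggregate
orders = orders.dropna(subset=["customer_id"])
daily_revenue = (
    orders.groupby(orders["order_date"].dt.date)["amount"]
    .sum()
    .reset_index(name="revenue")
)

# Load
with sqlite3.connect("warehouse.db") as conn:
    daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```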

Here is a list of some common tools used in data engineering.

  • dbt (Dbt Labs). The “t” in dbt is the “T” in ELT: dbt is a SQL-based data engineering tool that assumes the data are already loaded into the target system and transforms the data where they live.
  • Fivetran. An ELT and data-movement platform with extensive data replication and change-data-capture capabilities.
  • CData. Data connectivity, data movement, data sync (CDC).
  • Spark. An engine for distributed big-data analytics with interfaces for Python (PySpark), SQL (Spark SQL), Scala, Java, and R (SparkR).
  • Dask. A parallel-processing framework for Python; see the short sketch after this list.
  • Apache Airflow. A Python-based tool to create and manage workflows. Often used to pipeline data.
  • Prefect. Workflow orchestration for data engineers and ML engineers.
  • Apache Kafka. Open-source distributed event-streaming platform that is frequently used to move data through streaming pipelines.
  • Matillion. Build and orchestrate data pipelines.
  • Databases (see Section 10.2)
  • ElasticSearch. Distributed search and analytics engine.
  • Presto. An open-source, distributed SQL engine for analytic queries.
  • Redis. An open-source, in-memory data store. Often used as a memory cache.
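
As mentioned in the Dask entry above, here is a short sketch of the parallel-processing idea: the API mirrors pandas, but the data are split into partitions and nothing runs until .compute() is called. The file paths and column names are hypothetical.

```python
# Parallel, out-of-core aggregation with Dask (hypothetical files and columns)
import dask.dataframe as dd

# Each CSV matching the glob becomes one or more partitions
events = dd.read_csv("logs/events-*.csv")

# Lazily build the computation, then execute it in parallel
clicks_per_user = events.groupby("user_id")["clicks"].sum()
print(clicks_per_user.compute().head())
```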

26.4 Data Visualization

  • Python-based
    • pandas, matplotlib, seaborn, plotly, Vega-Altair, plotnine (a ggplot2 port; see the short sketch after this list)
  • R-based
    • tidyverse (dplyr, tidyr, ggplot2), shiny
  • Many of the tools listed in Section 26.2.
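
As noted in the Python list above, plotnine brings the ggplot2 grammar of graphics to Python, so the same mental model carries across both language stacks. A small sketch with made-up data:

```python
# Grammar-of-graphics plot in Python with plotnine (made-up data)
import pandas as pd
from plotnine import ggplot, aes, geom_point, labs

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "score":         [52, 60, 71, 75, 82, 90],
})

plot = (
    ggplot(df, aes(x="hours_studied", y="score"))
    + geom_point()
    + labs(title="Score vs. hours studied")
)
plot.save("score.png")  # or print(plot) in an interactive session
```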

26.5 Data Analytics and Machine Learning

Cloud Service Providers

Languages & Packages

  • Python-based
    • numpy, scipy, pandas, polars, statsmodels, scikit-learn (see the short pipeline sketch after this list)
    • PySpark
  • Scala-based
    • Spark
  • R-based
    • Basic modeling capabilities are built into the language
    • dplyr, tidyr, caret, gam, glmnet, nnet, kernlab, e1071, randomForest, tree, gbm, xgboost, lme4, boot, …. See the CRAN Task View on Machine Learning and Statistical Learning
    • SparkR
  • Java: mostly used to put models into production and to build applications rather than for model building.
    • Deeplearning4j: open-source toolkit for Java to deploy deep neural nets.
    • ND4J: n-dimensional array objects for scientific computing
  • Golang: Go is used mainly as a language for managing and orchestrating backend architecture but is finding more applications in data orchestration.
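
To make the Python side of the list concrete, here is the small scikit-learn sketch referenced above: it chains preprocessing and a classifier into a pipeline and evaluates it with cross-validation, using a demo dataset bundled with scikit-learn.

```python
# Small modeling sketch with scikit-learn: pipeline + cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Scaling and the classifier are fit together inside each CV fold,
# which avoids leaking information from the held-out fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```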

Commercial Offerings

26.6 Deep Learning

26.7 IDEs and Developer Productivity

26.8 Cloud Computing & DevOps

26.9 Web Development

26.10 Programming Languages

  • R
  • Python
  • SQL
  • Scala
  • Julia
  • PHP
  • HTML
  • CSS
  • C/C++
  • Go
  • Rust
  • Java
  • JavaScript
  • TypeScript