16  Data Integration

Data integration is the process of bringing together data from different sources and transforming and combining them into a more valuable form. The goal is to make the data more meaningful and richer for downstream applications, to increase the quality of the data, and to facilitate analytic processing. Pollution data from the EPA informs us about risk factors, hospital admissions data tells us why patients are being seen and treated. By combining the two data sources we can study the association between pollution and health outcomes. Geo-referenced data are often aligned to different units. For example, there are 41,642 unique zip codes in the U.S. and 73,057 census tracts. An analysis of data using zip codes might want to include information from the latest census. An analysis of census data might want to pull in zip-code-level data from municipal sources.

There are many challenges in data integration:

16.1 Combining Tables with SQL

The process of combining tables is based on set operations or joins. A join uses the values in specific columns of the tables to match records. A set operation is a merging of columns without considering the values in the columns. Appending the rows of one table to another table is a set operation. What happens to columns that exist in one table but not in the other during the append depends on the implementation. Similarly, what happens to columns that share the same name when tables are merged horizontally depends on the implementation.

Set Operations

We draw on an example in the DuckDB documentation to explain how tables can be merged by rows or columns. First, let’s create two small tables with information on cities:

CREATE TABLE capitals(city VARCHAR, country VARCHAR);
INSERT INTO capitals VALUES ('Amsterdam', 'NL'), ('Berlin', 'Germany');

CREATE TABLE weather(city VARCHAR, degrees INTEGER, date DATE);
INSERT INTO weather VALUES ('Amsterdam', 10, '2022-10-14'), 
                           ('Seattle', 8, '2022-10-12');
FROM capitals;

┌───────────┬─────────┐
│   city    │ country │
│  varchar  │ varchar │
├───────────┼─────────┤
│ Amsterdam │ NL      │
│ Berlin    │ Germany │
└───────────┴─────────┘
FROM weather;

┌───────────┬─────────┬────────────┐
│   city    │ degrees │    date    │
│  varchar  │  int32  │    date    │
├───────────┼─────────┼────────────┤
│ Amsterdam │      10 │ 2022-10-14 │
│ Seattle   │       8 │ 2022-10-12 │
└───────────┴─────────┴────────────┘  

Set operations involve two SELECT queries. The queries are connected with clauses that control how the rows are combined:

  • UNION: combines rows from queries that have the same columns (number and types) and eliminates duplicates.

  • UNION ALL: combines rows from queries that have the same columns (number and types) and preserves duplicates.

  • INTERSECT: selects rows that occur in both queries and removes duplicates.

  • EXCEPT: selects rows that occur in the left query and removes duplicates.

  • UNION BY NAME: works like UNION but does not require the queries to have the same number and types of columns. Eliminates duplicates, like UNION.

  • UNION ALL BY NAME: works like UNION BY NAME but does not eliminate duplicate rows.

Here are the result of UNION and UNION ALL clauses:

SELECT city FROM capitals UNION SELECT city FROM weather;

┌───────────┐
│   city    │
│  varchar  │
├───────────┤
│ Amsterdam │
│ Seattle   │
│ Berlin    │
└───────────┘
SELECT city FROM capitals UNION ALL SELECT city FROM weather;

┌───────────┐
│   city    │
│  varchar  │
├───────────┤
│ Amsterdam │
│ Berlin    │
│ Amsterdam │
│ Seattle   │
└───────────┘ 

Here are the results of INTERSECT and EXCEPT clauses:

SELECT city FROM capitals INTERSECT SELECT city FROM weather;

┌───────────┐
│   city    │
│  varchar  │
├───────────┤
│ Amsterdam │
└───────────┘
SELECT city FROM capitals EXCEPT SELECT city FROM weather;

┌─────────┐
│  city   │
│ varchar │
├─────────┤
│ Berlin  │
└─────────┘ 

Notice that UNION, UNION ALL, INTERSECT, and EXCEPT require the two queries to return the same columns. The horizontal column matching is done by position. To merge rows across tables with different columns, use UNION (ALL) BY NAME:

SELECT * FROM capitals UNION BY NAME SELECT * FROM weather;

┌───────────┬─────────┬─────────┬────────────┐
│   city    │ country │ degrees │    date    │
│  varchar  │ varchar │  int32  │    date    │
├───────────┼─────────┼─────────┼────────────┤
│ Amsterdam │ NL      │         │            │
│ Seattle   │         │       8 │ 2022-10-12 │
│ Berlin    │ Germany │         │            │
│ Amsterdam │         │      10 │ 2022-10-14 │
└───────────┴─────────┴─────────┴────────────┘ 

Joins

The previous set operations combine rows of data (vertically). To combine tables horizontally we use join operations. Joins typically are based on the values in columns of the tables to find matches. There are two exceptions, positional and cross joins.

For data scientists working with rectangular data frames in which observations have a natural order, merging data horizontally is a standard operation. In relational databases this is a somewhat unnatural operation because relational tables do not work from a natural ordering of the data, they are based on keys and indices. The positional join matches row-by-row such that rows from both tables appear at least once:

select capitals.*, weather.* from capitals positional join weather;

┌───────────┬─────────┬───────────┬─────────┬────────────┐
│   city    │ country │   city    │ degrees │    date    │
│  varchar  │ varchar │  varchar  │  int32  │    date    │
├───────────┼─────────┼───────────┼─────────┼────────────┤
│ Amsterdam │ NL      │ Amsterdam │      10 │ 2022-10-14 │
│ Berlin    │ Germany │ Seattle   │       8 │ 2022-10-12 │
└───────────┴─────────┴───────────┴─────────┴────────────┘ 

The cross join is actually the simplest join, it returns all possible pairs of rows:

select capitals.*, weather.* from capitals cross join weather;

┌───────────┬─────────┬───────────┬─────────┬────────────┐
│   city    │ country │   city    │ degrees │    date    │
│  varchar  │ varchar │  varchar  │  int32  │    date    │
├───────────┼─────────┼───────────┼─────────┼────────────┤
│ Amsterdam │ NL      │ Amsterdam │      10 │ 2022-10-14 │
│ Amsterdam │ NL      │ Seattle   │       8 │ 2022-10-12 │
│ Berlin    │ Germany │ Amsterdam │      10 │ 2022-10-14 │
│ Berlin    │ Germany │ Seattle   │       8 │ 2022-10-12 │
└───────────┴─────────┴───────────┴─────────┴────────────┘ 

Joins are categorized as outer or inner joins depending on whether rows with matches are returned. An outer join returns rows that do not have any matches whereas the inner join returns only rows that get paired. The two tables in a join are called the left and right sides of the relation and outer joins are further classified as

  • Left outer join: all rows from the left side of the relation appear at least once.

  • Right outer join: all rows from the right side of the relation appear at least once.

  • Full outer join: all rows from both sides of the relation appear at least once.

To demonstrate the joins in DuckDB, let’s set up some simple tables:

CREATE TABLE weather (
      city           VARCHAR,
      temp_lo        INTEGER, -- minimum temperature on a day
      temp_hi        INTEGER, -- maximum temperature on a day
      prcp           REAL,
      date           DATE
  );
CREATE TABLE cities (
      name            VARCHAR,
      lat             DECIMAL,
      lon             DECIMAL
  );
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
      VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');
INSERT INTO weather (date, city, temp_hi, temp_lo)
      VALUES ('1994-11-29', 'Hayward', 54, 37);
FROM weather;

┌───────────────┬─────────┬─────────┬───────┬────────────┐
│     city      │ temp_lo │ temp_hi │ prcp  │    date    │
│    varchar    │  int32  │  int32  │ float │    date    │
├───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │
│ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │
│ Hayward       │      37 │      54 │       │ 1994-11-29 │
└───────────────┴─────────┴─────────┴───────┴────────────┘ 
INSERT INTO cities VALUES ('San Francisco', -194.0, 53.0);
FROM cities;

┌───────────────┬───────────────┬───────────────┐
│     name      │      lat      │      lon      │
│    varchar    │ decimal(18,3) │ decimal(18,3) │
├───────────────┼───────────────┼───────────────┤
│ San Francisco │      -194.000 │        53.000 │
└───────────────┴───────────────┴───────────────┘ 

An inner join between the tables on the columns that contain the city names will match the records for San Francisco:

SELECT * FROM weather INNER JOIN cities ON (weather.city = cities.name);

┌───────────────┬─────────┬─────────┬───────┬────────────┬───────────────┬───────────────┬───────────────┐
│     city      │ temp_lo │ temp_hi │ prcp  │    date    │     name      │      lat      │      lon      │
│    varchar    │  int32  │  int32  │ float │    date    │    varchar    │ decimal(18,3) │ decimal(18,3) │
├───────────────┼─────────┼─────────┼───────┼────────────┼───────────────┼───────────────┼───────────────┤
│ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │ San Francisco │      -194.000 │        53.000 │
│ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │ San Francisco │      -194.000 │        53.000 │
└───────────────┴─────────┴─────────┴───────┴────────────┴───────────────┴───────────────┴───────────────┘ 

Note that the values for lat and lon are repeated for every row in the weather table that matches the join in the relation. Because this is an inner join (the DuckDB default), and the weather table had no matching row for city Hayward, this city does not appear in the join result. We can change that by modifying the type of join to a left outer join:

SELECT * FROM weather LEFT OUTER JOIN cities ON (weather.city = cities.name);

┌───────────────┬─────────┬─────────┬───────┬────────────┬───────────────┬───────────────┬───────────────┐
│     city      │ temp_lo │ temp_hi │ prcp  │    date    │     name      │      lat      │      lon      │
│    varchar    │  int32  │  int32  │ float │    date    │    varchar    │ decimal(18,3) │ decimal(18,3) │
├───────────────┼─────────┼─────────┼───────┼────────────┼───────────────┼───────────────┼───────────────┤
│ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │ San Francisco │      -194.000 │        53.000 │
│ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │ San Francisco │      -194.000 │        53.000 │
│ Hayward       │      37 │      54 │       │ 1994-11-29 │               │               │               │
└───────────────┴─────────┴─────────┴───────┴────────────┴───────────────┴───────────────┴───────────────┘

Because the join is an outer join, rows that do not have matches in the relation are returned. Because the outer join is a left join, every row on the left side of the relation is returned (at least once). If you change the left- and right-hand side of the relation you can achieve the same result by using a right outer join:

SELECT * FROM cities  RIGHT OUTER JOIN weather ON (weather.city = cities.name);

┌───────────────┬───────────────┬───────────────┬───────────────┬─────────┬─────────┬───────┬────────────┐
│     name      │      lat      │      lon      │     city      │ temp_lo │ temp_hi │ prcp  │    date    │
│    varchar    │ decimal(18,3) │ decimal(18,3) │    varchar    │  int32  │  int32  │ float │    date    │
├───────────────┼───────────────┼───────────────┼───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │      -194.000 │        53.000 │ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │
│ San Francisco │      -194.000 │        53.000 │ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │
│               │               │               │ Hayward       │      37 │      54 │       │ 1994-11-29 │
└───────────────┴───────────────┴───────────────┴───────────────┴─────────┴─────────┴───────┴────────────┘ 

Now let’s add another record to the cities table without a matching record in the weather table:

INSERT INTO cities VALUES ('New York',40.7, -73.9);
FROM cities;

┌───────────────┬───────────────┬───────────────┐
│     name      │      lat      │      lon      │
│    varchar    │ decimal(18,3) │ decimal(18,3) │
├───────────────┼───────────────┼───────────────┤
│ San Francisco │      -194.000 │        53.000 │
│ New York      │        40.700 │       -73.900 │
└───────────────┴───────────────┴───────────────┘ 

A full outer join between the two tables ensures that rows from both sides of the relation appear at least once:

SELECT * FROM cities FULL OUTER JOIN weather ON (weather.city = cities.name);

┌───────────────┬───────────────┬───────────────┬───────────────┬─────────┬─────────┬───────┬────────────┐
│     name      │      lat      │      lon      │     city      │ temp_lo │ temp_hi │ prcp  │    date    │
│    varchar    │ decimal(18,3) │ decimal(18,3) │    varchar    │  int32  │  int32  │ float │    date    │
├───────────────┼───────────────┼───────────────┼───────────────┼─────────┼─────────┼───────┼────────────┤
│ San Francisco │      -194.000 │        53.000 │ San Francisco │      46 │      50 │  0.25 │ 1994-11-27 │
│ San Francisco │      -194.000 │        53.000 │ San Francisco │      43 │      57 │   0.0 │ 1994-11-29 │
│               │               │               │ Hayward       │      37 │      54 │       │ 1994-11-29 │
│ New York      │        40.700 │       -73.900 │               │         │         │       │            │
└───────────────┴───────────────┴───────────────┴───────────────┴─────────┴─────────┴───────┴────────────┘