4  Wrangling Data

4.1 Introduction

Wrangling data refers to the steps taken to organize data into a structured format suitable for analytic processing. Typically, the result of data wrangling is data in a row-column layout where each analysis variable is in its own column and each observation is in its own row. Data often do not start out that way.

We saw a bit of data wrangling in the previous chapter (Section 3.3) when a nested JSON structure was flattened into a row-column format.

In addition to structuring the data, wrangling includes the following steps:

  1. Cleaning involves removing or correcting inaccurate data, handling duplicates, and addressing anomalies that could impact the reliability of analyses. The focus is on enhancing data accuracy and reliability for analytic processing.

  2. Enriching involves creating additional variables, incorporating external data, and combining the data with other data sources. Any new data sources you bring in need to be structured and cleaned as well.

  3. Validation checks for inconsistencies and verifies data integrity. Your organization will have standards, for example how regions, customer information, names, ages, etc. are represented in data sets, and this step ensures that the standards are met.

4.2 tidyverse

In this chapter we concentrate on structuring, enriching, and combining data with the libraries in the tidyverse. This is an “opinionated” collection of R packages for data science, including dplyr, ggplot2, tibble, tidyr, purrr, stringr, and readr. dplyr, tidyr, and ggplot2 are arguably the most important packages in the collection (that is my “opinionated” view). For data wrangling we rely on dplyr and tidyr.

You can install the packages individually or grab all the tidyverse packages with

install.packages("tidyverse")

A great cheat sheet for wrangling data in R with dplyr and tidyr is available from Posit (formerly RStudio). This chapter draws on the contents of that cheat sheet.

Piping

The tidyverse packages share a common philosophy, which makes it easy to use code across multiple packages and to combine them. An example of this common philosophy is piping. The following pipeline starts with the iris data frame. The data is piped to the filter function with the pipe operator %>%. The result of the filter operation is piped to the summarize function to compute the average petal width of the plants with sepal length greater than 6:

library(tidyverse)
iris %>% 
    filter(Sepal.Length > 6) %>%
    summarize(average=mean(Petal.Width))
   average
1 1.847541

By default the argument on the left side of the pipe operator is passed as the first argument to the function on the right side. The filter function call is really dplyr::filter(iris, Sepal.Length > 6). You can also pass the argument on the left of the pipe to a different argument on the right side of the pipe by using a dot:

x %>% f(y) is the same as f(x,y)

x %>% f(y, ., z) is the same as f(y,x,z)
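A minimal illustration of the dot placeholder using base R functions; it assumes the pipe operator is available, e.g. after library(tidyverse):

```r
library(magrittr)   # provides %>%; also loaded with the tidyverse

# Without a dot, the left-hand side becomes the first argument:
"hello" %>% toupper()           # same as toupper("hello"), returns "HELLO"

# With a dot, the left-hand side is piped into that position:
"a,b,c" %>% gsub(",", ";", .)   # same as gsub(",", ";", "a,b,c"), returns "a;b;c"
```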

Piping not only shows analytic operations as a sequence of steps, but also reduces the amount of code and generally makes it more readable. Learning to write good pipelines is an important part of working effectively in the tidyverse.

Data wrangling with dplyr and tidyr can be organized into the following steps:

  1. Shaping the data
  2. Subsetting observations and/or variables
  3. Creating new variables
  4. Combining data sets
  5. Summarization

Summarizing data is discussed as a separate step in Chapter 6.

4.3 Shaping Data into Tabular Form

The goal of shaping the data is to change its structure into a tabular row-column form where each variable is in a separate column and each observation is in its own row.

Consider the classic airline passenger time series data from Box, Jenkins, and Reinsel (1976), showing monthly totals of international airline passengers between 1949 and 1960 (in thousands).

AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

The data are arranged very neatly, each year in a separate row, each month in a separate column. This is a great layout for time series analysis but maybe not the best layout for general analytic processing. The analysis variables are year, month, and the passenger count. Restructuring the AirPassengers data into a data frame with variables for count, time, and year is easy because this data set is stored as an R time series.

str(AirPassengers)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
dat <- data.frame(count=as.matrix(AirPassengers), 
                  time=time(AirPassengers),
                  year=floor(time(AirPassengers)))
head(dat)
  count     time year
1   112 1949.000 1949
2   118 1949.083 1949
3   132 1949.167 1949
4   129 1949.250 1949
5   121 1949.333 1949
6   135 1949.417 1949
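The month variable mentioned above can be recovered from the time series as well. A sketch using the base R cycle function, which returns each observation's position within the yearly cycle (1–12):

```r
# Map the cycle position (1-12) to month abbreviations:
dat$month <- factor(month.abb[cycle(AirPassengers)], levels=month.abb)
head(dat[, c("year", "month", "count")])
```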

Gapminder Data

In general, the structure of a non-tabular format is not known to R and must be wrangled by the user. Consider the famous Gapminder dataset. The Gapminder Foundation is a Swedish non-profit that promotes the United Nations sustainability goals through use of statistics. The Gapminder data set tracks economic and social indicators like life expectancy and the GDP per capita of countries over time.

A wide format of the Gapminder data is available on GitHub. In wrangling the wide format we follow the material of an excellent online resource.

gap_wide <- read.csv(file="data/gapminder_wide.csv")
str(gap_wide)
'data.frame':   142 obs. of  38 variables:
 $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
 $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
 $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
 $ gdpPercap_1957: num  3014 3828 960 918 617 ...
 $ gdpPercap_1962: num  2551 4269 949 984 723 ...
 $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
 $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
 $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
 $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
 $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
 $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
 $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
 $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
 $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
 $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
 $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
 $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
 $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
 $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
 $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
 $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
 $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
 $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
 $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
 $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
 $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
 $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
 $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
 $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
 $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
 $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
 $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
 $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
 $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
 $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
 $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
 $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
 $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...

The data are stored in wide format: annual values for the variables GDP, life expectancy, and population appear in separate columns. This is not unlike how the AirPassengers data is displayed, but gap_wide is not a time series object. The desired (“tidy”) way of structuring the data, where each variable is a column and each observation is a row, is a tabular structure with the variables

  • Continent
  • Country
  • Year
  • GDP
  • Life Expectancy
  • Population

For each combination of continent and country there will be 12 observations, one for each year.

Gather columns

To move from wide to the desired long format, we use the tidyr::gather function. To do the opposite, moving from long to wide format, use the tidyr::spread function. To restructure the Gapminder data set from wide format into the desired format takes several steps.

The first step is to gather the columns that contain the values for GDP, life expectancy, and population into separate rows. Those are all columns in gap_wide except continent and country, so we can exclude those from the gathering operation with -c(continent, country).

gap_step1 <- gap_wide %>% 
    gather(key=vartype_year,
          value=values,
          -c(continent, country))
head(gap_step1)
  continent      country   vartype_year    values
1    Africa      Algeria gdpPercap_1952 2449.0082
2    Africa       Angola gdpPercap_1952 3520.6103
3    Africa        Benin gdpPercap_1952 1062.7522
4    Africa     Botswana gdpPercap_1952  851.2411
5    Africa Burkina Faso gdpPercap_1952  543.2552
6    Africa      Burundi gdpPercap_1952  339.2965
tail(gap_step1)
     continent        country vartype_year   values
5107    Europe         Sweden     pop_2007  9031088
5108    Europe    Switzerland     pop_2007  7554661
5109    Europe         Turkey     pop_2007 71158647
5110    Europe United Kingdom     pop_2007 60776238
5111   Oceania      Australia     pop_2007 20434176
5112   Oceania    New Zealand     pop_2007  4115771

The resulting data frame has 5112 rows and 4 columns. How does the number of rows come about?

  • The original data has 142 rows, one for each combination of continent and country.
  • Three variables (gdpPercap, lifeExp, pop) are measured for 12 years each, giving 36 value columns.
  • Gathering 36 columns for each of the 142 rows yields 142 × 36 = 5112 rows.
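The arithmetic can be checked directly:

```r
# 3 variable types x 12 years = 36 gathered columns per original row
142 * 3 * 12   # 5112, matching the number of rows of gap_step1
```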

Separating character variables

The gather function turns all variables into key-value pairs; the key is the name of the variable. The column that contains the names of the variables after the gathering is called vartype_year. In the next step tidyr::separate is called to split the character column vartype_year into two columns, one for the variable type and one for the year. The convert=TRUE option attempts to convert the data type of the new columns from character to numeric data; this fails for the vartype column but succeeds for the year column.

gap_step2 <- gap_wide %>% 
    gather(key =vartype_year,
          value=values,
          -c(continent, country)) %>%
    separate(vartype_year,
           into   =c('vartype','year'),
           sep    ="_",
           convert=TRUE)
str(gap_step2)
'data.frame':   5112 obs. of  5 variables:
 $ continent: chr  "Africa" "Africa" "Africa" "Africa" ...
 $ country  : chr  "Algeria" "Angola" "Benin" "Botswana" ...
 $ vartype  : chr  "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
 $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
 $ values   : num  2449 3521 1063 851 543 ...
head(gap_step2)
  continent      country   vartype year    values
1    Africa      Algeria gdpPercap 1952 2449.0082
2    Africa       Angola gdpPercap 1952 3520.6103
3    Africa        Benin gdpPercap 1952 1062.7522
4    Africa     Botswana gdpPercap 1952  851.2411
5    Africa Burkina Faso gdpPercap 1952  543.2552
6    Africa      Burundi gdpPercap 1952  339.2965
tail(gap_step2)
     continent        country vartype year   values
5107    Europe         Sweden     pop 2007  9031088
5108    Europe    Switzerland     pop 2007  7554661
5109    Europe         Turkey     pop 2007 71158647
5110    Europe United Kingdom     pop 2007 60776238
5111   Oceania      Australia     pop 2007 20434176
5112   Oceania    New Zealand     pop 2007  4115771

Spreading rows into columns

We are almost there, but not quite. We now have columns for continent, country, and year, but the values column contains values of different types: 1,704 observations for GDP, 1,704 observations for life expectancy, and 1,704 observations for population. To create separate columns from the rows we can reverse the gather operation with the spread function—it splits key-value pairs across multiple columns. The entries in the vartype column are used by the spreading operation as the names of the new columns.

gapminder <- 
    gap_wide %>% 
    gather(key =vartype_year,
          value=values,
          -c(continent, country)) %>%
    separate(vartype_year,
           into   =c('vartype','year'),
           sep    ="_",
           convert=TRUE) %>%
    spread(vartype, values)

head(gapminder)
  continent country year gdpPercap lifeExp      pop
1    Africa Algeria 1952  2449.008  43.077  9279525
2    Africa Algeria 1957  3013.976  45.685 10270856
3    Africa Algeria 1962  2550.817  48.303 11000948
4    Africa Algeria 1967  3246.992  51.407 12760499
5    Africa Algeria 1972  4182.664  54.518 14760787
6    Africa Algeria 1977  4910.417  58.014 17152804
tail(gapminder)
     continent     country year gdpPercap lifeExp     pop
1699   Oceania New Zealand 1982  17632.41  73.840 3210650
1700   Oceania New Zealand 1987  19007.19  74.320 3317166
1701   Oceania New Zealand 1992  18363.32  76.330 3437674
1702   Oceania New Zealand 1997  21050.41  77.550 3676187
1703   Oceania New Zealand 2002  23189.80  79.110 3908037
1704   Oceania New Zealand 2007  25185.01  80.204 4115771

In summary, we used gather to create one key-value pair of the variable_year columns and values in step 1, separated out the year in step 2, and spread the remaining variable types back out into columns in step 3.
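In current versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same pipeline in the newer interface, assuming the same gap_wide data frame; pivot_longer can split the column names directly, so the separate step is no longer needed:

```r
library(dplyr)
library(tidyr)

gapminder_piv <- gap_wide %>%
    pivot_longer(cols=-c(continent, country),
                 names_to =c("vartype", "year"),
                 names_sep="_",
                 names_transform=list(year=as.integer),
                 values_to="values") %>%
    pivot_wider(names_from=vartype, values_from=values)
```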

4.4 Subsetting

A subset operation reduces the number of observations (rows) or variables (columns) of a data set. While filter is the most important function for subsetting rows, there are a number of other functions in dplyr that reduce the number of rows. To subset columns, on the other hand, dplyr provides only one function, dplyr::select.

Subsetting Rows

The following statements create a data frame of dimension 100 x 2, containing two columns of integers randomly selected from 1–10 with replacement:

set.seed(76)
dat <- data.frame(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)) %>% 
    glimpse()
Rows: 100
Columns: 2
$ x <int> 5, 1, 2, 10, 7, 5, 6, 10, 8, 6, 4, 6, 3, 1, 8, 5, 7, 6, 4, 1, 3, 10,…
$ y <int> 3, 5, 9, 8, 7, 10, 4, 1, 10, 6, 9, 6, 2, 9, 3, 8, 9, 7, 4, 3, 2, 3, …

filter

filter extracts rows that meet a logical condition. The following statement selects the rows for which \(x \in \{1, 5\}\) and \(y \in \{3, 8\}\).

dat %>% filter(x %in% c(1,5) & y %in% c(3,8))
  x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3

You can also list the conditions that are combined with logical “and”, separated with commas:

dat %>% filter(x %in% c(1,5), y %in% c(3,8))
  x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3

The following statement extracts the rows where the value of y exceeds 3.5 times its standard deviation:

dat %>% filter(y > 3.5*sd(y))
   x  y
1  5 10
2  8 10
3  7 10
4  1 10
5  8 10
6  1 10
7  9 10
8  9 10
9  6 10
10 7 10
11 4 10
12 5 10

distinct

Another subsetting operation is to remove duplicate observations from a data set with distinct. When applied to a subset of the columns, distinct returns the combinations of their unique values in the data.

dat %>% distinct(x)
    x
1   5
2   1
3   2
4  10
5   7
6   6
7   8
8   4
9   3
10  9

If you specify .keep_all=TRUE, all other variables in the data set are kept as well. If multiple combinations occur, the function retains the first occurrence.

dat %>% distinct(x, .keep_all=TRUE)
    x  y
1   5  3
2   1  5
3   2  9
4  10  8
5   7  7
6   6  4
7   8 10
8   4  9
9   3  2
10  9  3

To remove all duplicates in a data frame, simply call distinct on the data frame.

nuniq <- dat %>% distinct() %>% summarize(count=n())
nuniq
  count
1    65

There are 65 unique rows of data in the dat data frame.

Other subsetting functions

Other functions subsetting rows in dplyr are

  1. dplyr::sample_frac: randomly select a proportion of the rows
  2. dplyr::sample_n: randomly select a fixed number of rows
  3. dplyr::slice: select rows by position, for example slice(dat,4:10) extracts rows 4–10
  4. dplyr::slice_head: select the first rows
  5. dplyr::slice_tail: select the last rows
  6. dplyr::slice_min: select rows with the smallest values
  7. dplyr::slice_max: select rows with the largest values

For example, the next statement selects the observations with the five largest values of the Sepal.Width variable in the iris data set. Note that six rows are returned because two observations are tied at Sepal.Width 3.9.

iris %>% slice_max(Sepal.Width, n=5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.7         4.4          1.5         0.4  setosa
2          5.5         4.2          1.4         0.2  setosa
3          5.2         4.1          1.5         0.1  setosa
4          5.8         4.0          1.2         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa
6          5.4         3.9          1.3         0.4  setosa
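The tie at Sepal.Width 3.9 is why six rows come back. slice_max accepts a with_ties argument to force exactly n rows, breaking ties by row order:

```r
library(dplyr)

# with_ties=FALSE keeps exactly n rows:
iris %>% slice_max(Sepal.Width, n=5, with_ties=FALSE)
```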

Subsetting (selecting) Columns

select

To subset columns there is only one function in dplyr, select. However, it is very versatile because of its many helper functions.

The basic usage is to list the column names being selected:

gapminder %>% select(gdpPercap, lifeExp) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
     mnGDP   sdLife
1 7215.327 12.91711

You can also specify an exclusion with a negative selection

gapminder %>% select(-country, -continent, -pop) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
     mnGDP   sdLife
1 7215.327 12.91711

or

gapminder %>% select(-c(country, continent, pop)) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
     mnGDP   sdLife
1 7215.327 12.91711

Helper functions

Here are important helper functions for dplyr::select

  1. select(contains(".")): select columns whose name contains a character string
  2. select(ends_with("Length")): select columns whose name ends in the specified string
  3. select(starts_with("Sepal")): select columns whose name starts with the specified string
  4. select(everything()): select all columns
  5. select(matches(".t.")): select the columns whose name matches a regular expression
  6. select(num_range("x",1:5)): select the columns named x1, x2, …, x5
  7. select(one_of("Species","Genus")): select columns whose names are in the specified group of names
  8. select(Sepal.Length:Petal.Width): select all columns between Sepal.Length and Petal.Width, including those columns
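A couple of the helpers applied to the iris data:

```r
library(dplyr)

# Columns whose names end in "Width":
iris %>% select(ends_with("Width")) %>% names()
# "Sepal.Width" "Petal.Width"

# All columns between Sepal.Length and Petal.Width, inclusive:
iris %>% select(Sepal.Length:Petal.Width) %>% names()
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
```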

4.5 Creating New Variables

The principal function to create new variables in a data frame is dplyr::mutate. Variations include transmute, which drops the original columns, and mutate combined with across, which applies a function to several columns at once.

The following statements compute the GDP as the product of per-capita GDP and population size and find the top-5 countries by GDP in Asia in 2007:

gapminder %>%
    filter(continent == "Asia", year == 2007) %>%
    mutate(GDP = gdpPercap * pop) %>%
    select(country, year, gdpPercap, GDP) %>%
    slice_max(GDP,n=5)
     country year gdpPercap          GDP
1      China 2007  4959.115 6.539501e+12
2      Japan 2007 31656.068 4.035135e+12
3      India 2007  2452.210 2.722925e+12
4 Korea Rep. 2007 23348.140 1.145105e+12
5       Iran 2007 11605.714 8.060583e+11
Note

The variable GDP created in this operation is transient. It is instantiated for the duration of the pipeline and is not added to the gapminder data frame. If you want to add a new variable to an existing data frame, you need to assign the result of the pipeline to an object.
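A sketch of persisting the new column by assigning the pipeline's result back to the data frame:

```r
library(dplyr)

# Assigning the result adds GDP permanently to gapminder:
gapminder <- gapminder %>%
    mutate(GDP = gdpPercap * pop)
```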

4.6 Combining Data Sets

The process of combining tables is based on joins, set operations, or bindings. A join uses the values in specific columns of one data set to match records with another data set. A set operation combines the rows of two data sets by comparing entire rows, without matching on key columns. Appending the rows of one table to another table, or the columns of one table to another table, are binding operations. What happens to columns that exist in one data set but not in the other during the append depends on the implementation. Similarly, what happens to columns that share the same name when tables are merged horizontally depends on the implementation.

To show the various join and set operations, let’s create three small data frames of cities and weather information.

capitals1 <- data.frame(city=c('Amsterdam','Berlin'),
                       country=c('NL','Germany'))

capitals2 <- data.frame(city=c('Amsterdam','Washington, D.C.'),
                       country=c('NL','U.S.A.'))

weather1 <- data.frame(city=c('Amsterdam','Seattle'),
                      degrees=c(10,8),
                      date=c("2022-10-14","2022-10-12"))

capitals1
       city country
1 Amsterdam      NL
2    Berlin Germany
capitals2 
              city country
1        Amsterdam      NL
2 Washington, D.C.  U.S.A.
weather1
       city degrees       date
1 Amsterdam      10 2022-10-14
2   Seattle       8 2022-10-12

Set Operations

We distinguish three set operations: the intersection of rows that appear in two data sets, the union of the rows, and the set difference of the rows. dplyr supports set operations with the following functions:

  • intersect(x, y): finds all rows in both x and y.

  • union(x, y): finds all rows in either x or y, excluding duplicates.

  • union_all(x, y): finds all rows in either x or y, including duplicates.

  • setdiff(x, y): finds all rows in x that are not in y.

  • symdiff(x, y): computes the symmetric difference, i.e. all rows in x that are not in y and all rows in y that are not in x.

  • setequal(x, y): returns TRUE if x and y contain the same rows (ignoring order).

Note that except for union_all the functions that return rows remove duplicates in x and y.

For the two data frames of capitals, here are the results of the various set operations.

dplyr::intersect(capitals1, capitals2)
       city country
1 Amsterdam      NL
dplyr::setdiff(capitals1, capitals2)
    city country
1 Berlin Germany
dplyr::union(capitals1, capitals2)
              city country
1        Amsterdam      NL
2           Berlin Germany
3 Washington, D.C.  U.S.A.
dplyr::union_all(capitals1, capitals2)
              city country
1        Amsterdam      NL
2           Berlin Germany
3        Amsterdam      NL
4 Washington, D.C.  U.S.A.
dplyr::symdiff(capitals1, capitals2)
              city country
1           Berlin Germany
2 Washington, D.C.  U.S.A.
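setequal was listed above but not demonstrated; it compares the row sets of two data frames regardless of row order. A small check reusing capitals1:

```r
# Same rows in a different order still compare as equal:
dplyr::setequal(capitals1, capitals1[2:1, ])   # TRUE
```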

Joins

Set operations combine or reduce rows of data (vertically). Join operations combine data sets horizontally, typically matching rows based on the values in key columns.

Joins are categorized as outer or inner joins depending on whether rows without matches are returned. An outer join returns rows that do not have any matches, whereas an inner join returns only rows that get paired. The two data frames in a join are called the left and right sides of the relation, and outer joins are further classified as

  • Left outer join: all rows from the left side of the relation appear at least once.

  • Right outer join: all rows from the right side of the relation appear at least once.

  • Full outer join: all rows from both sides of the relation appear at least once.

To demonstrate the joins in dplyr let’s set up some simple tables:

weather <- data.frame(name=c('San Francisco','San Francisco','Hayward'),
                      temp_lo=c(46,43,37),
                      temp_hi=c(50,57,54),
                      prcp=c(0.25,0.0,NA),
                      date=c('1994-11-27','1994-11-29','1994-11-29'))

cities <- data.frame(name=c('San Francisco'),
                     lat=c(-194.0),
                     lon=c(53.0))


weather
           name temp_lo temp_hi prcp       date
1 San Francisco      46      50 0.25 1994-11-27
2 San Francisco      43      57 0.00 1994-11-29
3       Hayward      37      54   NA 1994-11-29
cities
           name  lat lon
1 San Francisco -194  53

An inner join between the data frames on the columns that contain the city names will match the records for San Francisco:

dplyr::inner_join(weather,cities,by="name")
           name temp_lo temp_hi prcp       date  lat lon
1 San Francisco      46      50 0.25 1994-11-27 -194  53
2 San Francisco      43      57 0.00 1994-11-29 -194  53

Note that the values for lat and lon are repeated for every row in the weather table that matches the join in the relation. Because this is an inner join and the weather table had no matching row for city Hayward, this city does not appear in the join result. We can change that by modifying the type of join to a left outer join:

dplyr::left_join(weather,cities,by="name")
           name temp_lo temp_hi prcp       date  lat lon
1 San Francisco      46      50 0.25 1994-11-27 -194  53
2 San Francisco      43      57 0.00 1994-11-29 -194  53
3       Hayward      37      54   NA 1994-11-29   NA  NA

Because the join is an outer join, rows that do not have matches in the relation are returned. Because the outer join is a left join, every row on the left side of the relation is returned (at least once). If you change the left- and right-hand side of the relation you can achieve the same result by using a right outer join:

dplyr::right_join(cities,weather,by="name")
           name  lat lon temp_lo temp_hi prcp       date
1 San Francisco -194  53      46      50 0.25 1994-11-27
2 San Francisco -194  53      43      57 0.00 1994-11-29
3       Hayward   NA  NA      37      54   NA 1994-11-29

The left join retains all observations in the left data frame (first argument). The right join retains all observations in the right data frame (second argument).

Now let’s add another record to the cities data frame without a matching record in the weather data frame:

cities <- dplyr::bind_rows(cities,data.frame(name="New York",
                                             lat=40.7,
                                             lon=-73.9))
cities 
           name    lat   lon
1 San Francisco -194.0  53.0
2      New York   40.7 -73.9

A full outer join between the cities and weather data frames ensures that rows from both sides of the relation appear at least once:

dplyr::full_join(cities,weather,by="name")
           name    lat   lon temp_lo temp_hi prcp       date
1 San Francisco -194.0  53.0      46      50 0.25 1994-11-27
2 San Francisco -194.0  53.0      43      57 0.00 1994-11-29
3      New York   40.7 -73.9      NA      NA   NA       <NA>
4       Hayward     NA    NA      37      54   NA 1994-11-29

Bindings

For data scientists working with rectangular data frames in which observations have a natural order, merging data horizontally is a standard operation. Observations are matched by position and not according to the values in key columns. In the world of relational databases, such a merge is called a positional join and is a somewhat unnatural operation: relational tables do not have a natural ordering of their rows; they are based on keys and indices.

When working with statistical data sets, merging by position is common because the data often do not have keys. The positional join, or column binding, matches data frames row by row; the inputs must have the same number of rows:

dplyr::bind_cols(capitals1,weather1)
New names:
• `city` -> `city...1`
• `city` -> `city...3`
   city...1 country  city...3 degrees       date
1 Amsterdam      NL Amsterdam      10 2022-10-14
2    Berlin Germany   Seattle       8 2022-10-12

Note that both data frames contribute a city variable; these are renamed to resolve the name collision.

A similar operation, binding rows, stacks one data frame on top of another. The result from dplyr::bind_rows contains all columns that appear in any of the inputs; columns missing from an input are filled with NA.

dplyr::bind_rows(capitals1,capitals2)
              city country
1        Amsterdam      NL
2           Berlin Germany
3        Amsterdam      NL
4 Washington, D.C.  U.S.A.
dplyr::bind_rows(capitals1,weather1)
       city country degrees       date
1 Amsterdam      NL      NA       <NA>
2    Berlin Germany      NA       <NA>
3 Amsterdam    <NA>      10 2022-10-14
4   Seattle    <NA>       8 2022-10-12