4 Wrangling Data
4.1 Introduction
Wrangling data refers to the steps to organize the data in a structured format that is suitable for analytic processing. Typically, the result of data wrangling is data in a row-column layout where each analysis variable is in its own column and each observation is in its own row. Data often do not start out that way.
We saw a bit of data wrangling in the previous chapter (Section 3.3) when a nested JSON structure was flattened into a row-column format.
In addition to structuring the data, wrangling includes the following steps:
Cleaning involves removing or correcting inaccurate data, handling duplicates, and addressing anomalies that could impact the reliability of analyses. The focus is on enhancing data accuracy and reliability for analytic processing.
Enriching involves creating additional variables, incorporating external data, and combining the data with other data sources. Any new data sources you bring in need to be structured and cleaned as well.
Validation checks for inconsistencies and verifies data integrity. Your organization will have standards, for example how regions, customer information, names, ages, etc. are represented in data sets, and this step ensures that the standards are met.
4.2 tidyverse
In this chapter we concentrate on structuring, enriching, and combining data with the libraries in the tidyverse. This is an “opinionated” collection of R packages for data science, including dplyr, ggplot2, tibble, tidyr, purrr, stringr, and readr. dplyr, tidyr, and ggplot2 are arguably the most important packages in the collection (that is my “opinionated” view). For data wrangling we rely on dplyr and tidyr.
You can install the packages individually or grab all the tidyverse packages with

install.packages("tidyverse")
A great cheat sheet for wrangling data in R with dplyr and tidyr can be found here. This chapter draws on the contents of this cheat sheet.
Piping
The tidyverse packages share a common philosophy that makes it easy to use code across multiple packages and to combine them. An example of this common philosophy is piping. The following pipeline starts with the iris data frame. The data is piped to the filter function with the pipe operator %>%. The result of the filter operation is piped to the summarize function to compute the average petal width of the plants with sepal length greater than 6:
library(tidyverse)

iris %>% filter(Sepal.Length > 6) %>%
    summarize(average=mean(Petal.Width))
average
1 1.847541
By default the argument on the left side of the pipe operator is passed as the first argument to the function on the right side. The filter function call is really dplyr::filter(iris, Sepal.Length > 6). You can also pass the argument on the left of the pipe to a different argument on the right side of the pipe by using a dot:
x %>% f(y) is the same as f(x,y)
x %>% f(y, ., z) is the same as f(y,x,z)
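For instance, the dot can route the piped data frame into the data= argument of a modeling function; a minimal sketch:

```r
library(tidyverse)

# The dot sends iris into lm()'s data= argument instead of the first position
fit <- iris %>% lm(Petal.Width ~ Petal.Length, data = .)
coef(fit)
```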
Piping not only shows analytic operations as a sequence of steps, but also reduces the amount of code and makes it generally more readable. Learning to write good pipelines is time well spent.
Data wrangling with dplyr and tidyr can be organized into the following steps:
- Shaping the data
- Subsetting observations and/or variables
- Creating new variables
- Combining data sets
- Summarization
Summarizing data is discussed as a separate step in Chapter 6.
4.3 Shaping Data into Tabular Form
The goal of shaping the data is to change its structure into a tabular row-column form where each variable is in a separate column and each observation is in its own row.
Consider the classic airline passenger time series data from Box, Jenkins, and Reinsel (1976), showing monthly totals of international airline passengers between 1949 and 1960 (in thousands).
AirPassengers
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
The data are arranged very neatly, each year in a separate row, each month in a separate column. This is a great layout for time series analysis but maybe not the best layout for general analytic processing. The analysis variables are year, month, and the passenger count. Restructuring the AirPassengers data into a data frame with variables for year and count is easy because this data set is stored as an R time series.
str(AirPassengers)
Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
dat <- data.frame(count=as.matrix(AirPassengers),
                  time=time(AirPassengers),
                  year=round(time(AirPassengers),0))
head(dat)
count time year
1 112 1949.000 1949
2 118 1949.083 1949
3 132 1949.167 1949
4 129 1949.250 1949
5 121 1949.333 1949
6 135 1949.417 1949
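The month variable mentioned above can be recovered too; the cycle() function returns each observation's position within the period, 1 through 12 for a monthly series. A sketch (the name dat2 is illustrative):

```r
# cycle() gives the within-year position (1-12) of each monthly value
dat2 <- data.frame(count = as.numeric(AirPassengers),
                   year  = floor(time(AirPassengers)),
                   month = cycle(AirPassengers))
head(dat2)
```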
Gapminder Data
In general, the structure of a non-tabular format is not known to R and must be wrangled by the user. Consider the famous Gapminder dataset. The Gapminder Foundation is a Swedish non-profit that promotes the United Nations sustainability goals through use of statistics. The Gapminder data set tracks economic and social indicators like life expectancy and the GDP per capita of countries over time.
A wide format of the Gapminder data is available on GitHub here. We are following the material in this excellent online resource in wrangling the data in the wide format.
gap_wide <- read.csv(file="data/gapminder_wide.csv")
str(gap_wide)
'data.frame': 142 obs. of 38 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
$ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
The data are stored in wide format: annual values for the variables GDP, life expectancy, and population appear in separate columns. This is not unlike how the AirPassengers data is displayed, but gap_wide is not a time series object. The desired (“tidy”) way of structuring the data, where each variable is a column and each observation is a row, is a tabular structure with variables
- Continent
- Country
- Year
- GDP
- Life Expectancy
- Population
For each combination of continent and country there will be 12 observations, one for each year.
Gather columns
To move from wide to the desired long format, we use the tidyr::gather function. To do the opposite, moving from long to wide format, use the tidyr::spread function. To restructure the Gapminder data set from wide format into the desired format takes several steps.
The first step is to gather the columns that contain the values for GDP, life expectancy, and population into separate rows. Those are all columns in gap_wide except continent and country, so we can exclude those from the gathering operation with -c(continent, country).
gap_step1 <- gap_wide %>%
    gather(key=vartype_year,
           value=values,
           -c(continent, country))
head(gap_step1)
continent country vartype_year values
1 Africa Algeria gdpPercap_1952 2449.0082
2 Africa Angola gdpPercap_1952 3520.6103
3 Africa Benin gdpPercap_1952 1062.7522
4 Africa Botswana gdpPercap_1952 851.2411
5 Africa Burkina Faso gdpPercap_1952 543.2552
6 Africa Burundi gdpPercap_1952 339.2965
tail(gap_step1)
continent country vartype_year values
5107 Europe Sweden pop_2007 9031088
5108 Europe Switzerland pop_2007 7554661
5109 Europe Turkey pop_2007 71158647
5110 Europe United Kingdom pop_2007 60776238
5111 Oceania Australia pop_2007 20434176
5112 Oceania New Zealand pop_2007 4115771
The resulting data frame has 5112 rows and 4 columns. How does the number of rows come about?
- The original data has 142 rows, one for each combination of continent and country (the number of observations in gap_wide)
- Three variables (gdpPercap, lifeExp, pop) are measured for 12 years
- 142 × 3 × 12 = 5112
Separating character variables
The gather function turns all variables into key-value pairs; the key is the name of the variable. The column that contains the names of the variables after the gathering is called vartype_year. In the next step tidyr::separate is called to split the character column vartype_year into two columns, one for the variable type and one for the year. The convert=TRUE option attempts to convert the data type of the new columns from character to numeric data; this fails for the vartype column but succeeds for the year column.
gap_step2 <- gap_wide %>%
    gather(key=vartype_year,
           value=values,
           -c(continent, country)) %>%
    separate(vartype_year,
             into=c('vartype','year'),
             sep="_",
             convert=TRUE)
str(gap_step2)
'data.frame': 5112 obs. of 5 variables:
$ continent: chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ vartype : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
$ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
$ values : num 2449 3521 1063 851 543 ...
head(gap_step2)
continent country vartype year values
1 Africa Algeria gdpPercap 1952 2449.0082
2 Africa Angola gdpPercap 1952 3520.6103
3 Africa Benin gdpPercap 1952 1062.7522
4 Africa Botswana gdpPercap 1952 851.2411
5 Africa Burkina Faso gdpPercap 1952 543.2552
6 Africa Burundi gdpPercap 1952 339.2965
tail(gap_step2)
continent country vartype year values
5107 Europe Sweden pop 2007 9031088
5108 Europe Switzerland pop 2007 7554661
5109 Europe Turkey pop 2007 71158647
5110 Europe United Kingdom pop 2007 60776238
5111 Oceania Australia pop 2007 20434176
5112 Oceania New Zealand pop 2007 4115771
Spreading rows into columns
We are almost there, but not quite. We now have columns for continent, country, and year, but the values column contains values of different types: 1,704 observations for GDP, 1,704 observations for life expectancy, and 1,704 observations for population. To create separate columns from the rows we can reverse the gather operation with the spread function; it splits key-value pairs across multiple columns. The entries in the vartype column are used by the spreading operation as the names of the new columns.
gapminder <- gap_wide %>%
    gather(key=vartype_year,
           value=values,
           -c(continent, country)) %>%
    separate(vartype_year,
             into=c('vartype','year'),
             sep="_",
             convert=TRUE) %>%
    spread(vartype, values)
head(gapminder)
continent country year gdpPercap lifeExp pop
1 Africa Algeria 1952 2449.008 43.077 9279525
2 Africa Algeria 1957 3013.976 45.685 10270856
3 Africa Algeria 1962 2550.817 48.303 11000948
4 Africa Algeria 1967 3246.992 51.407 12760499
5 Africa Algeria 1972 4182.664 54.518 14760787
6 Africa Algeria 1977 4910.417 58.014 17152804
tail(gapminder)
continent country year gdpPercap lifeExp pop
1699 Oceania New Zealand 1982 17632.41 73.840 3210650
1700 Oceania New Zealand 1987 19007.19 74.320 3317166
1701 Oceania New Zealand 1992 18363.32 76.330 3437674
1702 Oceania New Zealand 1997 21050.41 77.550 3676187
1703 Oceania New Zealand 2002 23189.80 79.110 3908037
1704 Oceania New Zealand 2007 25185.01 80.204 4115771
In summary, we used gather to create one key-value pair of the vartype_year columns and values in step 1, separated out the year in step 2, and spread the remaining variable types back out into columns in step 3.
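A note on the toolchain: in current tidyr releases, gather and spread are superseded by pivot_longer and pivot_wider, which can also split the column names in one step via names_sep. A sketch of an equivalent pipeline (treat as an outline rather than a verified run against this exact file):

```r
library(tidyverse)

# One pivot_longer call splits "gdpPercap_1952" into vartype and year;
# pivot_wider then spreads the three variable types back into columns.
gapminder2 <- gap_wide %>%
  pivot_longer(cols = -c(continent, country),
               names_to = c("vartype", "year"),
               names_sep = "_",
               names_transform = list(year = as.integer)) %>%
  pivot_wider(names_from = vartype, values_from = value)
```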
4.4 Subsetting
A subset operation reduces the number of observations (rows) or variables (columns) of a data set. While filter is the most important operation to subset rows, there are a number of other functions in dplyr that reduce the rows. On the other hand, to subset columns with dplyr there is only one function, dplyr::select.
Subsetting Rows
The following statements create a data frame of dimension 100 x 2, containing two columns of integers randomly selected from 1–10 with replacement:
set.seed(76)
dat <- data.frame(
    x = sample(10, 100, rep = TRUE),
    y = sample(10, 100, rep = TRUE)) %>%
    glimpse()
Rows: 100
Columns: 2
$ x <int> 5, 1, 2, 10, 7, 5, 6, 10, 8, 6, 4, 6, 3, 1, 8, 5, 7, 6, 4, 1, 3, 10,…
$ y <int> 3, 5, 9, 8, 7, 10, 4, 1, 10, 6, 9, 6, 2, 9, 3, 8, 9, 7, 4, 3, 2, 3, …
filter
filter extracts rows that meet a logical condition. The following statement selects the rows for which \(x \in \{1,5\}\) and \(y \in \{3,8\}\).
dat %>% filter(x %in% c(1,5) & y %in% c(3,8))
x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3
You can also list the conditions that are combined with logical “and”, separated with commas:
dat %>% filter(x %in% c(1,5), y %in% c(3,8))
x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3
The following statement extracts the rows where the value of y exceeds 3.5 times its standard deviation:

dat %>% filter(y > 3.5*sd(y))
x y
1 5 10
2 8 10
3 7 10
4 1 10
5 8 10
6 1 10
7 9 10
8 9 10
9 6 10
10 7 10
11 4 10
12 5 10
Another subsetting operation is to remove duplicate observations from a data set with distinct. When applied to a subset of the columns, distinct returns the combinations of their unique values in the data.
distinct
dat %>% distinct(x)
x
1 5
2 1
3 2
4 10
5 7
6 6
7 8
8 4
9 3
10 9
If you specify .keep_all=TRUE, all other variables in the data set are kept as well. If multiple combinations occur, the function retains the first occurrence.

dat %>% distinct(x, .keep_all=TRUE)
x y
1 5 3
2 1 5
3 2 9
4 10 8
5 7 7
6 6 4
7 8 10
8 4 9
9 3 2
10 9 3
To remove all duplicates in a data frame, simply call distinct on the data frame.

nuniq <- dat %>% distinct() %>% summarize(count=n())
nuniq
count
1 65
There are 65 unique rows of data in the dat
data frame.
Other subsetting functions
Other functions for subsetting rows in dplyr are:

- dplyr::sample_frac: randomly select a proportion of the rows
- dplyr::sample_n: randomly select a fixed number of rows
- dplyr::slice: select rows by position, for example slice(dat, 4:10) extracts rows 4–10
- dplyr::slice_head: selects the first rows
- dplyr::slice_tail: selects the last rows
- dplyr::slice_min: select rows with the smallest values
- dplyr::slice_max: select rows with the largest values
For example, the next statement selects the observations with the five largest values for the Sepal.Width variable in the iris data set. Note that ties are included, which is why six rows are returned.
iris %>% slice_max(Sepal.Width, n=5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.5 4.2 1.4 0.2 setosa
3 5.2 4.1 1.5 0.1 setosa
4 5.8 4.0 1.2 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
Subsetting (selecting) Columns
select
To subset columns there is only one statement in dplyr, the select statement. However, it is very versatile because of its many helper functions.

The basic usage is to list the column names being selected:
gapminder %>% select(gdpPercap, lifeExp) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
mnGDP sdLife
1 7215.327 12.91711
You can also specify an exclusion with a negative selection
gapminder %>% select(-country, -continent, -pop) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
mnGDP sdLife
1 7215.327 12.91711
or
gapminder %>% select(-c(country, continent, pop)) %>%
    summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))
mnGDP sdLife
1 7215.327 12.91711
Helper functions
Here are important helper functions for dplyr::select:

- select(contains(".")): select columns whose name contains a character string
- select(ends_with("Length")): select columns whose name ends in the specified string
- select(starts_with("Sepal")): select columns whose name starts with the specified string
- select(everything()): select all columns
- select(matches(".t.")): select the columns whose name matches a regular expression
- select(num_range("x",1:5)): select the columns named x1, x2, …, x5
- select(one_of("Species","Genus")): select columns whose names are in the specified group of names
- select(Sepal.Length:Petal.Width): select all columns between Sepal.Length and Petal.Width, including those columns
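For instance, the starts_with() helper keeps the two sepal columns of the iris data:

```r
library(tidyverse)

# Select only the columns whose names begin with "Sepal"
iris %>% select(starts_with("Sepal")) %>% head(3)
```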
4.5 Creating New Variables
The principal function to create new variables in a data frame is dplyr::mutate. Variations are mutate_each, which applies a function to every column, and transmute, which drops the original columns.

The following statements compute the GDP as the product of per-capita GDP and population size and find the top-5 countries by GDP in Asia in 2007:
gapminder %>%
    filter(continent == "Asia", year == 2007) %>%
    mutate(GDP = gdpPercap * pop) %>%
    select(country, year, gdpPercap, GDP) %>%
    slice_max(GDP, n=5)
country year gdpPercap GDP
1 China 2007 4959.115 6.539501e+12
2 Japan 2007 31656.068 4.035135e+12
3 India 2007 2452.210 2.722925e+12
4 Korea Rep. 2007 23348.140 1.145105e+12
5 Iran 2007 11605.714 8.060583e+11
The variable GDP created in this operation is transient. It is instantiated for the duration of the pipeline and is not added to the gapminder data frame. If you want to add a new variable to an existing data frame you need to assign the result to a return object.
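For example, assigning the pipeline's result makes the new column permanent (gap2 is just an illustrative name):

```r
library(tidyverse)

# The assignment persists GDP as a column of the new data frame gap2
gap2 <- gapminder %>% mutate(GDP = gdpPercap * pop)
```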
4.6 Combining Data Sets
The process of combining tables is based on joins, set operations, or bindings. A join uses the values in specific columns of one data set to match records with another data set. A set operation combines or compares the rows of two data sets without matching on key columns. Appending rows of one table to another table, or columns of one table to another table, are binding operations. What happens to columns that exist in one data set but not in the other during the append depends on the implementation. Similarly, what happens to columns that share the same name when tables are merged horizontally depends on the implementation.
To show the various join and set operations, let’s create three small data frames of cities and weather information.
capitals1 <- data.frame(city=c('Amsterdam','Berlin'),
                        country=c('NL','Germany'))
capitals2 <- data.frame(city=c('Amsterdam','Washington, D.C.'),
                        country=c('NL','U.S.A.'))
weather1 <- data.frame(city=c('Amsterdam','Seattle'),
                       degrees=c(10,8),
                       date=c("2022-10-14","2022-10-12"))
capitals1
city country
1 Amsterdam NL
2 Berlin Germany
capitals2
city country
1 Amsterdam NL
2 Washington, D.C. U.S.A.
weather1
city degrees date
1 Amsterdam 10 2022-10-14
2 Seattle 8 2022-10-12
Set Operations
We distinguish three set operations: the intersection of rows that appear in two data sets, the union of the rows, and the set difference of the rows. dplyr supports set operations with the following functions:

- intersect(x, y): finds all rows in both x and y.
- union(x, y): finds all rows in either x or y, excluding duplicates.
- union_all(x, y): finds all rows in either x or y, including duplicates.
- setdiff(x, y): finds all rows in x that are not in y.
- symdiff(x, y): computes the symmetric difference, i.e. all rows in x that are not in y and all rows in y that are not in x.
- setequal(x, y): returns TRUE if x and y contain the same rows (ignoring order).
Note that except for union_all the functions that return rows remove duplicates in x and y.
For the two data frames of capitals, here are the results of the various set operations.
dplyr::intersect(capitals1, capitals2)
city country
1 Amsterdam NL
dplyr::setdiff(capitals1, capitals2)
city country
1 Berlin Germany
dplyr::union(capitals1, capitals2)
city country
1 Amsterdam NL
2 Berlin Germany
3 Washington, D.C. U.S.A.
dplyr::union_all(capitals1, capitals2)
city country
1 Amsterdam NL
2 Berlin Germany
3 Amsterdam NL
4 Washington, D.C. U.S.A.
dplyr::symdiff(capitals1, capitals2)
city country
1 Berlin Germany
2 Washington, D.C. U.S.A.
Joins
Set operations combine or reduce rows of data (vertically). Join operations combine data sets horizontally. Joins typically are based on the values in columns of the data sets to find matches.
Joins are categorized as outer or inner joins depending on whether rows with matches are returned. An outer join returns rows that do not have any matches whereas the inner join returns only rows that get paired. The two data frames in a join are called the left and right sides of the relation and outer joins are further classified as
Left outer join: all rows from the left side of the relation appear at least once.
Right outer join: all rows from the right side of the relation appear at least once.
Full outer join: all rows from both sides of the relation appear at least once.
To demonstrate the joins in dplyr let's set up some simple tables:
weather <- data.frame(name=c('San Francisco','San Francisco','Hayward'),
                      temp_lo=c(46,43,37),
                      temp_hi=c(50,57,54),
                      prcp=c(0.25,0.0,NA),
                      date=c('1994-11-27','1994-11-29','1994-11-29'))
cities <- data.frame(name=c('San Francisco'),
                     lat=c(-194.0),
                     lon=c(53.0))
weather
name temp_lo temp_hi prcp date
1 San Francisco 46 50 0.25 1994-11-27
2 San Francisco 43 57 0.00 1994-11-29
3 Hayward 37 54 NA 1994-11-29
cities
name lat lon
1 San Francisco -194 53
An inner join between the data frames on the columns that contain the city names will match the records for San Francisco:
dplyr::inner_join(weather, cities, by="name")
name temp_lo temp_hi prcp date lat lon
1 San Francisco 46 50 0.25 1994-11-27 -194 53
2 San Francisco 43 57 0.00 1994-11-29 -194 53
Note that the values for lat and lon are repeated for every row in the weather table that matches the join in the relation. Because this is an inner join and the cities table has no matching row for city Hayward, this city does not appear in the join result. We can change that by modifying the type of join to a left outer join:
dplyr::left_join(weather, cities, by="name")
name temp_lo temp_hi prcp date lat lon
1 San Francisco 46 50 0.25 1994-11-27 -194 53
2 San Francisco 43 57 0.00 1994-11-29 -194 53
3 Hayward 37 54 NA 1994-11-29 NA NA
Because the join is an outer join, rows that do not have matches in the relation are returned. Because the outer join is a left join, every row on the left side of the relation is returned (at least once). If you swap the left- and right-hand sides of the relation you can achieve the same result by using a right outer join:
dplyr::right_join(cities, weather, by="name")
name lat lon temp_lo temp_hi prcp date
1 San Francisco -194 53 46 50 0.25 1994-11-27
2 San Francisco -194 53 43 57 0.00 1994-11-29
3 Hayward NA NA 37 54 NA 1994-11-29
The left join retains all observations in the left data frame (first argument). The right join retains all observations in the right data frame (second argument).
Now let's add another record to the cities data frame without a matching record in the weather data frame:
cities <- dplyr::bind_rows(cities,
                           data.frame(name="New York",
                                      lat=40.7,
                                      lon=-73.9))
cities
name lat lon
1 San Francisco -194.0 53.0
2 New York 40.7 -73.9
A full outer join between the cities and weather data frames ensures that rows from both sides of the relation appear at least once:
dplyr::full_join(cities, weather, by="name")
name lat lon temp_lo temp_hi prcp date
1 San Francisco -194.0 53.0 46 50 0.25 1994-11-27
2 San Francisco -194.0 53.0 43 57 0.00 1994-11-29
3 New York 40.7 -73.9 NA NA NA <NA>
4 Hayward NA NA 37 54 NA 1994-11-29
Bindings
For data scientists working with rectangular data frames in which observations have a natural order, merging data horizontally is a standard operation. Observations are matched by position and not according to the values in key columns. In the world of relational databases, such a merge is called a positional join and is a somewhat unnatural operation because relational tables are not based on a natural ordering of the data; they are based on keys and indices.

When working with statistical data sets, merging by position is common since the data often do not have keys. The positional join, or column binding, matches data frames row by row such that rows from both tables appear at least once:
dplyr::bind_cols(capitals1, weather1)
New names:
• `city` -> `city...1`
• `city` -> `city...3`
city...1 country city...3 degrees date
1 Amsterdam NL Amsterdam 10 2022-10-14
2 Berlin Germany Seattle 8 2022-10-12
Note that both data frames contribute a city variable and these are renamed to resolve the name collision.

A similar operation binding rows stacks one data frame on top of another. The result from dplyr::bind_rows contains all columns that appear in any of the inputs; unobserved combinations are set to NA.
dplyr::bind_rows(capitals1, capitals2)
city country
1 Amsterdam NL
2 Berlin Germany
3 Amsterdam NL
4 Washington, D.C. U.S.A.
dplyr::bind_rows(capitals1, weather1)
city country degrees date
1 Amsterdam NL NA <NA>
2 Berlin Germany NA <NA>
3 Amsterdam <NA> 10 2022-10-14
4 Seattle <NA> 8 2022-10-12