4 Wrangling Data
4.1 Introduction
Wrangling data refers to the steps to organize the data in a structured format that is suitable for analytic processing. Typically, the result of data wrangling is data in a row-column layout where each analysis variable is in its own column and each observation is in its own row. Data often do not start out that way.
We saw a bit of data wrangling in the previous chapter (Section 3.3) when a nested JSON structure was flattened into a row-column format.
In addition to structuring the data, wrangling includes the following steps:
Cleaning involves removing or correcting inaccurate data, handling duplicates, and addressing anomalies that could impact the reliability of analyses. The focus is on enhancing data accuracy and reliability for analytic processing.
Enriching involves creating additional variables, incorporating external data, and combining the data with other data sources. Any new data sources you bring needs to be structured and cleaned as well.
Validation checks for inconsistencies and verifies data integrity. Your organization will have standards, for example how regions, customer information, names, ages, etc. are represented in data sets, and this step ensures that the standards are met.
4.2 tidyverse
In this chapter we concentrate on structuring, enriching, and combining data with the libraries in the tidyverse. This is an “opinionated” collection of R packages for data science, including dplyr, ggplot2, tibble, tidyr, purrr, stringr, and readr. dplyr, tidyr, and ggplot2 are arguably the most important packages in the collection (that is my “opinionated” view). For data wrangling we rely on dplyr and tidyr.
You can install the packages individually or grab all the tidyverse packages with

install.packages("tidyverse")
A great cheat sheet for wrangling data in R with dplyr and tidyr can be found here. This chapter draws on the contents of this cheat sheet.
Piping
The tidyverse packages share a common philosophy, which makes it easy to use and combine code across multiple packages. An example of this common philosophy is piping. The following pipeline starts with the iris data frame. The data frame is piped to the filter function with the pipe operator %>%. The result of the filter operation is piped to the summarize function to compute the average petal width of the plants with sepal length greater than 6:
library(tidyverse)
iris %>%
filter(Sepal.Length > 6) %>%
summarize(average=mean(Petal.Width))

   average
1 1.847541
By default the argument on the left side of the pipe operator is passed as the first argument to the function on the right side. The filter call above is really dplyr::filter(iris, Sepal.Length > 6). You can also pass the argument on the left of the pipe to a different argument of the function on the right side of the pipe by using a dot:
x %>% f(y) is the same as f(x,y)
x %>% f(y, ., z) is the same as f(y,x,z)
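To see the dot in action, here is a small sketch (assuming the magrittr pipe, which dplyr re-exports):

```r
library(magrittr)   # provides %>% (also loaded with dplyr/tidyverse)

# x %>% f(y): x becomes the first argument, i.e. sum(c(1, 2, 3), 10)
res1 <- c(1, 2, 3) %>% sum(10)
res1   # 16

# x %>% f(y, .): the dot redirects x into the second argument of grepl()
res2 <- c("apple", "banana") %>% grepl("an", .)
res2   # FALSE TRUE
```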
Piping not only presents analytic operations as a sequence of steps, but also reduces the amount of code and generally makes it more readable. Learning to write good pipelines is time well spent.
Data wrangling with dplyr and tidyr can be organized into the following steps:
- Shaping the data
- Subsetting observations and/or variables
- Creating new variables
- Combining data sets
- Summarization
Summarizing data is discussed as a separate step in Chapter 6.
4.3 Shaping Data into Tabular Form
The goal of shaping the data is to change its structure into a tabular row-column form where each variable is in a separate column and each observation is in its own row.
Consider the classic airline passenger time series data from Box, Jenkins, and Reinsel (1976), showing monthly totals of international airline passengers between 1949 and 1960 (in thousands).
AirPassengers

     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
The data are arranged very neatly, each year in a separate row, each month in a separate column. This is a great layout for time series analysis but maybe not the best layout for general analytic processing. The analysis variables are year, month, and the passenger count. Restructuring the AirPassengers data into a data frame with variables for year and count is easy because this data set is stored as an R time series.
str(AirPassengers)

 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
dat <- data.frame(count=as.matrix(AirPassengers),
                  time=time(AirPassengers),
                  year=floor(time(AirPassengers)))
head(dat)

  count     time year
1 112 1949.000 1949
2 118 1949.083 1949
3 132 1949.167 1949
4 129 1949.250 1949
5 121 1949.333 1949
6 135 1949.417 1949
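The month variable mentioned above can be recovered from the time series as well; a short sketch using stats::cycle(), which returns each observation's position within its yearly cycle:

```r
# cycle() gives the month index (1-12) of each monthly observation
month <- as.integer(cycle(AirPassengers))
head(month, 12)   # 1 2 3 4 5 6 7 8 9 10 11 12
```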
Gapminder Data
In general, the structure of a non-tabular format is not known to R and must be wrangled by the user. Consider the famous Gapminder dataset. The Gapminder Foundation is a Swedish non-profit that promotes the United Nations sustainability goals through use of statistics. The Gapminder data set tracks economic and social indicators like life expectancy and the GDP per capita of countries over time.
A wide format of the Gapminder data is available on GitHub here. We are following the material in this excellent online resource in wrangling the data in the wide format.
gap_wide <- read.csv(file="data/gapminder_wide.csv")
str(gap_wide)

'data.frame': 142 obs. of 38 variables:
$ continent : chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ gdpPercap_1952: num 2449 3521 1063 851 543 ...
$ gdpPercap_1957: num 3014 3828 960 918 617 ...
$ gdpPercap_1962: num 2551 4269 949 984 723 ...
$ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
$ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
$ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
$ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
$ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
$ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
$ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
$ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
$ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
$ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
$ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
$ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
$ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
$ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
$ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
$ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
$ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
$ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
$ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
$ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
$ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
$ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
$ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
$ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
$ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
$ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
$ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
$ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
$ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
$ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
$ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
$ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
$ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
The data are stored in wide format: annual values for the variables GDP, life expectancy, and population appear in separate columns. This is not unlike how the AirPassengers data are displayed, but gap_wide is not a time series object. The desired (“tidy”) way of structuring the data, where each variable is a column and each observation is a row, is a tabular structure with variables
- Continent
- Country
- Year
- GDP
- Life Expectancy
- Population
For each combination of continent and country there will be 12 observations, one for each year.
Gather columns
To move from the wide to the desired long format, we use the tidyr::gather function. To do the opposite, moving from long to wide format, use the tidyr::spread function. Restructuring the Gapminder data set from the wide format into the desired format takes several steps.
The first step is to gather the columns that contain the values for GDP, life expectancy, and population into separate rows. Those are all columns in gap_wide except continent and country, so we can exclude those from the gathering operation with -c(continent, country).
gap_step1 <- gap_wide %>%
gather(key=vartype_year,
value=values,
-c(continent, country))
head(gap_step1)

  continent      country   vartype_year    values
1 Africa Algeria gdpPercap_1952 2449.0082
2 Africa Angola gdpPercap_1952 3520.6103
3 Africa Benin gdpPercap_1952 1062.7522
4 Africa Botswana gdpPercap_1952 851.2411
5 Africa Burkina Faso gdpPercap_1952 543.2552
6 Africa Burundi gdpPercap_1952 339.2965
tail(gap_step1)

     continent        country vartype_year   values
5107 Europe Sweden pop_2007 9031088
5108 Europe Switzerland pop_2007 7554661
5109 Europe Turkey pop_2007 71158647
5110 Europe United Kingdom pop_2007 60776238
5111 Oceania Australia pop_2007 20434176
5112 Oceania New Zealand pop_2007 4115771
The resulting data frame has 5,112 rows and 4 columns. How does the number of rows come about?

- The original data has 142 rows, one for each combination of continent and country.
- Three variables (gdpPercap, lifeExp, pop) are measured for 12 years, so 36 columns are gathered.
- Gathering 36 columns for each of the 142 rows yields 142 × 36 = 5,112 rows.
Separating character variables
The gather function turns all variables into key-value pairs; the key is the name of the variable. The column that contains the names of the variables after the gathering is called vartype_year. In the next step tidyr::separate is called to split the character column vartype_year into two columns, one for the variable type and one for the year. The convert=TRUE option attempts to convert the data type of the new columns from character to numeric; this fails for the vartype column but succeeds for the year column.
gap_step2 <- gap_wide %>%
gather(key =vartype_year,
value=values,
-c(continent, country)) %>%
separate(vartype_year,
into =c('vartype','year'),
sep ="_",
convert=TRUE)
str(gap_step2)

'data.frame': 5112 obs. of 5 variables:
$ continent: chr "Africa" "Africa" "Africa" "Africa" ...
$ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
$ vartype : chr "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
$ year : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
$ values : num 2449 3521 1063 851 543 ...
head(gap_step2)

  continent      country   vartype year    values
1 Africa Algeria gdpPercap 1952 2449.0082
2 Africa Angola gdpPercap 1952 3520.6103
3 Africa Benin gdpPercap 1952 1062.7522
4 Africa Botswana gdpPercap 1952 851.2411
5 Africa Burkina Faso gdpPercap 1952 543.2552
6 Africa Burundi gdpPercap 1952 339.2965
tail(gap_step2)

     continent        country vartype year   values
5107 Europe Sweden pop 2007 9031088
5108 Europe Switzerland pop 2007 7554661
5109 Europe Turkey pop 2007 71158647
5110 Europe United Kingdom pop 2007 60776238
5111 Oceania Australia pop 2007 20434176
5112 Oceania New Zealand pop 2007 4115771
Spreading rows into columns
We are almost there, but not quite. We now have columns for continent, country, and year, but the values column contains values of different types: 1,704 observations for GDP, 1,704 observations for life expectancy, and 1,704 observations for population. To create separate columns from the rows we can reverse the gather operation with the spread function—it splits key-value pairs across multiple columns. The entries in the vartype column are used by the spreading operation as the names of the new columns.
gapminder <-
gap_wide %>%
gather(key =vartype_year,
value=values,
-c(continent, country)) %>%
separate(vartype_year,
into =c('vartype','year'),
sep ="_",
convert=TRUE) %>%
spread(vartype, values)
head(gapminder)

  continent country year gdpPercap lifeExp      pop
1 Africa Algeria 1952 2449.008 43.077 9279525
2 Africa Algeria 1957 3013.976 45.685 10270856
3 Africa Algeria 1962 2550.817 48.303 11000948
4 Africa Algeria 1967 3246.992 51.407 12760499
5 Africa Algeria 1972 4182.664 54.518 14760787
6 Africa Algeria 1977 4910.417 58.014 17152804
tail(gapminder)

     continent     country year gdpPercap lifeExp     pop
1699 Oceania New Zealand 1982 17632.41 73.840 3210650
1700 Oceania New Zealand 1987 19007.19 74.320 3317166
1701 Oceania New Zealand 1992 18363.32 76.330 3437674
1702 Oceania New Zealand 1997 21050.41 77.550 3676187
1703 Oceania New Zealand 2002 23189.80 79.110 3908037
1704 Oceania New Zealand 2007 25185.01 80.204 4115771
In summary, we used gather to create one key-value pair of the variable_year columns and values in step 1, separated out the year in step 2, and spread the remaining variable types back out into columns in step 3.
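As an aside, gather, separate, and spread have since been superseded in tidyr by pivot_longer and pivot_wider. Assuming tidyr 1.1 or later, the same restructuring can be sketched in a single pivot per direction, shown here on a small stand-in with the same column layout as gap_wide (the Sweden numbers are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Two-row stand-in shaped like gap_wide (Sweden values are hypothetical)
gw <- data.frame(continent = c("Africa", "Europe"),
                 country   = c("Algeria", "Sweden"),
                 gdpPercap_1952 = c(2449, 8528),
                 gdpPercap_1957 = c(3014, 9912),
                 pop_1952 = c(9279525, 7124673),
                 pop_1957 = c(10270856, 7363802))

gw_tidy <- gw %>%
  pivot_longer(cols = -c(continent, country),
               names_to = c("vartype", "year"),
               names_sep = "_",
               names_transform = list(year = as.integer)) %>%
  pivot_wider(names_from = vartype, values_from = value)

gw_tidy   # 4 rows: one per country-year, with gdpPercap and pop columns
```

pivot_longer splits the column names into vartype and year in one pass, so the separate step is no longer needed.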
4.4 Subsetting
A subset operation reduces the number of observations (rows) or variables (columns) of a data set. While filter is the most important function for subsetting rows, there are a number of other functions in dplyr that reduce the rows. To subset columns with dplyr, on the other hand, there is only one function, dplyr::select.
Subsetting Rows
The following statements create a data frame of dimension 100 x 2, containing two columns of integers randomly selected from 1–10 with replacement:
set.seed(76)
dat <- data.frame(
x = sample(10, 100, rep = TRUE),
y = sample(10, 100, rep = TRUE)) %>%
glimpse()

Rows: 100
Columns: 2
$ x <int> 5, 1, 2, 10, 7, 5, 6, 10, 8, 6, 4, 6, 3, 1, 8, 5, 7, 6, 4, 1, 3, 10,…
$ y <int> 3, 5, 9, 8, 7, 10, 4, 1, 10, 6, 9, 6, 2, 9, 3, 8, 9, 7, 4, 3, 2, 3, …
filter
filter extracts rows that meet a logical condition. The following statement selects the rows for which \(x \in \{1,5\}\) and \(y \in \{3,8\}\).
dat %>% filter(x %in% c(1,5) & y %in% c(3,8))

  x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3
You can also list the conditions that are combined with logical “and”, separated with commas:
dat %>% filter(x %in% c(1,5), y %in% c(3,8))

  x y
1 5 3
2 5 8
3 1 3
4 5 8
5 5 8
6 5 3
The following statement extracts the rows where the value of y exceeds 3.5 times its standard deviation:
dat %>% filter(y > 3.5*sd(y))

   x  y
1 5 10
2 8 10
3 7 10
4 1 10
5 8 10
6 1 10
7 9 10
8 9 10
9 6 10
10 7 10
11 4 10
12 5 10
distinct

Another subsetting operation is to remove duplicate observations from a data set with distinct. When applied to a subset of the columns, distinct returns the combinations of their unique values in the data.
dat %>% distinct(x)

    x
1 5
2 1
3 2
4 10
5 7
6 6
7 8
8 4
9 3
10 9
If you specify .keep_all=TRUE, all other variables in the data set are kept as well. If multiple combinations occur, the function retains the first occurrence.
dat %>% distinct(x, .keep_all=TRUE)

    x  y
1 5 3
2 1 5
3 2 9
4 10 8
5 7 7
6 6 4
7 8 10
8 4 9
9 3 2
10 9 3
To remove all duplicates in a data frame, simply call distinct on the data frame.
nuniq <- dat %>% distinct() %>% summarize(count=n())
nuniq

  count
1 65
There are 65 unique rows of data in the dat data frame.
Other subsetting functions
Other dplyr functions that subset rows are

- dplyr::sample_frac: randomly select a proportion of the rows
- dplyr::sample_n: randomly select a fixed number of rows
- dplyr::slice: select rows by position, for example slice(dat, 4:10) extracts rows 4–10
- dplyr::slice_head: select the first rows
- dplyr::slice_tail: select the last rows
- dplyr::slice_min: select rows with the smallest values
- dplyr::slice_max: select rows with the largest values
For example, the next statement selects the observations with the five largest values of the Sepal.Width variable in the iris data set. Six rows are returned because of a tie for fifth place at Sepal.Width = 3.9.
iris %>% slice_max(Sepal.Width, n=5)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.7 4.4 1.5 0.4 setosa
2 5.5 4.2 1.4 0.2 setosa
3 5.2 4.1 1.5 0.1 setosa
4 5.8 4.0 1.2 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 5.4 3.9 1.3 0.4 setosa
Subsetting (selecting) Columns
select
To subset columns there is only one function in dplyr, select. It is very versatile, however, because of its many helper functions.
The basic usage is to list the column names being selected:
gapminder %>% select(gdpPercap, lifeExp) %>%
summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))

     mnGDP   sdLife
1 7215.327 12.91711
You can also specify an exclusion with a negative selection
gapminder %>% select(-country, -continent, -pop) %>%
summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))

     mnGDP   sdLife
1 7215.327 12.91711
or
gapminder %>% select(-c(country, continent, pop)) %>%
summarize(mnGDP=mean(gdpPercap), sdLife=sd(lifeExp))

     mnGDP   sdLife
1 7215.327 12.91711
Helper functions
Here are important helper functions for dplyr::select:

- select(contains(".")): select columns whose name contains a character string
- select(ends_with("Length")): select columns whose name ends in the specified string
- select(starts_with("Sepal")): select columns whose name starts with the specified string
- select(everything()): select all columns
- select(matches(".t.")): select columns whose name matches a regular expression
- select(num_range("x", 1:5)): select the columns named x1, x2, …, x5
- select(one_of("Species", "Genus")): select columns whose names are in the specified group of names
- select(Sepal.Length:Petal.Width): select all columns between Sepal.Length and Petal.Width, inclusive
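For example, a quick sketch with starts_with on the iris data:

```r
library(dplyr)

# starts_with() picks both sepal measurements in one call
sepal_cols <- iris %>% select(starts_with("Sepal")) %>% names()
sepal_cols   # "Sepal.Length" "Sepal.Width"
```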
4.5 Creating New Variables
The principal function to create new variables in a data frame is dplyr::mutate. Variations are transmute, which drops the original columns, and mutate_each (superseded in current dplyr by mutate combined with across()), which applies a function to every column.
The following statements compute the GDP as the product of per-capita GDP and population size and find the top-5 countries by GDP in Asia in 2007:
gapminder %>%
filter(continent == "Asia", year == 2007) %>%
mutate(GDP = gdpPercap * pop) %>%
select(country, year, gdpPercap, GDP) %>%
slice_max(GDP, n=5)

      country year gdpPercap          GDP
1 China 2007 4959.115 6.539501e+12
2 Japan 2007 31656.068 4.035135e+12
3 India 2007 2452.210 2.722925e+12
4 Korea Rep. 2007 23348.140 1.145105e+12
5 Iran 2007 11605.714 8.060583e+11
The variable GDP created in this operation is transient. It is instantiated for the duration of the pipeline and is not added to the gapminder data frame. If you want to add a new variable to an existing data frame, you need to assign the result of the pipeline to an object.
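A minimal sketch of making the new variable permanent, using a small stand-in data frame with the same columns (hypothetical values):

```r
library(dplyr)

df <- data.frame(gdpPercap = c(2449, 3014),
                 pop = c(9279525, 10270856))

# Assigning the pipeline result back adds GDP to the data frame for good
df <- df %>% mutate(GDP = gdpPercap * pop)

names(df)   # "gdpPercap" "pop" "GDP"
```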
4.6 Combining Data Sets
The process of combining tables is based on joins, set operations, or bindings. A join uses the values in specific columns of one data set to match records with another data set. A set operation combines or reduces the rows of data sets that share the same columns, treating each row as an element of a set. Appending the rows of one table to another table, or the columns of one table to another table, are binding operations. What happens during an append to columns that exist in one data set but not in the other depends on the implementation. Similarly, what happens to columns that share the same name when tables are merged horizontally depends on the implementation.
To show the various join and set operations, let’s create three small data frames of cities and weather information.
capitals1 <- data.frame(city=c('Amsterdam','Berlin'),
country=c('NL','Germany'))
capitals2 <- data.frame(city=c('Amsterdam','Washington, D.C.'),
country=c('NL','U.S.A.'))
weather1 <- data.frame(city=c('Amsterdam','Seattle'),
degrees=c(10,8),
date=c("2022-10-14","2022-10-12"))
capitals1

       city country
1 Amsterdam NL
2 Berlin Germany
capitals2

              city country
1 Amsterdam NL
2 Washington, D.C. U.S.A.
weather1

       city degrees       date
1 Amsterdam 10 2022-10-14
2 Seattle 8 2022-10-12
Set Operations
We distinguish three set operations: the intersection of rows that appear in two data sets, the union of the rows, and the set difference of the rows. dplyr supports set operations with the following functions:
- intersect(x, y): finds all rows in both x and y.
- union(x, y): finds all rows in either x or y, excluding duplicates.
- union_all(x, y): finds all rows in either x or y, including duplicates.
- setdiff(x, y): finds all rows in x that are not in y.
- symdiff(x, y): computes the symmetric difference, i.e., all rows in x that are not in y and all rows in y that are not in x.
- setequal(x, y): returns TRUE if x and y contain the same rows (ignoring order).
Note that except for union_all the functions that return rows remove duplicates in x and y.
For the two data frames of capitals, here are the results of the various set operations.
dplyr::intersect(capitals1, capitals2)

       city country
1 Amsterdam      NL

dplyr::setdiff(capitals1, capitals2)

    city country
1 Berlin Germany

dplyr::union(capitals1, capitals2)

              city country
1        Amsterdam      NL
2           Berlin Germany
3 Washington, D.C.  U.S.A.

dplyr::union_all(capitals1, capitals2)

              city country
1        Amsterdam      NL
2           Berlin Germany
3        Amsterdam      NL
4 Washington, D.C.  U.S.A.

dplyr::symdiff(capitals1, capitals2)

              city country
1           Berlin Germany
2 Washington, D.C.  U.S.A.
Joins
Set operations combine or reduce rows of data (vertically). Join operations combine data sets horizontally. Joins typically are based on the values in columns of the data sets to find matches.
Joins are categorized as outer or inner joins depending on whether rows without matches are returned. An outer join returns rows that do not have any matches, whereas an inner join returns only rows that are paired. The two data frames in a join are called the left and right sides of the relation, and outer joins are further classified as
Left outer join: all rows from the left side of the relation appear at least once.
Right outer join: all rows from the right side of the relation appear at least once.
Full outer join: all rows from both sides of the relation appear at least once.
To demonstrate the joins in dplyr let’s set up some simple tables:
weather <- data.frame(name=c('San Francisco','San Francisco','Hayward'),
temp_lo=c(46,43,37),
temp_hi=c(50,57,54),
prcp=c(0.25,0.0,NA),
date=c('1994-11-27','1994-11-29','1994-11-29'))
cities <- data.frame(name=c('San Francisco'),
lat=c(-194.0),
lon=c(53.0))
weather

           name temp_lo temp_hi prcp       date
1 San Francisco 46 50 0.25 1994-11-27
2 San Francisco 43 57 0.00 1994-11-29
3 Hayward 37 54 NA 1994-11-29
cities

           name  lat lon
1 San Francisco -194 53
An inner join between the data frames on the columns that contain the city names will match the records for San Francisco:
dplyr::inner_join(weather, cities, by="name")

           name temp_lo temp_hi prcp       date  lat lon
1 San Francisco 46 50 0.25 1994-11-27 -194 53
2 San Francisco 43 57 0.00 1994-11-29 -194 53
Note that the values for lat and lon are repeated for every row in the weather table that matches the join in the relation. Because this is an inner join and the weather table had no matching row for city Hayward, this city does not appear in the join result. We can change that by modifying the type of join to a left outer join:
dplyr::left_join(weather, cities, by="name")

           name temp_lo temp_hi prcp       date  lat lon
1 San Francisco 46 50 0.25 1994-11-27 -194 53
2 San Francisco 43 57 0.00 1994-11-29 -194 53
3 Hayward 37 54 NA 1994-11-29 NA NA
Because the join is an outer join, rows that do not have matches in the relation are returned. Because the outer join is a left join, every row on the left side of the relation is returned (at least once). If you change the left- and right-hand side of the relation you can achieve the same result by using a right outer join:
dplyr::right_join(cities, weather, by="name")

           name  lat lon temp_lo temp_hi prcp       date
1 San Francisco -194 53 46 50 0.25 1994-11-27
2 San Francisco -194 53 43 57 0.00 1994-11-29
3 Hayward NA NA 37 54 NA 1994-11-29
The left join retains all observations in the left data frame (first argument). The right join retains all observations in the right data frame (second argument).
Now let’s add another record to the cities data frame without a matching record in the weather data frame:
cities <- dplyr::bind_rows(cities,data.frame(name="New York",
lat=40.7,
lon=-73.9))
cities

           name    lat   lon
1 San Francisco -194.0 53.0
2 New York 40.7 -73.9
A full outer join between the cities and weather data frames ensures that rows from both sides of the relation appear at least once:
dplyr::full_join(cities, weather, by="name")

           name    lat   lon temp_lo temp_hi prcp       date
1 San Francisco -194.0 53.0 46 50 0.25 1994-11-27
2 San Francisco -194.0 53.0 43 57 0.00 1994-11-29
3 New York 40.7 -73.9 NA NA NA <NA>
4 Hayward NA NA 37 54 NA 1994-11-29
Bindings
For data scientists working with rectangular data frames in which observations have a natural order, merging data horizontally is a standard operation. Observations are matched by position and not according to the values in key columns. In the world of relational databases such a merge is called a positional join and is a somewhat unnatural operation, because relational tables do not rely on a natural ordering of the data; they are based on keys and indices.

When working with statistical data sets, merging by position is common because the data often do not have keys. The positional join, or column binding, matches data frames row by row; the inputs must have the same number of rows:
dplyr::bind_cols(capitals1, weather1)

New names:
• `city` -> `city...1`
• `city` -> `city...3`

   city...1 country  city...3 degrees       date
1 Amsterdam NL Amsterdam 10 2022-10-14
2 Berlin Germany Seattle 8 2022-10-12
Note that both data frames contribute a city variable; the two columns are renamed to resolve the name collision.
A similar operation, binding rows, stacks one data frame on top of another. The result from dplyr::bind_rows contains all columns that appear in any of the inputs; columns missing from an input are filled with NA.
dplyr::bind_rows(capitals1, capitals2)

              city country
1 Amsterdam NL
2 Berlin Germany
3 Amsterdam NL
4 Washington, D.C. U.S.A.
dplyr::bind_rows(capitals1, weather1)

       city country degrees       date
1 Amsterdam NL NA <NA>
2 Berlin Germany NA <NA>
3 Amsterdam <NA> 10 2022-10-14
4 Seattle <NA> 8 2022-10-12