11 Data Access
In this section we use Python libraries to access data in the major file formats presented in a previous chapter (CSV, JSON, Parquet, and ORC) and convert the files into DataFrames. The Pandas DataFrame format is common but there are others to consider:
Polars DataFrames are highly parallelized, support more efficient string handling than Pandas, and allow both lazy and eager execution.
Arrow tables use a highly efficient in-memory columnar format for fast analytics.
Python has built-in functions for reading some data formats, for example, csv.reader() for CSV files. Here, however, we use the pandas and pyarrow packages to read and write files in common formats. Pandas supports an impressive list of I/O formats, and data sets likely end up as pandas DataFrames anyway, so we might as well convert from the file format directly to a DataFrame. The pyarrow library is becoming increasingly important in data engineering for data science; Pandas uses the pyarrow engine for several of the important formats, e.g., Parquet files. In Pandas 2.0 you can choose whether to use a NumPy-based engine or a pyarrow-based engine for data manipulation and analysis.
Fortunately, the methods for reading data in Pandas and Polars are very similar, so it is relatively easy to switch from one DataFrame representation to another. For example, the following statements show how to read a CSV file into a Pandas and a Polars DataFrame.
import pandas as pd
import polars as pl

pandas_df = pd.read_csv("../datasets/iris.csv")
polars_df = pl.read_csv("../datasets/iris.csv")
Once you start working with DataFrames the syntax can be quite different, however.
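As a quick illustration of that difference, here is a minimal sketch of the same row filter in both libraries. The column name sepal_length is an assumption about the header of iris.csv and may need to be adjusted.
# Filter rows in Pandas (boolean indexing) vs. Polars (filter with an expression).
# The column name "sepal_length" is hypothetical; match it to the actual CSV header.
pandas_subset = pandas_df[pandas_df["sepal_length"] > 5.0]
polars_subset = polars_df.filter(pl.col("sepal_length") > 5.0)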
11.1 Accessing Local Files
CSV files
To convert CSV files to a pandas DataFrame use the pandas read_csv()
method. The default behavior is to look for a header line with column names, to interpret commas as field delimiters, to skip blank lines in the file, and to apply utf-8 encoding.
Consider the following file, fitness.csv:
data> head fitness.csv
Age,Weight,Oxygen,RunTime,RestPulse,RunPulse,MaxPulse
44,89.47,44.609,11.37,62,178,182
40,75.07,45.313,10.07,62,185,185
44,85.84,54.297, 8.65,45,156,168
42,68.15,59.571, 8.17,40,166,172
38,89.02,49.874, 9.22,55,178,180
The first line contains comma-separated names for the columns. The following lines contain the data for the attributes, one record per line. To convert the CSV file to a pandas DataFrame use the read_csv() method:
import numpy as np
df = pd.read_csv("../datasets/fitness.csv")
print("Default result from read_csv")
print(df.head())
df.describe(include=[np.float64])
Default result from read_csv
Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse
0 44 89.47 44.609 11.37 62 178 182
1 40 75.07 45.313 10.07 62 185 185
2 44 85.84 54.297 8.65 45 156 168
3 42 68.15 59.571 8.17 40 166 172
4 38 89.02 49.874 9.22 55 178 180
|  | Weight | Oxygen | RunTime |
|---|---|---|---|
count | 31.000000 | 31.000000 | 31.000000 |
mean | 77.444516 | 47.375806 | 10.586129 |
std | 8.328568 | 5.327231 | 1.387414 |
min | 59.080000 | 37.388000 | 8.170000 |
25% | 73.200000 | 44.964500 | 9.780000 |
50% | 77.450000 | 46.774000 | 10.470000 |
75% | 82.325000 | 50.131000 | 11.270000 |
max | 91.630000 | 60.055000 | 14.030000 |
You can assign your own variable names with the names= option:
df2 = pd.read_csv("../datasets/fitness.csv",
                  header=0,
                  names=["Age", "Wgt", "Oxy", "RT", "ReP", "RuP", "MP"])
df2.head()
|  | Age | Wgt | Oxy | RT | ReP | RuP | MP |
|---|---|---|---|---|---|---|---|
0 | 44 | 89.47 | 44.609 | 11.37 | 62 | 178 | 182 |
1 | 40 | 75.07 | 45.313 | 10.07 | 62 | 185 | 185 |
2 | 44 | 85.84 | 54.297 | 8.65 | 45 | 156 | 168 |
3 | 42 | 68.15 | 59.571 | 8.17 | 40 | 166 | 172 |
4 | 38 | 89.02 | 49.874 | 9.22 | 55 | 178 | 180 |
String variables are indicated in the CSV file with quotes:
data> head herding-cats.csv
"address_full","street","coat","sex","age","weight","fixed","wander_dist","roamer","cat_id"
"15 Fillmer Ave Los Gatos 95030","15 Fillmer Ave","brown","male",5.594,5.016,1,0.115,"yes",1
"244 Harding Ave Los Gatos 95030","244 Harding Ave","tabby","male",6.852,6.314,1,0.129,"yes",2
"16570 Marchmont Dr Los Gatos 95032","16570 Marchmont Dr","maltese","female",4.081, 2.652,0, 0.129,"yes",3
"100 Stonybrook Rd Los Gatos 95032","100 Stonybrook Rd","tabby","female",3.806,3.413,1, 0.107,"yes",4
cats = pd.read_csv("../datasets/herding-cats.csv",
                   index_col="cat_id")
print(cats.head())
address_full street coat \
cat_id
1 15 Fillmer Ave Los Gatos 95030 15 Fillmer Ave brown
2 244 Harding Ave Los Gatos 95030 244 Harding Ave tabby
3 16570 Marchmont Dr Los Gatos 95032 16570 Marchmont Dr maltese
4 100 Stonybrook Rd Los Gatos 95032 100 Stonybrook Rd tabby
5 266 Kennedy Rd Los Gatos 95032 266 Kennedy Rd calico
sex age weight fixed wander_dist roamer
cat_id
1 male 5.594 5.016 1 0.115 yes
2 male 6.852 6.314 1 0.129 yes
3 female 4.081 2.652 0 0.129 yes
4 female 3.806 3.413 1 0.107 yes
5 male 5.164 6.820 1 0.039 no
Pandas is smart about assigning data types to columns; note that string columns are stored as data type object.
cats.info()
<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, 1 to 400
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 address_full 400 non-null object
1 street 394 non-null object
2 coat 400 non-null object
3 sex 400 non-null object
4 age 400 non-null float64
5 weight 400 non-null float64
6 fixed 400 non-null int64
7 wander_dist 400 non-null float64
8 roamer 400 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 31.2+ KB
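The object columns above reflect the default NumPy-backed storage. Related to the engine choice mentioned at the start of this chapter, here is a minimal sketch of reading the same file with pyarrow-backed data types instead; it assumes pandas 2.x with the pyarrow package installed.
# Request pyarrow-backed dtypes; string columns become string[pyarrow] rather than object.
# Assumes pandas >= 2.0 and pyarrow are installed.
cats_arrow = pd.read_csv("../datasets/herding-cats.csv",
                         index_col="cat_id",
                         dtype_backend="pyarrow")
cats_arrow.info()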
You can overwrite the data types by supplying either a default type or a dictionary of types. To change the data type for the fixed
column, for example:
cats = pd.read_csv("../datasets/herding-cats.csv",
                   dtype={'fixed': np.float16})
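The other case mentioned above, a single default type applied to every column, looks like the following sketch. Reading everything as strings is rarely what you want for numeric analysis, but it is handy for inspecting messy files.
# Apply one default dtype to all columns: every value is read as a string.
cats_str = pd.read_csv("../datasets/herding-cats.csv", dtype=str)
cats_str.dtypes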
Missing values (see below) can be indicated in CSV files in several ways. The common technique is to leave the entry for the unobserved value empty (no spaces). In the following file the values for sale and appraisal are missing in the third data row. The fifth data row contains values for land and sale only.
data> head landsales2.csv
land, improve, total, sale, appraisal
30000,64831,94831,118500,1.25
30000,50765,80765,93900,1.16
46651,18573,65224,,
45990,91402,137392,184000,1.34
42394,,,168000,
,133351,,169000,
63596,2182,65778,,
56658,153806,210464,255000,1.21
read_csv() converts the empty entries in this file correctly to missing values, using the NaN (not-a-number) sentinel for the unobserved data:
land = pd.read_csv("../datasets/landsales2.csv")
land
|  | land | improve | total | sale | appraisal |
|---|---|---|---|---|---|
0 | 30000.0 | 64831.0 | 94831.0 | 118500.0 | 1.25 |
1 | 30000.0 | 50765.0 | 80765.0 | 93900.0 | 1.16 |
2 | 46651.0 | 18573.0 | 65224.0 | NaN | NaN |
3 | 45990.0 | 91402.0 | 137392.0 | 184000.0 | 1.34 |
4 | 42394.0 | NaN | NaN | 168000.0 | NaN |
5 | NaN | 133351.0 | NaN | 169000.0 | NaN |
6 | 63596.0 | 2182.0 | 65778.0 | NaN | NaN |
7 | 56658.0 | 153806.0 | 210464.0 | 255000.0 | 1.21 |
8 | 51428.0 | 72451.0 | 123879.0 | NaN | NaN |
9 | 93200.0 | NaN | NaN | 422000.0 | NaN |
10 | 76125.0 | 78172.0 | 275297.0 | 290000.0 | 1.14 |
11 | 154360.0 | 61934.0 | 216294.0 | 237000.0 | 1.10 |
12 | 65376.0 | NaN | NaN | 286500.0 | NaN |
13 | 42400.0 | NaN | NaN | NaN | NaN |
14 | 40800.0 | 92606.0 | 133406.0 | 168000.0 | 1.26 |
In addition to empty entries, read_csv() interprets a number of entries as missing (unobserved) values: “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null”. Beyond those, CSV files might use other special values to indicate missing data, for example, when they are exported from other software packages. Suppose that the missing values are indicated by a dot (“.”) and that the columns in the CSV file are formatted with white space:
data> head landsales.csv
land, improve, total, sale, appraisal
30000, 64831, 94831, 118500, 1.25
30000, 50765, 80765, 93900, 1.16
46651, 18573, 65224, ., .
45990, 91402, 137392, 184000, 1.34
42394, ., ., 168000, .
., 133351, ., 169000, .
63596, 2182, 65778, ., .
56658, 153806, 210464, 255000, 1.21
51428, 72451, 123879, ., .
The file is much more human readable, but requires additional options to convert to the intended DataFrame:
land2 = pd.read_csv("../datasets/landsales.csv",
                    na_values=".",
                    skipinitialspace=True)
land2.head()
|  | land | improve | total | sale | appraisal |
|---|---|---|---|---|---|
0 | 30000.0 | 64831.0 | 94831.0 | 118500.0 | 1.25 |
1 | 30000.0 | 50765.0 | 80765.0 | 93900.0 | 1.16 |
2 | 46651.0 | 18573.0 | 65224.0 | NaN | NaN |
3 | 45990.0 | 91402.0 | 137392.0 | 184000.0 | 1.34 |
4 | 42394.0 | NaN | NaN | 168000.0 | NaN |
The na_values= option specifies additional values that should be recognized as missing values. The skipinitialspace= option tells read_csv() to ignore spaces following the delimiters. The result is as intended.
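Going in the other direction, writing a DataFrame back out is a one-liner. As a sketch, the following writes the cleaned data to a new CSV file; the output path is hypothetical.
# Write the cleaned DataFrame back to disk; index=False omits the row index column.
land2.to_csv("../datasets/landsales_clean.csv", index=False)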
JSON files
Pandas has great support for reading and writing JSON files. A general issue is that JSON is a hierarchical data format that allows nested data structures, whereas pandas DataFrames have a row-column layout. Fortunately, the pandas read_json() method has many options to shape nested JSON structures into DataFrames.
You can convert any JSON document into a Python dictionary with the load() function in the json library. Suppose we want to read this document, stored in ../datasets/JSON/simple1.json, into Python:
> cat ../datasets/JSON/simple1.json
{
"firstName": "Jane",
"lastName": "Doe",
"hobbies": ["running", "sky diving", "singing"],
"age": 35,
"children": [
{
"firstName": "Alice",
"age": 6
},
{
"firstName": "Bob",
"age": 8
}
]
}
import numpy as np
import pandas as pd
import json
with open("../datasets/JSON/simple1.json","r") as f:
    data = json.load(f)
data
{'firstName': 'Jane',
'lastName': 'Doe',
'hobbies': ['running', 'sky diving', 'singing'],
'age': 35,
'children': [{'firstName': 'Alice', 'age': 6},
{'firstName': 'Bob', 'age': 8}]}
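Once the document is an ordinary dictionary, you can pull nested pieces into a DataFrame yourself. As a small sketch, the list under the children key shown above converts directly:
# Build a DataFrame from the list of child records in the dictionary.
children_df = pd.DataFrame(data["children"])
children_df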
The next example reads a record-oriented JSON file (../datasets/JSON/simple.json) directly into a DataFrame. Each record consists of the fields id, name, math, statistics, and weight.
cat simple.json
[
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{
"id": "B00422",
"name": "Andrew",
"math": 89,
"statistics": 80,
"weight": 210.6
},
{
"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
"weight": 167.0
}
]
df = pd.read_json("../datasets/JSON/simple.json")
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null object
1 name 3 non-null object
2 math 3 non-null int64
3 statistics 3 non-null int64
4 weight 3 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 252.0+ bytes
|  | id | name | math | statistics | weight |
|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 |
1 | B00422 | Andrew | 89 | 80 | 210.6 |
2 | C004543 | Tobias | 79 | 90 | 167.0 |
If you’d rather assign data type string to the id and name fields than data type object, you can set the data type with:
df2 = pd.read_json("../datasets/JSON/simple.json",
                   dtype={"id": "string", "name": "string"})
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null string
1 name 3 non-null string
2 math 3 non-null int64
3 statistics 3 non-null int64
4 weight 3 non-null float64
dtypes: float64(1), int64(2), string(2)
memory usage: 252.0 bytes
How can we handle a nested JSON structure? The data in ../datasets/JSON/nested_array.json is:
data> cat nested_array.json
{
"school_name": "Virginia Tech",
"class": "Class of 2026",
"students": [
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{
"id": "B00422",
"name": "Andrew",
"math": 89,
"statistics": 80,
"weight": 210.6
},
{
"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
"weight": 167.0
}]
}
The students field has a record structure, but there are additional fields school_name and class that do not fit that structure. Converting the JSON file directly into a DataFrame does not produce the desired result:
df3 = pd.read_json('../datasets/JSON/nested_array.json')
df3
|  | school_name | class | students |
|---|---|---|---|
0 | Virginia Tech | Class of 2026 | {'id': 'A00123', 'name': 'David', 'math': 70, ... |
1 | Virginia Tech | Class of 2026 | {'id': 'B00422', 'name': 'Andrew', 'math': 89,... |
2 | Virginia Tech | Class of 2026 | {'id': 'C004543', 'name': 'Tobias', 'math': 79... |
To create the appropriate DataFrame we take two steps: convert the JSON file into a dictionary, then shape the dictionary into a DataFrame with json_normalize():
with open('../datasets/JSON/nested_array.json','r') as f:
    data = json.load(f)

df4 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["school_name", "class"])
df4
|  | id | name | math | statistics | weight | school_name | class |
|---|---|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 | Virginia Tech | Class of 2026 |
1 | B00422 | Andrew | 89 | 80 | 210.6 | Virginia Tech | Class of 2026 |
2 | C004543 | Tobias | 79 | 90 | 167.0 | Virginia Tech | Class of 2026 |
The meta= option lists the JSON fields that apply as metadata to each record in the result table. The record_path= option points at the field that contains the list of records.
Now suppose that in addition to the list of records the JSON metadata contains a more complex structure:
cat nested_obj_array.json
{
"school_name": "Virginia Tech",
"class": "Class of 2026",
"info": {
"president": "Timothy D. Sands",
"address": "Office of the President",
"social": {
"LinkedIn": "tim-sands-49b8b95",
"Twitter": "@VTSandsman"
}
},
"students": [
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
. . .
}]
}
You can specify which fields to extract from the metadata in the meta= option of json_normalize():
with open('../datasets/JSON/nested_obj_array.json','r') as f:
    data = json.load(f)

df5 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["school_name", "class",
                              ["info", "president"],
                              ["info", "social", "Twitter"]])
df5
|  | id | name | math | statistics | weight | school_name | class | info.president | info.social.Twitter |
|---|---|---|---|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
1 | B00422 | Andrew | 89 | 80 | 210.6 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
2 | C004543 | Tobias | 79 | 90 | 167.0 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
How are missing values handled? In the next example, the first student has a complete record, the second student is missing a statistics score, and the third student has no weight data. pd.json_normalize() fills in NaN for the unobserved data.
data = [
    {"students": [
{"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{"id": "B00422",
"name": "Andrew",
"math": 89,
"weight": 210.6
},
{"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
}
]
}
]
df6 = pd.json_normalize(data, record_path=["students"])
df6
|  | id | name | math | statistics | weight |
|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86.0 | 156.3 |
1 | B00422 | Andrew | 89 | NaN | 210.6 |
2 | C004543 | Tobias | 79 | 90.0 | NaN |
json_normalize() is less forgiving if fields in the metadata are unobserved; it will throw an error. To prevent the error and assign NaN values, use the errors='ignore' option. In the following example the teachers metadata field does not always have an l_name entry.
data = [
    {'class': 'STAT 5525',
     'student count': 60,
     'room': 'Yellow',
     'teachers': {
         'f_name': 'Elon',
         'l_name': 'Musk',
     },
     'students': [
         {'name': 'Tom', 'sex': 'M'},
         {'name': 'James', 'sex': 'M'},
     ]
    },
    {'class': 'STAT 5526',
     'student count': 45,
     'room': 'Blue',
     'teachers': {
         'f_name': 'Albert'
     },
     'students': [
         {'name': 'Tony', 'sex': 'M'},
         {'name': 'Jacqueline', 'sex': 'F'},
     ]
    },
]
df7 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["class", "room", ["teachers", "l_name"]],
                        errors='ignore')
df7
|  | name | sex | class | room | teachers.l_name |
|---|---|---|---|---|---|
0 | Tom | M | STAT 5525 | Yellow | Musk |
1 | James | M | STAT 5525 | Yellow | Musk |
2 | Tony | M | STAT 5526 | Blue | NaN |
3 | Jacqueline | F | STAT 5526 | Blue | NaN |
Parquet files
To read Parquet files into pandas DataFrames we have two choices: the pandas read_parquet() method or the read_table() method from pyarrow.parquet.
The data we work with in this section is stored in a parquet multi-file structure; that means the data are split into multiple files stored in a local directory:
Parquet> ll userdata
total 1120
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:36 userdata1.parquet
-rw-r--r--@ 1 olivers staff 110K Aug 31 14:36 userdata2.parquet
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:37 userdata3.parquet
-rw-r--r--@ 1 olivers staff 110K Aug 31 14:37 userdata4.parquet
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:37 userdata5.parquet
Each of the *.parquet files in the directory contains 1,000 observations.
Pandas’ read_parquet() reads the entire directory and combines the records from the multiple files into a single DataFrame.
df = pd.read_parquet("../datasets/Parquet/userdata", engine="pyarrow")
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 registration_dttm 5000 non-null datetime64[ns]
1 id 4999 non-null float64
2 first_name 5000 non-null object
3 last_name 5000 non-null object
4 email 5000 non-null object
5 gender 5000 non-null object
6 ip_address 5000 non-null object
7 cc 5000 non-null object
8 country 5000 non-null object
9 birthdate 5000 non-null object
10 salary 4689 non-null float64
11 title 5000 non-null object
12 comments 4966 non-null object
dtypes: datetime64[ns](1), float64(2), object(10)
memory usage: 507.9+ KB
|  | registration_dttm | id | salary |
|---|---|---|---|
count | 5000 | 4999.000000 | 4689.000000 |
mean | 2016-02-03 21:04:13.699399936 | 500.598720 | 150772.222890 |
min | 2016-02-03 00:00:07 | 1.000000 | 12068.960000 |
25% | 2016-02-03 08:51:55.750000128 | 251.000000 | 83480.380000 |
50% | 2016-02-03 20:01:17 | 501.000000 | 152877.190000 |
75% | 2016-02-04 07:42:54.500000 | 750.500000 | 215405.220000 |
max | 2016-02-04 23:59:55 | 1000.000000 | 286735.820000 |
std | NaN | 288.648331 | 78171.513062 |
The size of df confirms that all 5,000 observations were retrieved from the five component files.
With pyarrow we have more control over the files; for example, we can inquire about the schema and metadata of the component files.
import pyarrow.parquet as pq
pq_file = pq.ParquetFile("../datasets/Parquet/userdata/userdata1.parquet")
pq_file.metadata
pq_file.schema
<pyarrow._parquet.ParquetSchema object at 0x147814b00>
required group field_id=-1 hive_schema {
optional int96 field_id=-1 registration_dttm;
optional int32 field_id=-1 id;
optional binary field_id=-1 first_name (String);
optional binary field_id=-1 last_name (String);
optional binary field_id=-1 email (String);
optional binary field_id=-1 gender (String);
optional binary field_id=-1 ip_address (String);
optional binary field_id=-1 cc (String);
optional binary field_id=-1 country (String);
optional binary field_id=-1 birthdate (String);
optional double field_id=-1 salary;
optional binary field_id=-1 title (String);
optional binary field_id=-1 comments (String);
}
To convert all component files into a pandas DataFrame with pyarrow, use the read_table() method and send the result to to_pandas():
df2 = pq.read_table(source="../datasets/Parquet/userdata/").to_pandas()
df2.describe()
|  | registration_dttm | id | salary |
|---|---|---|---|
count | 5000 | 4999.000000 | 4689.000000 |
mean | 2016-02-03 21:04:13.699399936 | 500.598720 | 150772.222890 |
min | 2016-02-03 00:00:07 | 1.000000 | 12068.960000 |
25% | 2016-02-03 08:51:55.750000128 | 251.000000 | 83480.380000 |
50% | 2016-02-03 20:01:17 | 501.000000 | 152877.190000 |
75% | 2016-02-04 07:42:54.500000 | 750.500000 | 215405.220000 |
max | 2016-02-04 23:59:55 | 1000.000000 | 286735.820000 |
std | NaN | 288.648331 | 78171.513062 |
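Because Parquet is a columnar format, pyarrow can also read just the columns you need, which saves I/O on wide data sets. A minimal sketch, with column names taken from the schema listed above:
# Read only three columns from the multi-file Parquet data set.
subset = pq.read_table(source="../datasets/Parquet/userdata/",
                       columns=["id", "salary", "country"]).to_pandas()
subset.head()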
ORC files
Pandas (and pyarrow) supports ORC files, but working with ORC files is not quite as convenient as with Parquet files. The read_orc() method reads one ORC file rather than an entire directory, and it is not supported on Windows.
df = pd.read_orc("../datasets/ORC/userdata/userdata1.orc")
When data is stored in multi-file format, as in the parquet example, you need to concatenate the contents into a pandas DataFrame. The glob module is helpful to retrieve the list of file names that match a pattern.
import glob
= "../datasets/ORC/userdata"
folder_path = glob.glob(folder_path + "/*.orc")
file_list = pd.DataFrame(pd.read_orc(file_list[0]))
df_from_orc
for i in range(1,len(file_list)):
= pd.read_orc(file_list[i])
data = pd.DataFrame(data)
df = pd.concat([df_from_orc,df],axis=0) df_from_orc
You can also use the read_table() method for ORC files in pyarrow to read an ORC file into a pyarrow table and to_pandas() to convert the table to a pandas DataFrame:
from pyarrow import orc
df = orc.read_table(source="../datasets/ORC/userdata/userdata1.orc").to_pandas()
11.2 Working with a Database
As a data scientist you need to be comfortable working with databases. In analytical work, you will encounter schema-centric relational OLAP databases that use SQL (structured query language) to interact with the database and the data. SQL interfaces are standard for data warehouses and common for many data platforms and tools.
Data scientists generally prefer writing Python or R code over SQL. SQL is extremely powerful to express relational concepts and can make some data processing tasks much easier than rolling your own code. Also, databases are optimized for joins, merges, deep queries, aggregations, indexes, handling JSON, etc. It would be a shame not to take advantage of that power or to lose it behind the syntactic sugar of a tool’s API. Many employers have voiced concerns that SQL ranks high among the technical skills data scientists often lack. As the founder and chairman of a data science research and consultancy company put it:
There are a lot of things folks don’t agree on. But everyone agrees that as a data scientist you will need to retrieve data from a database, and you will need version control. Learn SQL and learn git.
We are covering SQL basics in Chapter 15 after sprinkling some SQL statements along the way.
If you want to work with SQL directly from Python, there are several alternatives.
SQLAlchemy
SQLAlchemy is a SQL toolkit for Python that gives Python developers the power of SQL. While some tools try to hide SQL and relational details behind syntactic sugar, SQLAlchemy’s philosophy is to fully expose the relational details of SQL within transparent tools. Specific databases, beyond those for which support is built into SQLAlchemy, are supported as dialects.
Ibis
Ibis is a portable Python dataframe library that supports multiple database backends. After making a connection to a backend database, you interact with the data using Python syntax rather than SQL statements. The Ibis function calls are translated into SQL for the specific backend under the covers. For example, the first statement in the next code block creates a Python object named penguins
through which you can interact with the database table of the same name. The second statement filters the data for a specific species.
penguins = con.table("penguins")
penguins.filter(penguins.species == "Adelie")
The SQL equivalent would be a SELECT statement with a WHERE clause:
SELECT * FROM penguins WHERE species = 'Adelie'
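The con object above is a backend connection. As a minimal sketch, and assuming the ibis-framework package is installed and a DuckDB database file containing a penguins table exists at the hypothetical path shown, the connection could be created like this:
import ibis

# Connect to a DuckDB database file; the path is hypothetical.
con = ibis.duckdb.connect("penguins.ddb")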
Having stable Python code for different database backends makes it easy to move from one database to another. During development of analytic code you might be working with SQLite while the production database is Oracle. The same Python code will work in both environments.
The default backend of Ibis is the embedded database DuckDB. If you are working with DuckDB you can issue SQL statements directly from Python or you can rely on its dedicated Python API.
DuckDB
DuckDB has all the ingredients we are looking for in a database for data science work:
Designed to support analytical queries (OLAP workloads) characterized by complex queries that process large amounts of data.
Columnar, vectorized storage engine. Columnar storage is more efficient for analytical queries, and vectorized execution is much faster than processing rows sequentially.
Provides transactional guarantees (ACID properties).
Deeply integrated with Python or R for efficient data analysis.
Easy to install and operate, without any external dependencies.
Open-source database that is free to use.
Can query CSV, JSON, Parquet files and DataFrames directly.
Supports a streaming model for analyzing large data sets that exceed memory capacity, e.g., streaming from Parquet files.
In-memory database with graceful on-disk operation.
Easy transition to the cloud through partnership with MotherDuck.
SQL code can be submitted from the CLI or through the Python API, so you can take advantage of the power of SQL.
Runs as an embedded database within a host process.
Because DuckDB is an embedded database, there is no database server to install and to maintain. Because the database runs within the host process, data can be transferred very quickly. And if you want to move the data to the cloud, you can do that easily with the MotherDuck serverless platform.
The Python package for DuckDB can run queries directly against a Pandas DataFrame without copying any data.
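As a short sketch of that capability (the DataFrame here is made up for illustration), DuckDB can reference a local Pandas DataFrame by its variable name inside a SQL query:
import duckdb
import pandas as pd

# A small, made-up DataFrame; DuckDB finds it by variable name in the query.
scores = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "score": [91, 84, 77]})
duckdb.sql("SELECT name, score FROM scores WHERE score > 80").df()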
Why is it called DuckDB? We let the folks who created it explain it:
Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck’s song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system. Also the logo designs itself.
For the Python user, installation is as simple as with any other Python package:
pip install duckdb
If you install with conda, the command is
conda install python-duckdb -c conda-forge
DuckDB is an in-memory database. If you want to persist the database across Python sessions, then you need to create a connection that stores the data in a single file on disk. Otherwise, the database will go away when the Python session ends. You can work without a connection to a database by using methods on the duckdb module. For example, the following statements directly query a CSV file using DuckDB:
import duckdb
duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
┌───────┬────────┬────────┬─────────┬───────────┬──────────┬──────────┐
│ Age │ Weight │ Oxygen │ RunTime │ RestPulse │ RunPulse │ MaxPulse │
│ int64 │ double │ double │ double │ int64 │ int64 │ int64 │
├───────┼────────┼────────┼─────────┼───────────┼──────────┼──────────┤
│ 54 │ 83.12 │ 51.855 │ 10.33 │ 50 │ 166 │ 170 │
│ 51 │ 69.63 │ 40.836 │ 10.95 │ 57 │ 168 │ 172 │
│ 51 │ 77.91 │ 46.672 │ 10.0 │ 48 │ 162 │ 168 │
│ 57 │ 73.37 │ 39.407 │ 12.63 │ 58 │ 174 │ 176 │
│ 54 │ 79.38 │ 46.08 │ 11.17 │ 62 │ 156 │ 165 │
│ 52 │ 76.32 │ 45.441 │ 9.63 │ 48 │ 164 │ 166 │
│ 51 │ 67.25 │ 45.118 │ 11.08 │ 48 │ 172 │ 172 │
│ 54 │ 91.63 │ 39.203 │ 12.88 │ 44 │ 168 │ 172 │
│ 51 │ 73.71 │ 45.79 │ 10.47 │ 59 │ 186 │ 188 │
│ 57 │ 59.08 │ 50.545 │ 9.93 │ 49 │ 148 │ 155 │
│ 52 │ 82.78 │ 47.467 │ 10.5 │ 53 │ 170 │ 172 │
├───────┴────────┴────────┴─────────┴───────────┴──────────┴──────────┤
│ 11 rows 7 columns │
└─────────────────────────────────────────────────────────────────────┘
The object returned by duckdb is a relation, a symbolic representation of the query that can be referenced in other queries. The next statements save the result of the previous query and reference the relation in the second query:
rel = duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49")
┌────────┬─────────┬───────────┐
│ Oxygen │ RunTime │ RestPulse │
│ double │ double │ int64 │
├────────┼─────────┼───────────┤
│ 46.672 │ 10.0 │ 48 │
│ 45.441 │ 9.63 │ 48 │
│ 45.118 │ 11.08 │ 48 │
│ 39.203 │ 12.88 │ 44 │
└────────┴─────────┴───────────┘
The results of a DuckDB query can be easily converted into other formats. fetchnumpy() fetches data as a dictionary of NumPy arrays, df() fetches a Pandas DataFrame, pl() fetches a Polars DataFrame, and arrow() fetches an Arrow table.
duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
|  | Oxygen | RunTime | RestPulse |
|---|---|---|---|
0 | 46.672 | 10.00 | 48 |
1 | 45.441 | 9.63 | 48 |
2 | 45.118 | 11.08 | 48 |
3 | 39.203 | 12.88 | 44 |
Reading files
To read the major data science file formats into DuckDB, use the read_csv(), read_parquet(), read_json(), and read_arrow() methods. For example, to read the simple JSON file from the previous section:
duckdb.read_json('../datasets/JSON/simple.json')
┌─────────┬─────────┬───────┬────────────┬────────┐
│ id │ name │ math │ statistics │ weight │
│ varchar │ varchar │ int64 │ int64 │ double │
├─────────┼─────────┼───────┼────────────┼────────┤
│ A00123 │ David │ 70 │ 86 │ 156.3 │
│ B00422 │ Andrew │ 89 │ 80 │ 210.6 │
│ C004543 │ Tobias │ 79 │ 90 │ 167.0 │
└─────────┴─────────┴───────┴────────────┴────────┘
To read a directory of Parquet files, you can use wildcard syntax in the path of the read_parquet() function:
duckdb.read_parquet('../datasets/Parquet/userdata/user*.parquet')
┌─────────────────────┬───────┬────────────┬───┬────────────┬───────────┬──────────────────────┬──────────────────────┐
│ registration_dttm │ id │ first_name │ … │ birthdate │ salary │ title │ comments │
│ timestamp │ int32 │ varchar │ │ varchar │ double │ varchar │ varchar │
├─────────────────────┼───────┼────────────┼───┼────────────┼───────────┼──────────────────────┼──────────────────────┤
│ 2016-02-03 16:07:46 │ 1 │ Ernest │ … │ │ 140639.36 │ │ │
│ 2016-02-03 21:52:07 │ 2 │ Anthony │ … │ 1/16/1998 │ 172843.61 │ Developer II │ 👾 🙇 💁 🙅 🙆 🙋 … │
│ 2016-02-03 02:22:19 │ 3 │ Ryan │ … │ 11/21/1978 │ 204620.66 │ Developer I │ ␢ │
│ 2016-02-03 04:20:04 │ 4 │ Brenda │ … │ 10/29/1998 │ 260474.12 │ GIS Technical Arch… │ │
│ 2016-02-03 00:15:16 │ 5 │ Jacqueline │ … │ 7/12/1959 │ 286038.78 │ Marketing Assistant │ │
│ 2016-02-03 19:48:14 │ 6 │ Paul │ … │ │ 241518.24 │ │ │
│ 2016-02-03 08:59:05 │ 7 │ Linda │ … │ 3/30/1988 │ 192756.38 │ Professor │ │
│ 2016-02-03 08:04:51 │ 8 │ Frances │ … │ │ 188511.28 │ │ <svg><script>0<1>a… │
│ 2016-02-03 08:12:33 │ 9 │ Jason │ … │ 7/29/1982 │ 238068.56 │ Web Designer III │ │
│ 2016-02-03 17:08:02 │ 10 │ Carolyn │ … │ 4/28/1977 │ 132718.26 │ Research Nurse │ │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ 2016-02-04 04:48:13 │ 991 │ Fred │ … │ 1/25/1975 │ 280835.07 │ Cost Accountant │ │
│ 2016-02-04 00:03:33 │ 992 │ Anna │ … │ 5/29/1962 │ 286181.88 │ Automation Special… │ │
│ 2016-02-04 07:05:48 │ 993 │ Donna │ … │ 6/3/1962 │ 245704.0 │ Senior Financial A… │ │
│ 2016-02-04 21:15:41 │ 994 │ Annie │ … │ │ 194931.47 │ │ │
│ 2016-02-04 05:05:12 │ 995 │ Carlos │ … │ 8/11/1986 │ 39424.88 │ Teacher │ │
│ 2016-02-04 21:45:05 │ 996 │ Stephanie │ … │ 5/9/1962 │ 273504.1 │ Chief Design Engin… │ │
│ 2016-02-04 09:20:21 │ 997 │ Kathy │ … │ 12/4/1986 │ 109525.48 │ Director of Sales │ │
│ 2016-02-04 00:26:05 │ 998 │ Louis │ … │ 11/20/1982 │ 13134.47 │ Office Assistant IV │ │
│ 2016-02-04 04:50:33 │ 999 │ Elizabeth │ … │ 3/14/1976 │ 207194.65 │ Research Assistant I │ │
│ 2016-02-04 16:14:42 │ 1000 │ Susan │ … │ 12/19/1999 │ 229961.89 │ Human Resources Ma… │ │
├─────────────────────┴───────┴────────────┴───┴────────────┴───────────┴──────────────────────┴──────────────────────┤
│ 5000 rows (20 shown) 13 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Connecting to the database
If you make an explicit connection to the database with the connect() method, the arguments determine whether the database resides completely in memory (and is destroyed when the Python process exits) or whether the database persists on disk. The following statements are equivalent and create a non-persistent database:
conn = duckdb.connect(database=":default:")
conn = duckdb.connect(database=":memory:")
conn = duckdb.connect(":memory:")
conn = duckdb.connect()
Once the connection is established you use the same methods on the connection object as on the duckdb object:
conn = duckdb.connect()
conn.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
|  | Oxygen | RunTime | RestPulse |
|---|---|---|---|
0 | 46.672 | 10.00 | 48 |
1 | 45.441 | 9.63 | 48 |
2 | 45.118 | 11.08 | 48 |
3 | 39.203 | 12.88 | 44 |
To create or access a persisted database, specify the name of the database using a .db, .ddb, or .duckdb extension. The following statements create or open the database test.db, read a CSV file into a Pandas DataFrame, and create the table fitness from the DataFrame in the database.
conn = duckdb.connect(database="test.db")
df = conn.read_csv('../datasets/fitness.csv').df()
conn.execute('CREATE OR REPLACE TABLE fitness AS SELECT * FROM df')
<duckdb.duckdb.DuckDBPyConnection at 0x168f82270>
You can see the tables in a database by querying the information schema:
conn.sql('select * from information_schema.tables')
┌───────────────┬──────────────┬────────────┬────────────┬───┬────────────────────┬──────────┬───────────────┐
│ table_catalog │ table_schema │ table_name │ table_type │ … │ is_insertable_into │ is_typed │ commit_action │
│ varchar │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
├───────────────┼──────────────┼────────────┼────────────┼───┼────────────────────┼──────────┼───────────────┤
│ test │ main │ fitness │ BASE TABLE │ … │ YES │ NO │ NULL │
├───────────────┴──────────────┴────────────┴────────────┴───┴────────────────────┴──────────┴───────────────┤
│ 1 rows 12 columns (7 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
or with
"show tables") conn.sql(
┌─────────┐
│ name │
│ varchar │
├─────────┤
│ fitness │
└─────────┘
And you can see the available databases with
"show databases") conn.sql(
┌───────────────┐
│ database_name │
│ varchar │
├───────────────┤
│ test │
└───────────────┘
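Once the fitness table exists in the persisted database, you can query it like any other table. A brief sketch, continuing with the connection from above:
# Query the table we just created in test.db and return a Pandas DataFrame.
conn.sql("SELECT Age, Oxygen, RunTime FROM fitness WHERE Age > 50").df()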
You could have created the table in the database from the CSV file in a single step:
conn = duckdb.connect(database="ads5064.db")
conn.execute('CREATE TABLE fitness AS SELECT * FROM "../data/fitness.csv"')
Notice the use of single and double quotes to distinguish the string for the file name inside the SQL statement.
To close the connection from the Python session to the database, use
conn.close()
11.3 Cloud Data Resources
Cloud resources are increasingly important for data storage and analytics. The growth trends in the cloud are staggering. Storing data in cloud file, block, or object storage is a $78.6 billion market with a compound annual growth rate (CAGR) of 18.5% (2022 numbers). Cloud databases in 2023 are a $21.3 billion market growing at 22%.
You will encounter cloud-based data in your data science work, and you will take advantage of cloud resources, especially for handling and analyzing large data sets.
The most important cloud storage solutions for data are
- Object storage
- Cloud databases
- Cloud data warehouses (Snowflake, BigQuery)
- Cloud lakehouse (Databricks)
Object storage
Object, file, and block storage work differently. In object storage, data is managed as generic objects. These can be files or parts of files and combine the data to be stored with metadata about the object being stored. Metadata could be, for example, a unique identifier, size, creation date, time, and location for an image. Objects are stored in a flat structure; the object’s identifier is used to locate it in the storage system. Amazon Web Services (AWS) calls the container that objects are stored in a bucket (rather than a directory).
File storage is what you are used to from a directory-based file system on your computer’s operating system. Data is stored in files, organized in hierarchical folders. You look up data by traversing the directory until you have located the file with the correct name.
Block storage breaks data into fixed-sized blocks and stores the blocks as units. The Hadoop Distributed File System (HDFS) is a block-oriented storage system, where data is split into large fixed-size blocks (128 MB by default) and the blocks are replicated on multiple machines in a distributed system for durability and high availability.
Cloud object storage has nearly replaced other Big Data storage solutions such as Hadoop and HDFS. You can still find Hadoop environments in organizations; lots of money was invested in building those and organizations are slow to migrate off Hadoop. But the writing is on the wall: HDFS is being replaced with cloud storage, in particular cloud object storage, because it is cheaper and more durable. Since object storage supports storing files as objects, is the most economical storage alternative, and can scale out without limits, it is the primary storage form for file-based data in the cloud. Even cloud data warehouses have been built with cloud object storage as the primary storage layer.
The object storage systems of the three major cloud providers are Amazon S3 (Simple Storage Service) from AWS, Azure Blob Storage, and Google Cloud Storage. Common file formats you find in these storage systems are CSV, JSON, and Parquet.
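Pandas can read such objects directly from object storage via a URL. The following is only a sketch: the bucket and key are hypothetical, and it assumes the s3fs package is installed and AWS credentials are configured.
import pandas as pd

# Hypothetical bucket and object key; requires the s3fs package and AWS credentials.
df_s3 = pd.read_csv("s3://my-bucket/path/to/data.csv")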
Managed cloud database
TODO
MotherDuck serverless cloud database
The convenience of a serverless database is not having to worry about the cloud instances the database is running on. With a local, on-premises database, you manage the storage and compute infrastructure—the machines—the database is running on. With a managed service (DBaaS), this aspect is taken care of by the managed service provider. However, you still must manage your database instances. For example, during periods of low usage, you might want to scale down the database or even shut it down to reduce cloud expenses.
Example: Please Turn Off Your Instances
A common mistake by cloud newbies is to assume that just because you are not actively using the cloud, you are not incurring expenses. When you start an instance in the cloud you need to stop it when it is not in use. Otherwise, it keeps running. Even if there is no workload on the instance, it will cost you money.
A class at a major university—which shall not be named—used cloud resources and the instructors forgot to tell students to shut down their cloud instances. Instead, the students started new instances every time they needed to do cloud computing. The academic unit conducting the class had allocated a $40,000 cloud computing budget over the next three years. The class spent $140,000 in one month.
A serverless system automatically scales to zero when not in use for a while. With serverless systems you tend to pay only for what you use, when you use it. An exception is the charge for storing data; you cannot turn the storage off. The MotherDuck serverless database is a great complement to the DuckDB analytic database. In fact, MotherDuck is built on DuckDB and you can think of it as the cloud version of DuckDB. That makes MotherDuck a good choice for analytical workloads since DuckDB is a relational column-store, an analytic database.
Another nice feature of MotherDuck is the ability to connect to local databases and to cloud databases at the same time. This hybrid mode is useful if you have large data sets in cloud storage and small-to-large data sets stored locally. Since we already know how to work with DuckDB from Python, working with MotherDuck is a snap. The following statements create a connection to a MotherDuck serverless database:
import duckdb
con = duckdb.connect('md:')
Notice that we use the same duckdb library as previously. The special string 'md:' or 'motherduck:' inside the connect() function triggers a connection attempt to MotherDuck. Whether the connection attempt succeeds depends on whether you can authenticate to MotherDuck. By default, the connect() function will open a browser through which you can authenticate to the MotherDuck service using a previously established account.
Alternatively, you can click on Settings in the UI of the MotherDuck console at https://app.motherduck.com/ and download your authentication token. Store this token in the environment variable motherduck_token on your system and you can authenticate and connect directly to MotherDuck with the previous duckdb.connect()
statement. It is also possible to list the authentication token directly in the duckdb.connect()
function,
con = duckdb.connect('md:?motherduck_token=<token>')
but this is discouraged. Access credentials should not be shown in clear text in a source file that can easily be read and that might be shared with others. That is a security nightmare waiting to happen.
DuckDB and MotherDuck are relational database management systems; they manage more than one database. You can see the list of databases with the SHOW DATABASES command:
con.sql("SHOW DATABASES")
By default, MotherDuck creates a database called my_db
and a shared database with sample data. One of the cool features of MotherDuck is the ability to attach local databases to the instance. The following statement attaches a local version of the database for this course to MotherDuck.
"ATTACH 'ads5064.ddb'")
con.sql("SHOW DATABASES") con.sql(
Since MotherDuck manages multiple databases you have to tell it which database to work with. This is called the current database; by default it is my_db when you first connect. You can see which database is current with
con.sql("SELECT current_database()").show()
You can set any database as the current one with the USE statement:
"USE ads5064")
con.sql("SHOW TABLES") con.sql(
You can access a table in a database that is not the current database by using a two-level name (database.table). This even works if one table is local and the other is in a cloud database. Suppose that ads5064 is a local version of the database for the course and that ads5064_md
is stored in the cloud.
"SELECT count(*) FROM ads5064.apples")
con.sql("SELECT count(*) FROM ads5064_md.glaucoma") con.sql(
This comes in very handy if you want to process local and cloud data in the same query, for example when joining tables.
MotherDuck also supports shared read-only databases, to which you connect through a URL. The URL is created by an administrator of the database with the CREATE SHARE command. You can then attach the database by using the URL:
"ATTACH 'md:_share/ads5064_md/e00063c9-568c-45ec-a372-dca85f6915fd' as course_db") con.sql(
The URL is a link to a point-in-time snapshot of the database when it was shared. The administrator can update the share of the database; the link will remain valid.