11 Data Access
In this section we use Python libraries to access data in the major file formats presented in a previous chapter (CSV, JSON, Parquet, and ORC) and convert the files into DataFrames. The Pandas DataFrame format is common but there are others to consider:
Polars DataFrames are highly parallelized, support more efficient string handling than Pandas, and allow both lazy and eager execution.
Arrow tables use a highly efficient in-memory columnar format for fast analytics.
Python has built-in functions for reading some data formats, for example, csv.reader() for CSV files. Here, however, we use the pandas and pyarrow packages to read and write files in common formats. Pandas supports an impressive list of I/O formats, and data sets likely end up as pandas DataFrames anyway, so we might as well convert from the file format directly to a DataFrame. The pyarrow library is becoming increasingly important in data engineering for data science; Pandas uses the pyarrow engine for several of the important formats, e.g., Parquet files. In Pandas 2.0 you can choose whether to use a NumPy-based engine or a pyarrow-based engine for data manipulation and analysis.
Fortunately, the methods for reading data in Pandas and Polars are very similar, so it is relatively easy to switch from one DataFrame representation to another. For example, the following statements show how to read a CSV file into a Pandas and a Polars DataFrame.
import pandas as pd
import polars as pl

pandas_df = pd.read_csv("../datasets/iris.csv")
polars_df = pl.read_csv("../datasets/iris.csv")
Once you start working with DataFrames the syntax can be quite different, however.
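As a quick illustration of that difference, here is a minimal sketch of the same row filter in both libraries. The column name sepal_length is an assumption about the header of iris.csv and may need to be adjusted.
# Filter rows in Pandas (boolean indexing) vs. Polars (filter with an expression).
# The column name "sepal_length" is hypothetical; match it to the actual CSV header.
pandas_subset = pandas_df[pandas_df["sepal_length"] > 5.0]
polars_subset = polars_df.filter(pl.col("sepal_length") > 5.0)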
11.1 Accessing Local Files
CSV files
To convert CSV files to a pandas DataFrame use the pandas read_csv()
method. The default behavior is to look for a header line with column names, to interpret commas as field delimiters, to skip blank lines in the file, and to apply utf-8 encoding.
Consider the following file, fitness.csv:
data> head fitness.csv
Age,Weight,Oxygen,RunTime,RestPulse,RunPulse,MaxPulse
44,89.47,44.609,11.37,62,178,182
40,75.07,45.313,10.07,62,185,185
44,85.84,54.297, 8.65,45,156,168
42,68.15,59.571, 8.17,40,166,172
38,89.02,49.874, 9.22,55,178,180
The first line contains comma-separated names for the columns. The following lines contain the data for the attributes, one record per line. To convert the CSV file to a pandas DataFrame use the read_csv() method:
import numpy as np
df = pd.read_csv("../datasets/fitness.csv")
print("Default result from read_csv")
print(df.head())
df.describe(include=[np.float64])
Default result from read_csv
Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse
0 44 89.47 44.609 11.37 62 178 182
1 40 75.07 45.313 10.07 62 185 185
2 44 85.84 54.297 8.65 45 156 168
3 42 68.15 59.571 8.17 40 166 172
4 38 89.02 49.874 9.22 55 178 180
|  | Weight | Oxygen | RunTime |
|---|---|---|---|
count | 31.000000 | 31.000000 | 31.000000 |
mean | 77.444516 | 47.375806 | 10.586129 |
std | 8.328568 | 5.327231 | 1.387414 |
min | 59.080000 | 37.388000 | 8.170000 |
25% | 73.200000 | 44.964500 | 9.780000 |
50% | 77.450000 | 46.774000 | 10.470000 |
75% | 82.325000 | 50.131000 | 11.270000 |
max | 91.630000 | 60.055000 | 14.030000 |
You can assign your own variable names with the names= option:
df2 = pd.read_csv("../datasets/fitness.csv",
                  header=0,
                  names=["Age", "Wgt", "Oxy", "RT", "ReP", "RuP", "MP"])
df2.head()
|  | Age | Wgt | Oxy | RT | ReP | RuP | MP |
|---|---|---|---|---|---|---|---|
0 | 44 | 89.47 | 44.609 | 11.37 | 62 | 178 | 182 |
1 | 40 | 75.07 | 45.313 | 10.07 | 62 | 185 | 185 |
2 | 44 | 85.84 | 54.297 | 8.65 | 45 | 156 | 168 |
3 | 42 | 68.15 | 59.571 | 8.17 | 40 | 166 | 172 |
4 | 38 | 89.02 | 49.874 | 9.22 | 55 | 178 | 180 |
String variables are indicated in the CSV file with quotes:
data> head herding-cats.csv
"address_full","street","coat","sex","age","weight","fixed","wander_dist","roamer","cat_id"
"15 Fillmer Ave Los Gatos 95030","15 Fillmer Ave","brown","male",5.594,5.016,1,0.115,"yes",1
"244 Harding Ave Los Gatos 95030","244 Harding Ave","tabby","male",6.852,6.314,1,0.129,"yes",2
"16570 Marchmont Dr Los Gatos 95032","16570 Marchmont Dr","maltese","female",4.081, 2.652,0, 0.129,"yes",3
"100 Stonybrook Rd Los Gatos 95032","100 Stonybrook Rd","tabby","female",3.806,3.413,1, 0.107,"yes",4
cats = pd.read_csv("../datasets/herding-cats.csv",
                   index_col="cat_id")
print(cats.head())
address_full street coat \
cat_id
1 15 Fillmer Ave Los Gatos 95030 15 Fillmer Ave brown
2 244 Harding Ave Los Gatos 95030 244 Harding Ave tabby
3 16570 Marchmont Dr Los Gatos 95032 16570 Marchmont Dr maltese
4 100 Stonybrook Rd Los Gatos 95032 100 Stonybrook Rd tabby
5 266 Kennedy Rd Los Gatos 95032 266 Kennedy Rd calico
sex age weight fixed wander_dist roamer
cat_id
1 male 5.594 5.016 1 0.115 yes
2 male 6.852 6.314 1 0.129 yes
3 female 4.081 2.652 0 0.129 yes
4 female 3.806 3.413 1 0.107 yes
5 male 5.164 6.820 1 0.039 no
Pandas is smart about assigning data types to columns; note that string columns are stored as data type object.
cats.info()
<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, 1 to 400
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 address_full 400 non-null object
1 street 394 non-null object
2 coat 400 non-null object
3 sex 400 non-null object
4 age 400 non-null float64
5 weight 400 non-null float64
6 fixed 400 non-null int64
7 wander_dist 400 non-null float64
8 roamer 400 non-null object
dtypes: float64(3), int64(1), object(5)
memory usage: 31.2+ KB
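The object columns above reflect the default NumPy-backed storage. Related to the engine choice mentioned at the start of this chapter, here is a minimal sketch of reading the same file with pyarrow-backed data types instead; it assumes pandas 2.x with the pyarrow package installed.
# Request pyarrow-backed dtypes; string columns become string[pyarrow] rather than object.
# Assumes pandas >= 2.0 and pyarrow are installed.
cats_arrow = pd.read_csv("../datasets/herding-cats.csv",
                         index_col="cat_id",
                         dtype_backend="pyarrow")
cats_arrow.info()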
You can overwrite the data types by supplying either a default type or a dictionary of types. To change the data type for the fixed
column, for example:
cats = pd.read_csv("../datasets/herding-cats.csv",
                   dtype={'fixed': np.float16})
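The other case mentioned above, a single default type applied to every column, looks like the following sketch. Reading everything as strings is rarely what you want for numeric analysis, but it is handy for inspecting messy files.
# Apply one default dtype to all columns: every value is read as a string.
cats_str = pd.read_csv("../datasets/herding-cats.csv", dtype=str)
cats_str.dtypes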
Missing values (see below) can be indicated in CSV files in several ways. The common technique is to leave the entry for the unobserved value empty (no spaces). In the following file the values for sale and appraisal are missing in the third data row. The fifth data row contains values for land and sale only.
data> head landsales2.csv
land, improve, total, sale, appraisal
30000,64831,94831,118500,1.25
30000,50765,80765,93900,1.16
46651,18573,65224,,
45990,91402,137392,184000,1.34
42394,,,168000,
,133351,,169000,
63596,2182,65778,,
56658,153806,210464,255000,1.21
read_csv() converts the empty entries in this file correctly to missing values, using the NaN (not-a-number) sentinel for the unobserved data:
land = pd.read_csv("../datasets/landsales2.csv")
land
|  | land | improve | total | sale | appraisal |
|---|---|---|---|---|---|
0 | 30000.0 | 64831.0 | 94831.0 | 118500.0 | 1.25 |
1 | 30000.0 | 50765.0 | 80765.0 | 93900.0 | 1.16 |
2 | 46651.0 | 18573.0 | 65224.0 | NaN | NaN |
3 | 45990.0 | 91402.0 | 137392.0 | 184000.0 | 1.34 |
4 | 42394.0 | NaN | NaN | 168000.0 | NaN |
5 | NaN | 133351.0 | NaN | 169000.0 | NaN |
6 | 63596.0 | 2182.0 | 65778.0 | NaN | NaN |
7 | 56658.0 | 153806.0 | 210464.0 | 255000.0 | 1.21 |
8 | 51428.0 | 72451.0 | 123879.0 | NaN | NaN |
9 | 93200.0 | NaN | NaN | 422000.0 | NaN |
10 | 76125.0 | 78172.0 | 275297.0 | 290000.0 | 1.14 |
11 | 154360.0 | 61934.0 | 216294.0 | 237000.0 | 1.10 |
12 | 65376.0 | NaN | NaN | 286500.0 | NaN |
13 | 42400.0 | NaN | NaN | NaN | NaN |
14 | 40800.0 | 92606.0 | 133406.0 | 168000.0 | 1.26 |
In addition to empty entries, read_csv() interprets a number of entries as missing (unobserved) values: “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null”. Beyond those, CSV files might use other special values to indicate missing data, for example, when they are exported from other software packages. Suppose that the missing values are indicated by a dot (“.”) and that the columns in the CSV file are formatted with white space:
data> head landsales.csv
land, improve, total, sale, appraisal
30000, 64831, 94831, 118500, 1.25
30000, 50765, 80765, 93900, 1.16
46651, 18573, 65224, ., .
45990, 91402, 137392, 184000, 1.34
42394, ., ., 168000, .
., 133351, ., 169000, .
63596, 2182, 65778, ., .
56658, 153806, 210464, 255000, 1.21
51428, 72451, 123879, ., .
The file is much more human readable, but requires additional options to convert to the intended DataFrame:
land2 = pd.read_csv("../datasets/landsales.csv",
                    na_values=".",
                    skipinitialspace=True)
land2.head()
|  | land | improve | total | sale | appraisal |
|---|---|---|---|---|---|
0 | 30000.0 | 64831.0 | 94831.0 | 118500.0 | 1.25 |
1 | 30000.0 | 50765.0 | 80765.0 | 93900.0 | 1.16 |
2 | 46651.0 | 18573.0 | 65224.0 | NaN | NaN |
3 | 45990.0 | 91402.0 | 137392.0 | 184000.0 | 1.34 |
4 | 42394.0 | NaN | NaN | 168000.0 | NaN |
The na_values= option specifies additional values that should be recognized as missing values. The skipinitialspace= option tells read_csv() to ignore spaces following the delimiters. The result is as intended.
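Going in the other direction, writing a DataFrame back out is a one-liner. As a sketch, the following writes the cleaned data to a new CSV file; the output path is hypothetical.
# Write the cleaned DataFrame back to disk; index=False omits the row index column.
land2.to_csv("../datasets/landsales_clean.csv", index=False)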
JSON files
Pandas has great support for reading and writing JSON files. A general issue is that JSON is a hierarchical data format that allows nested data structures, whereas pandas DataFrames have a row-column layout. Fortunately, the pandas read_json() method has many options to shape nested JSON structures into DataFrames.
You can convert any JSON document into a Python dictionary with the load() function in the json library. Suppose we want to read this document, stored in ../datasets/JSON/simple1.json, into Python:
> cat ../datasets/JSON/simple1.json
{
"firstName": "Jane",
"lastName": "Doe",
"hobbies": ["running", "sky diving", "singing"],
"age": 35,
"children": [
{
"firstName": "Alice",
"age": 6
},
{
"firstName": "Bob",
"age": 8
}
]
}
import numpy as np
import pandas as pd
import json
with open("../datasets/JSON/simple1.json","r") as f:
    data = json.load(f)
data
{'firstName': 'Jane',
'lastName': 'Doe',
'hobbies': ['running', 'sky diving', 'singing'],
'age': 35,
'children': [{'firstName': 'Alice', 'age': 6},
{'firstName': 'Bob', 'age': 8}]}
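Once the document is an ordinary dictionary, you can pull nested pieces into a DataFrame yourself. As a small sketch, the list under the children key shown above converts directly:
# Build a DataFrame from the list of child records in the dictionary.
children_df = pd.DataFrame(data["children"])
children_df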
The next example reads a record-oriented JSON file (../datasets/JSON/simple.json) directly into a DataFrame. Each record consists of the fields id, name, math, statistics, and weight.
cat simple.json
[
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{
"id": "B00422",
"name": "Andrew",
"math": 89,
"statistics": 80,
"weight": 210.6
},
{
"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
"weight": 167.0
}
]
df = pd.read_json("../datasets/JSON/simple.json")
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null object
1 name 3 non-null object
2 math 3 non-null int64
3 statistics 3 non-null int64
4 weight 3 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 252.0+ bytes
|  | id | name | math | statistics | weight |
|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 |
1 | B00422 | Andrew | 89 | 80 | 210.6 |
2 | C004543 | Tobias | 79 | 90 | 167.0 |
If you’d rather assign data type string to the id and name fields than data type object, you can set the data type with:
df2 = pd.read_json("../datasets/JSON/simple.json",
                   dtype={"id": "string", "name": "string"})
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 3 non-null string
1 name 3 non-null string
2 math 3 non-null int64
3 statistics 3 non-null int64
4 weight 3 non-null float64
dtypes: float64(1), int64(2), string(2)
memory usage: 252.0 bytes
How can we handle a nested JSON structure? The data in ../datasets/JSON/nested_array.json is:
data> cat nested_array.json
{
"school_name": "Virginia Tech",
"class": "Class of 2026",
"students": [
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{
"id": "B00422",
"name": "Andrew",
"math": 89,
"statistics": 80,
"weight": 210.6
},
{
"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
"weight": 167.0
}]
}
The students field has a record structure, but there are additional fields school_name and class that do not fit that structure. Converting the JSON file directly into a DataFrame does not produce the desired result:
df3 = pd.read_json('../datasets/JSON/nested_array.json')
df3
|  | school_name | class | students |
|---|---|---|---|
0 | Virginia Tech | Class of 2026 | {'id': 'A00123', 'name': 'David', 'math': 70, ... |
1 | Virginia Tech | Class of 2026 | {'id': 'B00422', 'name': 'Andrew', 'math': 89,... |
2 | Virginia Tech | Class of 2026 | {'id': 'C004543', 'name': 'Tobias', 'math': 79... |
To create the appropriate DataFrame we take two steps: convert the JSON file into a dictionary, then shape the dictionary into a DataFrame with json_normalize():
with open('../datasets/JSON/nested_array.json','r') as f:
    data = json.load(f)

df4 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["school_name", "class"])
df4
|  | id | name | math | statistics | weight | school_name | class |
|---|---|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 | Virginia Tech | Class of 2026 |
1 | B00422 | Andrew | 89 | 80 | 210.6 | Virginia Tech | Class of 2026 |
2 | C004543 | Tobias | 79 | 90 | 167.0 | Virginia Tech | Class of 2026 |
The meta= option lists the JSON fields that apply as metadata to each record in the result table. The record_path= option points at the field that contains the list of records.
Now suppose that in addition to the list of records the JSON metadata contains a more complex structure:
cat nested_obj_array.json
{
"school_name": "Virginia Tech",
"class": "Class of 2026",
"info": {
"president": "Timothy D. Sands",
"address": "Office of the President",
"social": {
"LinkedIn": "tim-sands-49b8b95",
"Twitter": "@VTSandsman"
}
},
"students": [
{
"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
. . .
}]
}
You can specify which fields to extract from the metadata in the meta= option of json_normalize():
with open('../datasets/JSON/nested_obj_array.json','r') as f:
    data = json.load(f)

df5 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["school_name", "class",
                              ["info", "president"],
                              ["info", "social", "Twitter"]])
df5
|  | id | name | math | statistics | weight | school_name | class | info.president | info.social.Twitter |
|---|---|---|---|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86 | 156.3 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
1 | B00422 | Andrew | 89 | 80 | 210.6 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
2 | C004543 | Tobias | 79 | 90 | 167.0 | Virginia Tech | Class of 2026 | Timothy D. Sands | @VTSandsman |
How are missing values handled? In the next example, the first student has a complete record, the second student is missing a statistics score, and the third student has no weight data. pd.json_normalize() fills in NaN for the unobserved data.
data = [
    {"students": [
{"id": "A00123",
"name": "David",
"math": 70,
"statistics": 86,
"weight": 156.3
},
{"id": "B00422",
"name": "Andrew",
"math": 89,
"weight": 210.6
},
{"id": "C004543",
"name": "Tobias",
"math": 79,
"statistics": 90,
}
]
}
]
df6 = pd.json_normalize(data, record_path=["students"])
df6
|  | id | name | math | statistics | weight |
|---|---|---|---|---|---|
0 | A00123 | David | 70 | 86.0 | 156.3 |
1 | B00422 | Andrew | 89 | NaN | 210.6 |
2 | C004543 | Tobias | 79 | 90.0 | NaN |
json_normalize() is less forgiving if fields in the metadata are unobserved; it will throw an error. To prevent the error and assign NaN values, use the errors='ignore' option. In the following example the teachers metadata field does not always have an l_name entry.
data = [
    {'class': 'STAT 5525',
     'student count': 60,
     'room': 'Yellow',
     'teachers': {
         'f_name': 'Elon',
         'l_name': 'Musk',
     },
     'students': [
         {'name': 'Tom', 'sex': 'M'},
         {'name': 'James', 'sex': 'M'},
     ]
    },
    {'class': 'STAT 5526',
     'student count': 45,
     'room': 'Blue',
     'teachers': {
         'f_name': 'Albert'
     },
     'students': [
         {'name': 'Tony', 'sex': 'M'},
         {'name': 'Jacqueline', 'sex': 'F'},
     ]
    },
]
df7 = pd.json_normalize(data,
                        record_path=["students"],
                        meta=["class", "room", ["teachers", "l_name"]],
                        errors='ignore')
df7
|  | name | sex | class | room | teachers.l_name |
|---|---|---|---|---|---|
0 | Tom | M | STAT 5525 | Yellow | Musk |
1 | James | M | STAT 5525 | Yellow | Musk |
2 | Tony | M | STAT 5526 | Blue | NaN |
3 | Jacqueline | F | STAT 5526 | Blue | NaN |
Parquet files
To read Parquet files into pandas DataFrames we have two choices: the pandas read_parquet() method or the read_table() method from pyarrow.parquet.
The data we work with in this section is stored in a parquet multi-file structure; that means the data are split into multiple files stored in a local directory:
Parquet> ll userdata
total 1120
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:36 userdata1.parquet
-rw-r--r--@ 1 olivers staff 110K Aug 31 14:36 userdata2.parquet
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:37 userdata3.parquet
-rw-r--r--@ 1 olivers staff 110K Aug 31 14:37 userdata4.parquet
-rw-r--r--@ 1 olivers staff 111K Aug 31 14:37 userdata5.parquet
Each of the *.parquet files in the directory contains 1,000 observations.
Pandas’ read_parquet() reads the entire directory and combines the records from the multiple files into a single DataFrame.
df = pd.read_parquet("../datasets/Parquet/userdata", engine="pyarrow")
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 registration_dttm 5000 non-null datetime64[ns]
1 id 4999 non-null float64
2 first_name 5000 non-null object
3 last_name 5000 non-null object
4 email 5000 non-null object
5 gender 5000 non-null object
6 ip_address 5000 non-null object
7 cc 5000 non-null object
8 country 5000 non-null object
9 birthdate 5000 non-null object
10 salary 4689 non-null float64
11 title 5000 non-null object
12 comments 4966 non-null object
dtypes: datetime64[ns](1), float64(2), object(10)
memory usage: 507.9+ KB
|  | registration_dttm | id | salary |
|---|---|---|---|
count | 5000 | 4999.000000 | 4689.000000 |
mean | 2016-02-03 21:04:13.699399936 | 500.598720 | 150772.222890 |
min | 2016-02-03 00:00:07 | 1.000000 | 12068.960000 |
25% | 2016-02-03 08:51:55.750000128 | 251.000000 | 83480.380000 |
50% | 2016-02-03 20:01:17 | 501.000000 | 152877.190000 |
75% | 2016-02-04 07:42:54.500000 | 750.500000 | 215405.220000 |
max | 2016-02-04 23:59:55 | 1000.000000 | 286735.820000 |
std | NaN | 288.648331 | 78171.513062 |
The size of df confirms that all 5,000 observations were retrieved from the five component files.
With pyarrow we have more control over the files; for example, we can inquire about the schema and metadata of the component files.
import pyarrow.parquet as pq
pq_file = pq.ParquetFile("../datasets/Parquet/userdata/userdata1.parquet")
pq_file.metadata
pq_file.schema
<pyarrow._parquet.ParquetSchema object at 0x147814b00>
required group field_id=-1 hive_schema {
optional int96 field_id=-1 registration_dttm;
optional int32 field_id=-1 id;
optional binary field_id=-1 first_name (String);
optional binary field_id=-1 last_name (String);
optional binary field_id=-1 email (String);
optional binary field_id=-1 gender (String);
optional binary field_id=-1 ip_address (String);
optional binary field_id=-1 cc (String);
optional binary field_id=-1 country (String);
optional binary field_id=-1 birthdate (String);
optional double field_id=-1 salary;
optional binary field_id=-1 title (String);
optional binary field_id=-1 comments (String);
}
To convert all component files into a pandas DataFrame with pyarrow, use the read_table() method and send the result to to_pandas():
df2 = pq.read_table(source="../datasets/Parquet/userdata/").to_pandas()
df2.describe()
|  | registration_dttm | id | salary |
|---|---|---|---|
count | 5000 | 4999.000000 | 4689.000000 |
mean | 2016-02-03 21:04:13.699399936 | 500.598720 | 150772.222890 |
min | 2016-02-03 00:00:07 | 1.000000 | 12068.960000 |
25% | 2016-02-03 08:51:55.750000128 | 251.000000 | 83480.380000 |
50% | 2016-02-03 20:01:17 | 501.000000 | 152877.190000 |
75% | 2016-02-04 07:42:54.500000 | 750.500000 | 215405.220000 |
max | 2016-02-04 23:59:55 | 1000.000000 | 286735.820000 |
std | NaN | 288.648331 | 78171.513062 |
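Because Parquet is a columnar format, pyarrow can also read just the columns you need, which saves I/O on wide data sets. A minimal sketch, with column names taken from the schema listed above:
# Read only three columns from the multi-file Parquet data set.
subset = pq.read_table(source="../datasets/Parquet/userdata/",
                       columns=["id", "salary", "country"]).to_pandas()
subset.head()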
ORC files
Pandas (and pyarrow) supports ORC files, but working with ORC files is not quite as convenient as with Parquet files. The read_orc() method reads one ORC file rather than an entire directory, and it is not supported on Windows.
df = pd.read_orc("../datasets/ORC/userdata/userdata1.orc")
When data is stored in multi-file format, as in the parquet example, you need to concatenate the contents into a pandas DataFrame. The glob module is helpful to retrieve the list of file names that match a pattern.
import glob
= "../datasets/ORC/userdata"
folder_path = glob.glob(folder_path + "/*.orc")
file_list = pd.DataFrame(pd.read_orc(file_list[0]))
df_from_orc
for i in range(1,len(file_list)):
= pd.read_orc(file_list[i])
data = pd.DataFrame(data)
df = pd.concat([df_from_orc,df],axis=0) df_from_orc
You can also use the read_table() method for ORC files in pyarrow to read an ORC file into a pyarrow table and to_pandas() to convert the table to a pandas DataFrame:
from pyarrow import orc
df = orc.read_table(source="../datasets/ORC/userdata/userdata1.orc").to_pandas()
11.2 Working with a Database
As a data scientist you need to be comfortable working with databases. In analytical work, you will encounter schema-centric relational OLAP databases that use SQL (structured query language) to interact with the database and the data. SQL interfaces are standard for data warehouses and common for many data platforms and tools.
Data scientists generally prefer writing Python or R code over SQL. SQL is extremely powerful to express relational concepts and can make some data processing tasks much easier than rolling your own code. Also, databases are optimized for joins, merges, deep queries, aggregations, indexes, handling JSON, etc. It would be a shame not to take advantage of that power or to lose it behind the syntactic sugar of a tool’s API. Many employers have voiced concerns that SQL ranks high among the technical skills data scientists often lack. As the founder and chairman of a data science research and consultancy company put it:
There are a lot of things folks don’t agree on. But everyone agrees that as a data scientist you will need to retrieve data from a database, and you will need version control. Learn SQL and learn git.
We are covering SQL basics in Chapter 15 after sprinkling some SQL statements along the way.
If you want to work with SQL directly from Python, there are several alternatives.
SQLAlchemy
SQLAlchemy is a SQL toolkit for Python that gives Python developers the power of SQL. While some tools try to hide SQL and relational details behind syntactic sugar, SQLAlchemy’s philosophy is to fully expose the relational details of SQL within transparent tools. Specific databases, beyond those for which support is built into SQLAlchemy, are supported as dialects.
Ibis
Ibis is a portable Python dataframe library that supports multiple database backends. After making a connection to a backend database, you interact with the data using Python syntax rather than SQL statements. The Ibis function calls are translated into SQL for the specific backend under the covers. For example, the first statement in the next code block creates a Python object named penguins
through which you can interact with the database table of the same name. The second statement filters the data for a specific species.
penguins = con.table("penguins")
penguins.filter(penguins.species == "Adelie")
The SQL equivalent would be a SELECT statement with a WHERE clause:
SELECT * FROM penguins WHERE species = 'Adelie'
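The con object above is a backend connection. As a minimal sketch, and assuming the ibis-framework package is installed and a DuckDB database file containing a penguins table exists at the hypothetical path shown, the connection could be created like this:
import ibis

# Connect to a DuckDB database file; the path is hypothetical.
con = ibis.duckdb.connect("penguins.ddb")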
Having stable Python code for different database backends makes it easy to move from one database to another. During development of analytic code you might be working with SQLite while the production database is Oracle. The same Python code will work in both environments.
The default backend of Ibis is the embedded database DuckDB. If you are working with DuckDB you can issue SQL statements directly from Python or you can rely on its dedicated Python API.
DuckDB
DuckDB has all the ingredients we are looking for in a database for data science work:
Designed to support analytical queries (OLAP workloads) characterized by complex queries that process large amounts of data.
Columnar, vectorized storage engine. Columnar storage is more efficient for analytical queries, and vectorized execution is much faster than processing rows sequentially.
Provides transactional guarantees (ACID properties).
Deeply integrated with Python or R for efficient data analysis.
Easy to install and operate, without any external dependencies.
Open-source database that is free to use.
Can query CSV, JSON, Parquet files and DataFrames directly.
Supports a streaming model for analyzing large data sets that exceed memory capacity, e.g., streaming from Parquet files.
In-memory database with graceful on-disk operation.
Easy transition to the cloud through partnership with MotherDuck.
SQL code can be submitted from the CLI or through the Python API, so you can take advantage of the power of SQL.
Runs as an embedded database within a host process.
Because DuckDB is an embedded database, there is no database server to install and to maintain. Because the database runs within the host process, data can be transferred very quickly. And if you want to move the data to the cloud, you can do that easily with the MotherDuck serverless platform.
The Python package for DuckDB can run queries directly against a Pandas DataFrame without copying any data.
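As a short sketch of that capability (the DataFrame here is made up for illustration), DuckDB can reference a local Pandas DataFrame by its variable name inside a SQL query:
import duckdb
import pandas as pd

# A small, made-up DataFrame; DuckDB finds it by variable name in the query.
scores = pd.DataFrame({"name": ["Ann", "Bo", "Cy"], "score": [91, 84, 77]})
duckdb.sql("SELECT name, score FROM scores WHERE score > 80").df()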
Why is it called DuckDB? We let the folks who created it explain it:
Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck’s song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system. Also the logo designs itself.
For the Python user, installation is as simple as with any other Python package:
pip install duckdb
If you install with conda, the command is
conda install python-duckdb -c conda-forge
DuckDB is an in-memory database. If you want to persist the database across Python sessions, then you need to create a connection that stores the data in a single file on disk. Otherwise, the database will go away when the Python session ends. You can work without a connection to a database by using methods on the duckdb module. For example, the following statements directly query a CSV file using DuckDB:
import duckdb
duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
┌───────┬────────┬────────┬─────────┬───────────┬──────────┬──────────┐
│ Age │ Weight │ Oxygen │ RunTime │ RestPulse │ RunPulse │ MaxPulse │
│ int64 │ double │ double │ double │ int64 │ int64 │ int64 │
├───────┼────────┼────────┼─────────┼───────────┼──────────┼──────────┤
│ 54 │ 83.12 │ 51.855 │ 10.33 │ 50 │ 166 │ 170 │
│ 51 │ 69.63 │ 40.836 │ 10.95 │ 57 │ 168 │ 172 │
│ 51 │ 77.91 │ 46.672 │ 10.0 │ 48 │ 162 │ 168 │
│ 57 │ 73.37 │ 39.407 │ 12.63 │ 58 │ 174 │ 176 │
│ 54 │ 79.38 │ 46.08 │ 11.17 │ 62 │ 156 │ 165 │
│ 52 │ 76.32 │ 45.441 │ 9.63 │ 48 │ 164 │ 166 │
│ 51 │ 67.25 │ 45.118 │ 11.08 │ 48 │ 172 │ 172 │
│ 54 │ 91.63 │ 39.203 │ 12.88 │ 44 │ 168 │ 172 │
│ 51 │ 73.71 │ 45.79 │ 10.47 │ 59 │ 186 │ 188 │
│ 57 │ 59.08 │ 50.545 │ 9.93 │ 49 │ 148 │ 155 │
│ 52 │ 82.78 │ 47.467 │ 10.5 │ 53 │ 170 │ 172 │
├───────┴────────┴────────┴─────────┴───────────┴──────────┴──────────┤
│ 11 rows 7 columns │
└─────────────────────────────────────────────────────────────────────┘
The object returned by duckdb is a relation, a symbolic representation of the query that can be referenced in other queries. The next statements save the result of the previous query and reference the relation in the second query:
rel = duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49")
┌────────┬─────────┬───────────┐
│ Oxygen │ RunTime │ RestPulse │
│ double │ double │ int64 │
├────────┼─────────┼───────────┤
│ 46.672 │ 10.0 │ 48 │
│ 45.441 │ 9.63 │ 48 │
│ 45.118 │ 11.08 │ 48 │
│ 39.203 │ 12.88 │ 44 │
└────────┴─────────┴───────────┘
The results of a DuckDB query can be easily converted into other formats. fetchnumpy() fetches data as a dictionary of NumPy arrays, df() fetches a Pandas DataFrame, pl() fetches a Polars DataFrame, and arrow() fetches an Arrow table.
duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
|  | Oxygen | RunTime | RestPulse |
|---|---|---|---|
0 | 46.672 | 10.00 | 48 |
1 | 45.441 | 9.63 | 48 |
2 | 45.118 | 11.08 | 48 |
3 | 39.203 | 12.88 | 44 |
Reading files
To read the major data science file formats into DuckDB, use the read_csv(), read_parquet(), read_json(), and read_arrow() methods. For example, to read the simple JSON file from the previous section:
duckdb.read_json('../datasets/JSON/simple.json')
┌─────────┬─────────┬───────┬────────────┬────────┐
│ id │ name │ math │ statistics │ weight │
│ varchar │ varchar │ int64 │ int64 │ double │
├─────────┼─────────┼───────┼────────────┼────────┤
│ A00123 │ David │ 70 │ 86 │ 156.3 │
│ B00422 │ Andrew │ 89 │ 80 │ 210.6 │
│ C004543 │ Tobias │ 79 │ 90 │ 167.0 │
└─────────┴─────────┴───────┴────────────┴────────┘
To read a directory of Parquet files, you can use wildcard syntax in the path of the read_parquet() function:
duckdb.read_parquet('../datasets/Parquet/userdata/user*.parquet')
┌─────────────────────┬───────┬────────────┬───┬────────────┬───────────┬──────────────────────┬──────────────────────┐
│ registration_dttm │ id │ first_name │ … │ birthdate │ salary │ title │ comments │
│ timestamp │ int32 │ varchar │ │ varchar │ double │ varchar │ varchar │
├─────────────────────┼───────┼────────────┼───┼────────────┼───────────┼──────────────────────┼──────────────────────┤
│ 2016-02-03 16:07:46 │ 1 │ Ernest │ … │ │ 140639.36 │ │ │
│ 2016-02-03 21:52:07 │ 2 │ Anthony │ … │ 1/16/1998 │ 172843.61 │ Developer II │ 👾 🙇 💁 🙅 🙆 🙋 … │
│ 2016-02-03 02:22:19 │ 3 │ Ryan │ … │ 11/21/1978 │ 204620.66 │ Developer I │ ␢ │
│ 2016-02-03 04:20:04 │ 4 │ Brenda │ … │ 10/29/1998 │ 260474.12 │ GIS Technical Arch… │ │
│ 2016-02-03 00:15:16 │ 5 │ Jacqueline │ … │ 7/12/1959 │ 286038.78 │ Marketing Assistant │ │
│ 2016-02-03 19:48:14 │ 6 │ Paul │ … │ │ 241518.24 │ │ │
│ 2016-02-03 08:59:05 │ 7 │ Linda │ … │ 3/30/1988 │ 192756.38 │ Professor │ │
│ 2016-02-03 08:04:51 │ 8 │ Frances │ … │ │ 188511.28 │ │ <svg><script>0<1>a… │
│ 2016-02-03 08:12:33 │ 9 │ Jason │ … │ 7/29/1982 │ 238068.56 │ Web Designer III │ │
│ 2016-02-03 17:08:02 │ 10 │ Carolyn │ … │ 4/28/1977 │ 132718.26 │ Research Nurse │ │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ · │ · │ · │ · │ · │ · │ · │ · │
│ 2016-02-04 04:48:13 │ 991 │ Fred │ … │ 1/25/1975 │ 280835.07 │ Cost Accountant │ │
│ 2016-02-04 00:03:33 │ 992 │ Anna │ … │ 5/29/1962 │ 286181.88 │ Automation Special… │ │
│ 2016-02-04 07:05:48 │ 993 │ Donna │ … │ 6/3/1962 │ 245704.0 │ Senior Financial A… │ │
│ 2016-02-04 21:15:41 │ 994 │ Annie │ … │ │ 194931.47 │ │ │
│ 2016-02-04 05:05:12 │ 995 │ Carlos │ … │ 8/11/1986 │ 39424.88 │ Teacher │ │
│ 2016-02-04 21:45:05 │ 996 │ Stephanie │ … │ 5/9/1962 │ 273504.1 │ Chief Design Engin… │ │
│ 2016-02-04 09:20:21 │ 997 │ Kathy │ … │ 12/4/1986 │ 109525.48 │ Director of Sales │ │
│ 2016-02-04 00:26:05 │ 998 │ Louis │ … │ 11/20/1982 │ 13134.47 │ Office Assistant IV │ │
│ 2016-02-04 04:50:33 │ 999 │ Elizabeth │ … │ 3/14/1976 │ 207194.65 │ Research Assistant I │ │
│ 2016-02-04 16:14:42 │ 1000 │ Susan │ … │ 12/19/1999 │ 229961.89 │ Human Resources Ma… │ │
├─────────────────────┴───────┴────────────┴───┴────────────┴───────────┴──────────────────────┴──────────────────────┤
│ 5000 rows (20 shown) 13 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Connecting to the database
If you make an explicit connection to the database with the connect() method, the arguments determine whether the database resides completely in memory (and is destroyed when the Python process exits) or whether the database persists on disk. The following statements are equivalent and create a non-persistent database:
conn = duckdb.connect(database=":default:")
conn = duckdb.connect(database=":memory:")
conn = duckdb.connect(":memory:")
conn = duckdb.connect()
Once the connection is established you use the same methods on the connection object as on the duckdb object:
conn = duckdb.connect()
conn.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
|  | Oxygen | RunTime | RestPulse |
|---|---|---|---|
0 | 46.672 | 10.00 | 48 |
1 | 45.441 | 9.63 | 48 |
2 | 45.118 | 11.08 | 48 |
3 | 39.203 | 12.88 | 44 |
To create or access a persisted database, specify the name of the database using a .db, .ddb, or .duckdb extension. The following statements create or open the database test.db, read a CSV file into a Pandas DataFrame, and create the table fitness from the DataFrame in the database.
conn = duckdb.connect(database="test.db")
df = conn.read_csv('../datasets/fitness.csv').df()
conn.execute('CREATE OR REPLACE TABLE fitness AS SELECT * FROM df')
<duckdb.duckdb.DuckDBPyConnection at 0x168f82270>
You can see the tables in a database by querying the information schema:
conn.sql('select * from information_schema.tables')
┌───────────────┬──────────────┬────────────┬────────────┬───┬────────────────────┬──────────┬───────────────┐
│ table_catalog │ table_schema │ table_name │ table_type │ … │ is_insertable_into │ is_typed │ commit_action │
│ varchar │ varchar │ varchar │ varchar │ │ varchar │ varchar │ varchar │
├───────────────┼──────────────┼────────────┼────────────┼───┼────────────────────┼──────────┼───────────────┤
│ test │ main │ fitness │ BASE TABLE │ … │ YES │ NO │ NULL │
├───────────────┴──────────────┴────────────┴────────────┴───┴────────────────────┴──────────┴───────────────┤
│ 1 rows 12 columns (7 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
or with
"show tables") conn.sql(
┌─────────┐
│ name │
│ varchar │
├─────────┤
│ fitness │
└─────────┘
And you can see the available databases with
"show databases") conn.sql(
┌───────────────┐
│ database_name │
│ varchar │
├───────────────┤
│ test │
└───────────────┘
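Once the fitness table exists in the persisted database, you can query it like any other table. A brief sketch, continuing with the connection from above:
# Query the table we just created in test.db and return a Pandas DataFrame.
conn.sql("SELECT Age, Oxygen, RunTime FROM fitness WHERE Age > 50").df()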
You could have created the table in the database from the CSV file in a single step:
conn = duckdb.connect(database="ads5064.db")
conn.execute('CREATE TABLE fitness AS SELECT * FROM "../data/fitness.csv"')
Notice the use of single and double quotes to distinguish the string for the file name inside the SQL statement.
To close the connection from the Python session to the database, use
conn.close()
11.3 Cloud Data Resources
Cloud resources are increasingly important for data storage and analytics. The growth trends in the cloud are staggering. Storing data in cloud file, block, or object storage is a $78.6 billion market with a compound annual growth rate (CAGR) of 18.5% (2022 numbers). Cloud databases in 2023 are a $21.3 billion market growing at 22%.
You will encounter cloud-based data in your data science work, and you will take advantage of cloud resources, especially for handling and analyzing large data sets.
The most important cloud storage solutions for data are
- Object storage
- Cloud databases
- Cloud data warehouses (Snowflake, BigQuery)
- Cloud lakehouse (Databricks)
Object storage
Object, file, and block storage work differently. In object storage, data is managed as generic objects. These can be files or parts of files and combine the data to be stored with metadata about the object being stored. Metadata could be, for example, a unique identifier, size, creation date, time, and location for an image. Objects are stored in a flat structure; the object’s identifier is used to locate it in the storage system. Amazon Web Services (AWS) calls the container that objects are stored in a bucket (rather than a directory).
File storage is what you are used to from a directory-based file system on your computer’s operating system. Data is stored in files, organized in hierarchical folders. You look up data by traversing the directory until you have located the file with the correct name.
Block storage breaks data into fixed-sized blocks and stores the blocks as units. The Hadoop Distributed File System (HDFS) is a block-oriented storage system, where data is split into large fixed-size blocks (128 MB by default) and the blocks are replicated on multiple machines in a distributed system for durability and high availability.
Cloud object storage has nearly replaced other Big Data storage solutions such as Hadoop and HDFS. You can still find Hadoop environments in organizations; lots of money was invested in building those and organizations are slow to migrate off Hadoop. But the writing is on the wall: HDFS is being replaced with cloud storage, in particular cloud object storage, because it is cheaper and more durable. Since object storage supports storing files as objects, is the most economical storage alternative, and can scale out without limits, it is the primary storage form for file-based data in the cloud. Even cloud data warehouses have been built with cloud object storage as the primary storage layer.
The object storage systems of the three major cloud providers are Amazon S3 (Simple Storage Service) from AWS, Azure Blob Storage, and Google Cloud Storage. Common file formats you find in these storage systems are CSV, JSON, and Parquet.
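Pandas can read such objects directly from object storage via a URL. The following is only a sketch: the bucket and key are hypothetical, and it assumes the s3fs package is installed and AWS credentials are configured.
import pandas as pd

# Hypothetical bucket and object key; requires the s3fs package and AWS credentials.
df_s3 = pd.read_csv("s3://my-bucket/path/to/data.csv")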
Managed cloud database
TODO
MotherDuck serverless cloud database
The convenience of a serverless database is not having to worry about the cloud instances the database is running on. With a local, on-premises database, you manage the storage and compute infrastructure—the machines—the database is running on. With a managed service (DBaaS), this aspect is taken care of by the managed service provider. However, you still must manage your database instances. For example, during periods of low usage, you might want to scale down the database or even shut it down to reduce cloud expenses.
Example: Please Turn Off Your Instances
A common mistake by cloud newbies is to assume that just because you are not actively using the cloud, you are not incurring expenses. When you start an instance in the cloud you need to stop it when it is not in use. Otherwise, it keeps running. Even if there is no workload on the instance, it will cost you money.
A class at a major university—which shall not be named—used cloud resources and the instructors forgot to tell students to shut down their cloud instances. Instead, the students started new instances every time they needed to do cloud computing. The academic unit conducting the class had allocated a $40,000 cloud computing budget over the next three years. The class spent $140,000 in one month.
A serverless system automatically scales to zero when not in use for a while. With serverless systems you tend to pay only for what you use, when you use it. An exception is the charge for storing data; you cannot turn the storage off. The MotherDuck serverless database is a great complement to the DuckDB analytic database. In fact, MotherDuck is built on DuckDB and you can think of it as the cloud version of DuckDB. That makes MotherDuck a good choice for analytical workloads since DuckDB is a relational column-store, an analytic database.
Another nice feature of MotherDuck is the ability to connect to local databases and to cloud databases at the same time. This hybrid mode is useful if you have large data sets in cloud storage and small-to-large data sets stored locally. Since we already know how to work with DuckDB from Python, working with MotherDuck is a snap. The following statements create a connection to a MotherDuck serverless database:
import duckdb
con = duckdb.connect('md:')
Notice that we use the same duckdb library as previously. The special string 'md:' or 'motherduck:' inside the connect() function triggers a connection attempt to MotherDuck. Whether the connection attempt succeeds depends on whether you can authenticate to MotherDuck. By default, the connect() function will open a browser through which you can authenticate to the MotherDuck service using a previously established account.
Alternatively, you can click on Settings in the UI of the MotherDuck console at https://app.motherduck.com/ and download your authentication token. Store this token in the environment variable motherduck_token on your system and you can authenticate and connect directly to MotherDuck with the previous duckdb.connect()
statement. It is also possible to list the authentication token directly in the duckdb.connect()
function,
con = duckdb.connect('md:?motherduck_token=<token>')
but this is discouraged. Access credentials should not be shown in clear text in a source file that can easily be read and that might be shared with others. That is a security nightmare waiting to happen.
DuckDB and MotherDuck are relational database management systems; they manage more than one database. You can see the list of databases with the SHOW DATABASES command:
con.sql("SHOW DATABASES")
By default, MotherDuck creates a database called my_db
and a shared database with sample data. One of the cool features of MotherDuck is the ability to attach local databases to the instance. The following statement attaches a local version of the database for this course to MotherDuck.
"ATTACH 'ads5064.ddb'")
con.sql("SHOW DATABASES") con.sql(
Since MotherDuck manages multiple databases you have to tell it which database to work with. This is called the current database; by default it is my_db when you first connect. You can see which database is current with
con.sql("SELECT current_database()").show()
You can set any database as the current one with the USE statement:
"USE ads5064")
con.sql("SHOW TABLES") con.sql(
You can access a table in a database that is not the current database by using a two-level name (database.table). This even works if one table is local and the other is in a cloud database. Suppose that ads5064 is a local version of the database for the course and that ads5064_md
is stored in the cloud.
"SELECT count(*) FROM ads5064.apples")
con.sql("SELECT count(*) FROM ads5064_md.glaucoma") con.sql(
This comes in very handy if you want to process local and cloud data in the same query, for example when joining tables.
MotherDuck also supports shared read-only databases, to which you connect through a URL. The URL is created by an administrator of the database with the CREATE SHARE command. You can then attach the database by using the URL:
"ATTACH 'md:_share/ads5064_md/e00063c9-568c-45ec-a372-dca85f6915fd' as course_db") con.sql(
The URL is a link to a point-in-time snapshot of the database when it was shared. The administrator can update the share of the database; the link will remain valid.