11  Data Access

In this section we use Python libraries to access data in the major file formats presented in a previous chapter (CSV, JSON, Parquet, and ORC files) and convert the files into DataFrames. The pandas DataFrame format is the most common, but there are others, such as Polars, to consider.

Python has built-in support for some data formats, for example, csv.reader() for CSV files. In this chapter we use the pandas and PyArrow packages to read and write files in common formats. Pandas supports an impressive list of I/O formats, and data sets likely end up as pandas DataFrames anyway, so we might as well convert from the file format directly to a DataFrame. The PyArrow library is becoming increasingly important in data engineering for data science; pandas uses the PyArrow engine for several of the important formats, e.g., Parquet files. In pandas 2.0 you can choose whether to use a NumPy-based engine or a PyArrow-based engine for data manipulation and analysis.
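For example, assuming pandas 2.0 or later with pyarrow installed, you can ask read_csv() to parse a file with the PyArrow engine and store the columns with PyArrow-backed data types (a minimal sketch; the file path is illustrative):

import pandas as pd

# parse with the PyArrow engine and keep the columns as Arrow-backed dtypes
df = pd.read_csv("../datasets/iris.csv",
                 engine="pyarrow",
                 dtype_backend="pyarrow")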

Fortunately, the methods for reading data in Pandas and Polars are very similar, so it is relatively easy to switch from one DataFrame representation to another. For example, the following statements show how to read a CSV file into a Pandas and a Polars DataFrame.

import pandas as pd
import polars as pl

pandas_df = pd.read_csv("../datasets/iris.csv")
polars_df = pl.read_csv("../datasets/iris.csv")

Once you start working with DataFrames the syntax can be quite different, however.
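For example, filtering rows already looks quite different in the two libraries. The sketch below assumes the iris file has a column named sepal_length; adjust the name to your copy of the data:

# pandas: boolean indexing on the DataFrame
pandas_subset = pandas_df[pandas_df["sepal_length"] > 5.0]

# Polars: expression-based filtering with pl.col()
polars_subset = polars_df.filter(pl.col("sepal_length") > 5.0)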

11.1 Accessing Local Files

CSV files

To convert CSV files to a pandas DataFrame use the pandas read_csv() method. The default behavior is to look for a header line with column names, to interpret commas as field delimiters, to skip blank lines in the file, and to apply utf-8 encoding.

Consider the following file, fitness.csv:

data> head fitness.csv
Age,Weight,Oxygen,RunTime,RestPulse,RunPulse,MaxPulse
44,89.47,44.609,11.37,62,178,182
40,75.07,45.313,10.07,62,185,185
44,85.84,54.297, 8.65,45,156,168
42,68.15,59.571, 8.17,40,166,172
38,89.02,49.874, 9.22,55,178,180

The first line contains comma-separated names for the columns. The following lines contain the data for the attributes, one record per line. To convert the CSV file to a pandas DataFrame use the read_csv() method:

import numpy as np

df = pd.read_csv("../datasets/fitness.csv")
print("Default result from read_csv")
print(df.head())
df.describe(include=[np.float64])
Default result from read_csv
   Age  Weight  Oxygen  RunTime  RestPulse  RunPulse  MaxPulse
0   44   89.47  44.609    11.37         62       178       182
1   40   75.07  45.313    10.07         62       185       185
2   44   85.84  54.297     8.65         45       156       168
3   42   68.15  59.571     8.17         40       166       172
4   38   89.02  49.874     9.22         55       178       180
          Weight     Oxygen    RunTime
count  31.000000  31.000000  31.000000
mean   77.444516  47.375806  10.586129
std     8.328568   5.327231   1.387414
min    59.080000  37.388000   8.170000
25%    73.200000  44.964500   9.780000
50%    77.450000  46.774000  10.470000
75%    82.325000  50.131000  11.270000
max    91.630000  60.055000  14.030000
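Spelled out explicitly, the defaults described above correspond to the following call (a sketch; each keyword can be overridden when a file deviates from the defaults):

df = pd.read_csv("../datasets/fitness.csv",
                 sep=",",                 # field delimiter
                 header=0,                # first line holds the column names
                 skip_blank_lines=True,   # do not treat blank lines as records
                 encoding="utf-8")        # character encoding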

You can assign your own variable names with the names= option:

df2 = pd.read_csv("../datasets/fitness.csv", 
                  header=0,
                  names=["Age", "Wgt","Oxy","RT","ReP","RuP","MP" ])
df2.head()
   Age    Wgt     Oxy     RT  ReP  RuP   MP
0   44  89.47  44.609  11.37   62  178  182
1   40  75.07  45.313  10.07   62  185  185
2   44  85.84  54.297   8.65   45  156  168
3   42  68.15  59.571   8.17   40  166  172
4   38  89.02  49.874   9.22   55  178  180

String variables are indicated in the CSV file with quotes:

data>  head herding-cats.csv
"address_full","street","coat","sex","age","weight","fixed","wander_dist","roamer","cat_id"
"15 Fillmer Ave Los Gatos 95030","15 Fillmer Ave","brown","male",5.594,5.016,1,0.115,"yes",1
"244 Harding Ave Los Gatos 95030","244 Harding Ave","tabby","male",6.852,6.314,1,0.129,"yes",2
"16570 Marchmont Dr Los Gatos 95032","16570 Marchmont Dr","maltese","female",4.081, 2.652,0, 0.129,"yes",3
"100 Stonybrook Rd Los Gatos 95032","100 Stonybrook Rd","tabby","female",3.806,3.413,1, 0.107,"yes",4
cats = pd.read_csv("../datasets/herding-cats.csv",
                   index_col="cat_id")
print(cats.head())
                              address_full              street     coat  \
cat_id                                                                    
1           15 Fillmer Ave Los Gatos 95030      15 Fillmer Ave    brown   
2          244 Harding Ave Los Gatos 95030     244 Harding Ave    tabby   
3       16570 Marchmont Dr Los Gatos 95032  16570 Marchmont Dr  maltese   
4        100 Stonybrook Rd Los Gatos 95032   100 Stonybrook Rd    tabby   
5           266 Kennedy Rd Los Gatos 95032      266 Kennedy Rd   calico   

           sex    age  weight  fixed  wander_dist roamer  
cat_id                                                    
1         male  5.594   5.016      1        0.115    yes  
2         male  6.852   6.314      1        0.129    yes  
3       female  4.081   2.652      0        0.129    yes  
4       female  3.806   3.413      1        0.107    yes  
5         male  5.164   6.820      1        0.039     no  

Pandas is smart about assigning data types to columns; note that string columns are stored with data type object.

cats.info()
<class 'pandas.core.frame.DataFrame'>
Index: 400 entries, 1 to 400
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   address_full  400 non-null    object 
 1   street        394 non-null    object 
 2   coat          400 non-null    object 
 3   sex           400 non-null    object 
 4   age           400 non-null    float64
 5   weight        400 non-null    float64
 6   fixed         400 non-null    int64  
 7   wander_dist   400 non-null    float64
 8   roamer        400 non-null    object 
dtypes: float64(3), int64(1), object(5)
memory usage: 31.2+ KB

You can overwrite the data types by supplying either a default type or a dictionary of types. To change the data type for the fixed column, for example:

cats = pd.read_csv("../datasets/herding-cats.csv",
                   dtype={'fixed':np.float16})
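To apply one default type to all columns instead of a per-column dictionary, pass a single dtype (a sketch; reading everything as strings is sometimes useful for messy files):

# every column is read as a string; convert selected columns afterwards
cats_str = pd.read_csv("../datasets/herding-cats.csv", dtype=str)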

Missing values (see below) can be indicated in CSV files in several ways. A common technique is to leave the entry for the unobserved value empty (no spaces). In the following file the values for sale and appraisal are missing in the third data row. The fifth data row contains values for land and sale only.

data> head landsales2.csv
land, improve, total, sale, appraisal
30000,64831,94831,118500,1.25
30000,50765,80765,93900,1.16
46651,18573,65224,,
45990,91402,137392,184000,1.34
42394,,,168000,
,133351,,169000,
63596,2182,65778,,
56658,153806,210464,255000,1.21

read_csv() converts this file correctly, using the NaN (not-a-number) sentinel for the unobserved data:

land = pd.read_csv("../datasets/landsales2.csv")
land
        land   improve     total      sale  appraisal
0    30000.0   64831.0   94831.0  118500.0       1.25
1    30000.0   50765.0   80765.0   93900.0       1.16
2    46651.0   18573.0   65224.0       NaN        NaN
3    45990.0   91402.0  137392.0  184000.0       1.34
4    42394.0       NaN       NaN  168000.0        NaN
5        NaN  133351.0       NaN  169000.0        NaN
6    63596.0    2182.0   65778.0       NaN        NaN
7    56658.0  153806.0  210464.0  255000.0       1.21
8    51428.0   72451.0  123879.0       NaN        NaN
9    93200.0       NaN       NaN  422000.0        NaN
10   76125.0   78172.0  275297.0  290000.0       1.14
11  154360.0   61934.0  216294.0  237000.0       1.10
12   65376.0       NaN       NaN  286500.0        NaN
13   42400.0       NaN       NaN       NaN        NaN
14   40800.0   92606.0  133406.0  168000.0       1.26

In addition to empty entries, read_csv() interprets a number of entries as missing (unobserved) values: “#N/A”, “#N/A N/A”, “#NA”, “-1.#IND”, “-1.#QNAN”, “-NaN”, “-nan”, “1.#IND”, “1.#QNAN”, “<NA>”, “N/A”, “NA”, “NULL”, “NaN”, “None”, “n/a”, “nan”, “null”. Nevertheless, CSV files might use other special values to indicate missing data, for example, when they are exported from other software packages. Suppose that missing values are indicated by a dot (“.”) and that the columns in the CSV file are padded with white space:

data> head landsales.csv
land, improve, total, sale, appraisal
30000,     64831,     94831,    118500,   1.25
30000,     50765,     80765,     93900,   1.16
46651,     18573,     65224,         .,    .
45990,     91402,    137392,    184000,   1.34
42394,         .,         .,    168000,    .
    .,    133351,         .,    169000,    .
63596,      2182,     65778,         .,    .
56658,    153806,    210464,    255000,   1.21
51428,     72451,    123879,         .,    .

The file is much more human-readable but requires additional options to convert it to the intended DataFrame:

land2 = pd.read_csv("../datasets/landsales.csv", 
                   na_values=".",
                   skipinitialspace=True)
land2.head()
      land   improve     total      sale  appraisal
0  30000.0   64831.0   94831.0  118500.0       1.25
1  30000.0   50765.0   80765.0   93900.0       1.16
2  46651.0   18573.0   65224.0       NaN        NaN
3  45990.0   91402.0  137392.0  184000.0       1.34
4  42394.0       NaN       NaN  168000.0        NaN

The na_values= option specifies additional values that should be recognized as missing values. The skipinitialspace= option tells read_csv() to ignore spaces following the delimiters. The result is as intended.
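The na_values= option also accepts a list of sentinel strings, or a dictionary mapping column names to sentinels. A sketch with a list (the extra markers are illustrative):

land3 = pd.read_csv("../datasets/landsales.csv",
                    na_values=[".", "missing", "-999"],   # all treated as NaN
                    skipinitialspace=True)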

JSON files

Pandas has great support for reading and writing JSON files. A general issue is that JSON is a hierarchical data format that allows nested data structures, whereas pandas DataFrames have a row-and-column layout. Fortunately, the pandas read_json() method has many options to shape nested JSON structures into DataFrames.

You can convert any JSON document into a Python dictionary with the load() function in the json module. Suppose we want to read the following document, stored in ../datasets/JSON/simple1.json, into Python:

> cat ../datasets/JSON/simple1.json
{
    "firstName": "Jane",
    "lastName": "Doe",
    "hobbies": ["running", "sky diving", "singing"],
    "age": 35,
    "children": [
        {
            "firstName": "Alice",
            "age": 6
        },
        {
            "firstName": "Bob",
            "age": 8
        }
    ]
}
import numpy as np
import pandas as pd
import json
with open("../datasets/JSON/simple1.json","r") as f:
    data = json.load(f)
data
{'firstName': 'Jane',
 'lastName': 'Doe',
 'hobbies': ['running', 'sky diving', 'singing'],
 'age': 35,
 'children': [{'firstName': 'Alice', 'age': 6},
  {'firstName': 'Bob', 'age': 8}]}

The next example reads a record-oriented JSON file (../datasets/JSON/simple.json) directly into a DataFrame. Each record consists of fields id, name, math, statistics, and weight.

cat simple.json 
[
  {
    "id": "A00123",
    "name": "David",
    "math": 70,
    "statistics": 86,
    "weight": 156.3
  },
  {
    "id": "B00422",
    "name": "Andrew",
    "math": 89,
    "statistics": 80,
    "weight": 210.6
  },
  {
    "id": "C004543",
    "name": "Tobias",
    "math": 79,
    "statistics": 90,
    "weight": 167.0
  }
]
df = pd.read_json("../datasets/JSON/simple.json")
df.info()
df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          3 non-null      object 
 1   name        3 non-null      object 
 2   math        3 non-null      int64  
 3   statistics  3 non-null      int64  
 4   weight      3 non-null      float64
dtypes: float64(1), int64(2), object(2)
memory usage: 252.0+ bytes
        id    name  math  statistics  weight
0   A00123   David    70          86   156.3
1   B00422  Andrew    89          80   210.6
2  C004543  Tobias    79          90   167.0

If you’d rather assign data type string to the id and name fields instead of data type object, you can set the data types with the dtype= option:

df2 = pd.read_json("../datasets/JSON/simple.json", 
                   dtype= {"id": "string", "name" : "string"})
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   id          3 non-null      string 
 1   name        3 non-null      string 
 2   math        3 non-null      int64  
 3   statistics  3 non-null      int64  
 4   weight      3 non-null      float64
dtypes: float64(1), int64(2), string(2)
memory usage: 252.0 bytes

How can we handle a nested JSON structure? The data in ../datasets/JSON/nested_array.json is

data> cat nested_array.json
{
    "school_name": "Virginia Tech",
    "class": "Class of 2026",
    "students": [
      {
        "id": "A00123",
        "name": "David",
        "math": 70,
        "statistics": 86,
        "weight": 156.3
      },
      {
        "id": "B00422",
        "name": "Andrew",
        "math": 89,
        "statistics": 80,
        "weight": 210.6
      },
      {
        "id": "C004543",
        "name": "Tobias",
        "math": 79,
        "statistics": 90,
        "weight": 167.0
      }]
}

The students field has a record structure, but there are additional fields, school_name and class, that do not fit that structure. Converting the JSON file directly into a DataFrame does not produce the desired result:

df3 = pd.read_json('../datasets/JSON/nested_array.json')
df3
     school_name          class                                           students
0  Virginia Tech  Class of 2026  {'id': 'A00123', 'name': 'David', 'math': 70, ...
1  Virginia Tech  Class of 2026  {'id': 'B00422', 'name': 'Andrew', 'math': 89,...
2  Virginia Tech  Class of 2026  {'id': 'C004543', 'name': 'Tobias', 'math': 79...

To create the appropriate DataFrame we take two steps: convert the JSON file into a dictionary, then shape the dictionary into a DataFrame with json_normalize():

with open('../datasets/JSON/nested_array.json','r') as f:
    data = json.load(f)

df4 = pd.json_normalize(data, 
                        record_path =["students"],
                        meta=["school_name", "class"])
df4
        id    name  math  statistics  weight    school_name          class
0   A00123   David    70          86   156.3  Virginia Tech  Class of 2026
1   B00422  Andrew    89          80   210.6  Virginia Tech  Class of 2026
2  C004543  Tobias    79          90   167.0  Virginia Tech  Class of 2026

The meta= option lists the JSON fields whose values are repeated for each record in the result table. The record_path= option points at the field that contains the list of records.

Now suppose that in addition to the list of records the JSON metadata contains a more complex structure:

cat nested_obj_array.json
{
   "school_name": "Virginia Tech",
   "class": "Class of 2026",
    "info": {
      "president": "Timothy D. Sands",
      "address": "Office of the President",
      "social": {
        "LinkedIn": "tim-sands-49b8b95",
        "Twitter": "@VTSandsman"
      }
    },
    "students": [
      {
        "id": "A00123",
        "name": "David",
        "math": 70,
        "statistics": 86,
        "weight": 156.3
      },
. . .
      }]
}

You can specify which fields to extract from the metadata in the meta= option of json_normalize():

with open('../datasets/JSON/nested_obj_array.json','r') as f:
    data = json.load(f)

df5 = pd.json_normalize(data,
                       record_path =["students"], 
                       meta=["school_name","class",
                             ["info", "president"], 
                             ["info", "social", "Twitter"]])
df5
        id    name  math  statistics  weight    school_name          class    info.president  info.social.Twitter
0   A00123   David    70          86   156.3  Virginia Tech  Class of 2026  Timothy D. Sands          @VTSandsman
1   B00422  Andrew    89          80   210.6  Virginia Tech  Class of 2026  Timothy D. Sands          @VTSandsman
2  C004543  Tobias    79          90   167.0  Virginia Tech  Class of 2026  Timothy D. Sands          @VTSandsman

How are missing values handled? In the next example, the first student has a complete record. The second student is missing a statistics score, and the third student has no weight data. pd.json_normalize() fills in NaN for the unobserved data.

data = [
    {
        "students": [
        {
            "id": "A00123",
            "name": "David",
            "math": 70,
            "statistics": 86,
            "weight": 156.3
        },
        {
            "id": "B00422",
            "name": "Andrew",
            "math": 89,
            "weight": 210.6
          },
        {
            "id": "C004543",
            "name": "Tobias",
            "math": 79,
            "statistics": 90,
        }
        ]
}
]

df6 = pd.json_normalize(data, record_path =["students"])
df6
        id    name  math  statistics  weight
0   A00123   David    70        86.0   156.3
1   B00422  Andrew    89         NaN   210.6
2  C004543  Tobias    79        90.0     NaN

json_normalize() is less forgiving if fields in the metadata are unobserved; it will throw an error. To prevent the error and assign NaN values, use the errors='ignore' option. In the following example the teachers metadata field does not always have an l_name entry.

data = [
    { 
        'class': 'STAT 5525', 
        'student count': 60, 
        'room': 'Yellow',
        'teachers': { 
                'f_name': 'Elon', 
                'l_name': 'Musk',
            },
        'students': [
            { 'name': 'Tom', 'sex': 'M' },
            { 'name': 'James', 'sex': 'M' },
        ]
    },
    { 
        'class': 'STAT 5526', 
        'student count': 45, 
        'room': 'Blue',
         'teachers': { 
                'f_name': 'Albert'
            },
        'students': [
            { 'name': 'Tony', 'sex': 'M' },
            { 'name': 'Jacqueline', 'sex': 'F' },
        ]
    },
]

df7 = pd.json_normalize(data, 
                        record_path =["students"], 
                        meta=["class", "room", ["teachers", "l_name"]],
                        errors='ignore')

df7
         name sex      class    room teachers.l_name
0         Tom   M  STAT 5525  Yellow            Musk
1       James   M  STAT 5525  Yellow            Musk
2        Tony   M  STAT 5526    Blue             NaN
3  Jacqueline   F  STAT 5526    Blue             NaN

Parquet files

To read Parquet files into pandas DataFrames we have two choices: the pandas read_parquet() method or the read_table() method from pyarrow.parquet.

The data we work with in this section is stored in a parquet multi-file structure; that means the data are split into multiple files stored in a local directory:

Parquet> ll userdata
total 1120
-rw-r--r--@ 1 olivers  staff   111K Aug 31 14:36 userdata1.parquet
-rw-r--r--@ 1 olivers  staff   110K Aug 31 14:36 userdata2.parquet
-rw-r--r--@ 1 olivers  staff   111K Aug 31 14:37 userdata3.parquet
-rw-r--r--@ 1 olivers  staff   110K Aug 31 14:37 userdata4.parquet
-rw-r--r--@ 1 olivers  staff   111K Aug 31 14:37 userdata5.parquet

Each of the *.parquet files in the directory contains 1,000 observations.

Pandas’ read_parquet() reads the entire directory and combines the records from the multiple files into a single DataFrame.

df = pd.read_parquet("../datasets/Parquet/userdata", engine="pyarrow")
df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   registration_dttm  5000 non-null   datetime64[ns]
 1   id                 4999 non-null   float64       
 2   first_name         5000 non-null   object        
 3   last_name          5000 non-null   object        
 4   email              5000 non-null   object        
 5   gender             5000 non-null   object        
 6   ip_address         5000 non-null   object        
 7   cc                 5000 non-null   object        
 8   country            5000 non-null   object        
 9   birthdate          5000 non-null   object        
 10  salary             4689 non-null   float64       
 11  title              5000 non-null   object        
 12  comments           4966 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(10)
memory usage: 507.9+ KB
                   registration_dttm           id         salary
count                           5000  4999.000000    4689.000000
mean   2016-02-03 21:04:13.699399936   500.598720  150772.222890
min              2016-02-03 00:00:07     1.000000   12068.960000
25%    2016-02-03 08:51:55.750000128   251.000000   83480.380000
50%              2016-02-03 20:01:17   501.000000  152877.190000
75%       2016-02-04 07:42:54.500000   750.500000  215405.220000
max              2016-02-04 23:59:55  1000.000000  286735.820000
std                              NaN   288.648331   78171.513062

The size of df confirms that all 5,000 observations were retrieved from the five component files.
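A quick check of the shape (the values match the info() output above):

df.shape    # (5000, 13)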

With pyarrow we have more control over the files; for example, we can inquire about the schema and metadata of the component files.

import pyarrow.parquet as pq

pq_file = pq.ParquetFile("../datasets/Parquet/userdata/userdata1.parquet")
pq_file.metadata
pq_file.schema
<pyarrow._parquet.ParquetSchema object at 0x147814b00>
required group field_id=-1 hive_schema {
  optional int96 field_id=-1 registration_dttm;
  optional int32 field_id=-1 id;
  optional binary field_id=-1 first_name (String);
  optional binary field_id=-1 last_name (String);
  optional binary field_id=-1 email (String);
  optional binary field_id=-1 gender (String);
  optional binary field_id=-1 ip_address (String);
  optional binary field_id=-1 cc (String);
  optional binary field_id=-1 country (String);
  optional binary field_id=-1 birthdate (String);
  optional double field_id=-1 salary;
  optional binary field_id=-1 title (String);
  optional binary field_id=-1 comments (String);
}
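The metadata object exposes, among other attributes, the number of rows, row groups, and columns in the file. A sketch; for userdata1.parquet the row count should come back as 1,000:

meta = pq_file.metadata
print(meta.num_rows)          # rows in this component file
print(meta.num_row_groups)    # Parquet row groups
print(meta.num_columns)       # columns in the schema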

To convert all component files into a pandas DataFrame with pyarrow, use the read_table() method and convert the result with to_pandas():

df2 = pq.read_table(source="../datasets/Parquet/userdata/").to_pandas()
df2.describe()
                   registration_dttm           id         salary
count                           5000  4999.000000    4689.000000
mean   2016-02-03 21:04:13.699399936   500.598720  150772.222890
min              2016-02-03 00:00:07     1.000000   12068.960000
25%    2016-02-03 08:51:55.750000128   251.000000   83480.380000
50%              2016-02-03 20:01:17   501.000000  152877.190000
75%       2016-02-04 07:42:54.500000   750.500000  215405.220000
max              2016-02-04 23:59:55  1000.000000  286735.820000
std                              NaN   288.648331   78171.513062

ORC files

Pandas (and pyarrow) supports ORC files, but working with them is not quite as convenient as with Parquet files. The read_orc() method reads one ORC file rather than an entire directory, and it is not supported on Windows.

df = pd.read_orc("../datasets/ORC/userdata/userdata1.orc")

When data is stored in a multi-file format, as in the Parquet example, you need to concatenate the file contents into a single pandas DataFrame yourself. The glob module is helpful for retrieving the list of file names that match a pattern.

import glob

folder_path = "../datasets/ORC/userdata"
file_list = glob.glob(folder_path + "/*.orc")

# read the first file, then append the contents of the remaining files
df_from_orc = pd.read_orc(file_list[0])
for file in file_list[1:]:
    df_from_orc = pd.concat([df_from_orc, pd.read_orc(file)], axis=0)

You can also use pyarrow's read_table() method for ORC files to read an ORC file into a pyarrow table, and to_pandas() to convert the table to a pandas DataFrame:

from pyarrow import orc

df = orc.read_table(source="../datasets/ORC/userdata/userdata1.orc").to_pandas()
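If you prefer to avoid the manual loop over component files, recent versions of pyarrow can scan a directory of ORC files through the pyarrow.dataset module (a sketch, assuming your pyarrow build includes ORC support):

import pyarrow.dataset as ds

# treat the directory as one logical dataset and materialize it as pandas
orc_ds = ds.dataset("../datasets/ORC/userdata", format="orc")
df_all = orc_ds.to_table().to_pandas()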

11.2 Working with a Database

As a data scientist you need to be comfortable working with databases. In analytical work, you will encounter schema-centric relational OLAP databases that use SQL (structured query language) to interact with the database and the data. SQL interfaces are standard for data warehouses and common for many data platforms and tools.

Data scientists generally prefer writing Python or R code over SQL. But SQL is extremely powerful for expressing relational concepts and can make some data processing tasks much easier than rolling your own code. Also, databases are optimized for joins, merges, deep queries, aggregations, indexes, handling JSON, and more. It would be a shame not to take advantage of that power or to lose it behind the syntactic sugar of a tool’s API. Many employers have voiced the concern that SQL ranks high among the technical skills data scientists lack. As the founder and chairman of a data science research and consultancy company put it:

There are a lot of things folks don’t agree on. But everyone agrees that as a data scientist you will need to retrieve data from a database, and you will need version control. Learn SQL and learn git.

We cover SQL basics in Chapter 15, after sprinkling some SQL statements along the way.

If you want to work with SQL directly from Python, there are several alternatives.

SQLAlchemy

SQLAlchemy is a SQL toolkit for Python that gives Python developers the power of SQL. While some tools try to hide SQL and relational details behind syntactic sugar, SQLAlchemy’s philosophy is to fully expose the relational details of SQL within transparent tools. Specific databases, beyond those for which support is built into SQLAlchemy, are supported as dialects.
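A minimal sketch of the workflow, assuming a SQLite file example.db that contains a fitness table (both names are illustrative):

import pandas as pd
from sqlalchemy import create_engine, text

# the connection URL selects the dialect and the database
engine = create_engine("sqlite:///example.db")

with engine.connect() as conn:
    rows = conn.execute(text("SELECT * FROM fitness WHERE Age > 50")).fetchall()

# pandas can also read a query result directly into a DataFrame
df = pd.read_sql("SELECT * FROM fitness WHERE Age > 50", engine)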

Ibis

Ibis is a portable Python dataframe library that supports multiple database backends. After making a connection to a backend database, you interact with the data using Python syntax rather than SQL statements. The Ibis function calls are translated into SQL for the specific backend under the covers. For example, the first statement in the next code block creates a Python object named penguins through which you can interact with the database table of the same name. The second statement filters the data for a specific species.

penguins = con.table("penguins")
penguins.filter(penguins.species == "Adelie")

The SQL equivalent would be a SELECT statement with a WHERE clause:

SELECT * FROM penguins WHERE species = 'Adelie'

Having stable Python code for different database backends makes it easy to move from one database to another. During development of analytic code you might be working with SQLite and the production database is Oracle. The same Python code will work in both environments.

The default backend of Ibis is the embedded database DuckDB. If you are working with DuckDB you can issue SQL statements directly from Python or you can rely on its dedicated Python API.
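A sketch of how the con object in the earlier snippet could be created with the DuckDB backend (the database file and table names are illustrative):

import ibis

# connect to a local DuckDB database file
con = ibis.duckdb.connect("penguins.ddb")

penguins = con.table("penguins")
adelie = penguins.filter(penguins.species == "Adelie")

# nothing has run yet; execute() translates the expression to SQL
# for the backend and returns a pandas DataFrame
adelie_df = adelie.execute()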

DuckDB

DuckDB has all the ingredients we are looking for in a database for data science work:

  • Designed to support analytical queries (OLAP workloads) characterized by complex queries that process large amounts of data.

  • Columnar, vectorized storage engine. Columnar storage is more efficient for analytical queries, and vectorized query execution is much faster than processing rows one at a time.

  • Provides transactional guarantees (ACID properties).

  • Deeply integrated with Python and R for efficient data analysis.

  • Easy to install and operate, without any external dependencies.

  • Open-source database that is free to use.

  • Can query CSV, JSON, Parquet files and DataFrames directly.

  • Supports a streaming model for analyzing large data sets that exceed memory capacity, e.g., streaming from Parquet files.

  • In-memory database with graceful on-disk operation.

  • Easy transition to the cloud through partnership with MotherDuck.

  • SQL code can be submitted from the CLI or through the Python API, so you can take advantage of the power of SQL.

  • Runs as an embedded database within a host process.

Because DuckDB is an embedded database, there is no database server to install and to maintain. Because the database runs within the host process, data can be transferred very quickly. And if you want to move the data to the cloud, you can do that easily with the MotherDuck serverless platform.

The Python package for DuckDB can run queries directly against pandas DataFrames without copying any data.
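For example (a sketch), a pandas DataFrame defined in the Python session can be referenced by its variable name inside a DuckDB query:

import duckdb
import pandas as pd

fitness_df = pd.read_csv("../datasets/fitness.csv")

# DuckDB resolves fitness_df by looking up the Python variable
duckdb.sql("SELECT Age, Oxygen FROM fitness_df WHERE Age > 50").df()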

Why is it called DuckDB? We let the folks who created it explain it:

Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck’s song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system. Also the logo designs itself.

For the Python user, installation is as simple as with any other Python package:

pip install duckdb

If you install with conda, the command is

conda install python-duckdb -c conda-forge

DuckDB is an in-memory database. If you want to persist the database across Python sessions, then you need to create a connection that stores the data in a single file on disk. Otherwise, the database will go away when the Python session ends. You can work without a connection to a database by using methods on the duckdb module. For example, the following statements directly query a CSV file using DuckDB:

import duckdb
duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
┌───────┬────────┬────────┬─────────┬───────────┬──────────┬──────────┐
│  Age  │ Weight │ Oxygen │ RunTime │ RestPulse │ RunPulse │ MaxPulse │
│ int64 │ double │ double │ double  │   int64   │  int64   │  int64   │
├───────┼────────┼────────┼─────────┼───────────┼──────────┼──────────┤
│    54 │  83.12 │ 51.855 │   10.33 │        50 │      166 │      170 │
│    51 │  69.63 │ 40.836 │   10.95 │        57 │      168 │      172 │
│    51 │  77.91 │ 46.672 │    10.0 │        48 │      162 │      168 │
│    57 │  73.37 │ 39.407 │   12.63 │        58 │      174 │      176 │
│    54 │  79.38 │  46.08 │   11.17 │        62 │      156 │      165 │
│    52 │  76.32 │ 45.441 │    9.63 │        48 │      164 │      166 │
│    51 │  67.25 │ 45.118 │   11.08 │        48 │      172 │      172 │
│    54 │  91.63 │ 39.203 │   12.88 │        44 │      168 │      172 │
│    51 │  73.71 │  45.79 │   10.47 │        59 │      186 │      188 │
│    57 │  59.08 │ 50.545 │    9.93 │        49 │      148 │      155 │
│    52 │  82.78 │ 47.467 │    10.5 │        53 │      170 │      172 │
├───────┴────────┴────────┴─────────┴───────────┴──────────┴──────────┤
│ 11 rows                                                   7 columns │
└─────────────────────────────────────────────────────────────────────┘

The object returned by duckdb is a relation, a symbolic representation of the query that can be referenced in other queries. The next statements save the result of the previous query and reference the relation in the second query:

rel = duckdb.sql('select * from "../datasets/fitness.csv" where Age > 50')
duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49")
┌────────┬─────────┬───────────┐
│ Oxygen │ RunTime │ RestPulse │
│ double │ double  │   int64   │
├────────┼─────────┼───────────┤
│ 46.672 │    10.0 │        48 │
│ 45.441 │    9.63 │        48 │
│ 45.118 │   11.08 │        48 │
│ 39.203 │   12.88 │        44 │
└────────┴─────────┴───────────┘

The results of a DuckDB query can be easily converted into other formats. fetchnumpy() fetches data as a dictionary of NumPy arrays, df() fetches a Pandas DataFrame, pl() fetches a Polars DataFrame, and arrow() fetches an Arrow table.

duckdb.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
   Oxygen  RunTime  RestPulse
0  46.672    10.00         48
1  45.441     9.63         48
2  45.118    11.08         48
3  39.203    12.88         44

Reading files

To read the major data science file formats into DuckDB, use the read_csv(), read_parquet(), and read_json() methods; Arrow tables can be brought in with from_arrow(). For example, to read the simple JSON file from the previous section:

duckdb.read_json('../datasets/JSON/simple.json')
┌─────────┬─────────┬───────┬────────────┬────────┐
│   id    │  name   │ math  │ statistics │ weight │
│ varchar │ varchar │ int64 │   int64    │ double │
├─────────┼─────────┼───────┼────────────┼────────┤
│ A00123  │ David   │    70 │         86 │  156.3 │
│ B00422  │ Andrew  │    89 │         80 │  210.6 │
│ C004543 │ Tobias  │    79 │         90 │  167.0 │
└─────────┴─────────┴───────┴────────────┴────────┘

To read a directory of Parquet files, you can use wildcard syntax in the path of the read_parquet() function:

duckdb.read_parquet('../datasets/Parquet/userdata/user*.parquet')
┌─────────────────────┬───────┬────────────┬───┬────────────┬───────────┬──────────────────────┬──────────────────────┐
│  registration_dttm  │  id   │ first_name │ … │ birthdate  │  salary   │        title         │       comments       │
│      timestamp      │ int32 │  varchar   │   │  varchar   │  double   │       varchar        │       varchar        │
├─────────────────────┼───────┼────────────┼───┼────────────┼───────────┼──────────────────────┼──────────────────────┤
│ 2016-02-03 16:07:46 │     1 │ Ernest     │ … │            │ 140639.36 │                      │                      │
│ 2016-02-03 21:52:07 │     2 │ Anthony    │ … │ 1/16/1998  │ 172843.61 │ Developer II         │ 👾 🙇 💁 🙅 🙆 🙋 …  │
│ 2016-02-03 02:22:19 │     3 │ Ryan       │ … │ 11/21/1978 │ 204620.66 │ Developer I          │ ␢                    │
│ 2016-02-03 04:20:04 │     4 │ Brenda     │ … │ 10/29/1998 │ 260474.12 │ GIS Technical Arch…  │                      │
│ 2016-02-03 00:15:16 │     5 │ Jacqueline │ … │ 7/12/1959  │ 286038.78 │ Marketing Assistant  │                      │
│ 2016-02-03 19:48:14 │     6 │ Paul       │ … │            │ 241518.24 │                      │                      │
│ 2016-02-03 08:59:05 │     7 │ Linda      │ … │ 3/30/1988  │ 192756.38 │ Professor            │                      │
│ 2016-02-03 08:04:51 │     8 │ Frances    │ … │            │ 188511.28 │                      │ <svg><script>0<1>a…  │
│ 2016-02-03 08:12:33 │     9 │ Jason      │ … │ 7/29/1982  │ 238068.56 │ Web Designer III     │                      │
│ 2016-02-03 17:08:02 │    10 │ Carolyn    │ … │ 4/28/1977  │ 132718.26 │ Research Nurse       │                      │
│          ·          │     · │  ·         │ · │     ·      │     ·     │       ·              │          ·           │
│          ·          │     · │  ·         │ · │     ·      │     ·     │       ·              │          ·           │
│          ·          │     · │  ·         │ · │     ·      │     ·     │       ·              │          ·           │
│ 2016-02-04 04:48:13 │   991 │ Fred       │ … │ 1/25/1975  │ 280835.07 │ Cost Accountant      │                      │
│ 2016-02-04 00:03:33 │   992 │ Anna       │ … │ 5/29/1962  │ 286181.88 │ Automation Special…  │                      │
│ 2016-02-04 07:05:48 │   993 │ Donna      │ … │ 6/3/1962   │  245704.0 │ Senior Financial A…  │                      │
│ 2016-02-04 21:15:41 │   994 │ Annie      │ … │            │ 194931.47 │                      │                      │
│ 2016-02-04 05:05:12 │   995 │ Carlos     │ … │ 8/11/1986  │  39424.88 │ Teacher              │                      │
│ 2016-02-04 21:45:05 │   996 │ Stephanie  │ … │ 5/9/1962   │  273504.1 │ Chief Design Engin…  │                      │
│ 2016-02-04 09:20:21 │   997 │ Kathy      │ … │ 12/4/1986  │ 109525.48 │ Director of Sales    │                      │
│ 2016-02-04 00:26:05 │   998 │ Louis      │ … │ 11/20/1982 │  13134.47 │ Office Assistant IV  │                      │
│ 2016-02-04 04:50:33 │   999 │ Elizabeth  │ … │ 3/14/1976  │ 207194.65 │ Research Assistant I │                      │
│ 2016-02-04 16:14:42 │  1000 │ Susan      │ … │ 12/19/1999 │ 229961.89 │ Human Resources Ma…  │                      │
├─────────────────────┴───────┴────────────┴───┴────────────┴───────────┴──────────────────────┴──────────────────────┤
│ 5000 rows (20 shown)                                                                           13 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Connecting to the database

If you make an explicit connection to the database with the connect() method, the arguments determine whether the database resides completely in memory—and is destroyed when the Python process exits—or whether the database persists on disk. The following statements are equivalent and create a non-persistent database:

conn = duckdb.connect(database=":default:")
conn = duckdb.connect(database=":memory:")
conn = duckdb.connect(":memory:")
conn = duckdb.connect()

Once the connection is established you use the same methods on the connection object as on the duckdb object:

conn = duckdb.connect()
conn.sql("select Oxygen, RunTime, RestPulse from rel where RestPulse < 49").df()
   Oxygen  RunTime  RestPulse
0  46.672    10.00         48
1  45.441     9.63         48
2  45.118    11.08         48
3  39.203    12.88         44

To create or access a persisted database, specify the name of the database using a .db, .ddb, or .duckdb extension. The following statements create or open the database test.db, read a CSV file into a pandas DataFrame, and create the table fitness in the database from the DataFrame.

conn = duckdb.connect(database="test.db")
df = conn.read_csv('../datasets/fitness.csv').df()
conn.execute('CREATE OR REPLACE TABLE fitness AS SELECT * FROM df')
<duckdb.duckdb.DuckDBPyConnection at 0x168f82270>

You can see the tables in a database by querying the information schema:

conn.sql('select * from information_schema.tables')
┌───────────────┬──────────────┬────────────┬────────────┬───┬────────────────────┬──────────┬───────────────┐
│ table_catalog │ table_schema │ table_name │ table_type │ … │ is_insertable_into │ is_typed │ commit_action │
│    varchar    │   varchar    │  varchar   │  varchar   │   │      varchar       │ varchar  │    varchar    │
├───────────────┼──────────────┼────────────┼────────────┼───┼────────────────────┼──────────┼───────────────┤
│ test          │ main         │ fitness    │ BASE TABLE │ … │ YES                │ NO       │ NULL          │
├───────────────┴──────────────┴────────────┴────────────┴───┴────────────────────┴──────────┴───────────────┤
│ 1 rows                                                                                12 columns (7 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

or with

conn.sql("show tables")
┌─────────┐
│  name   │
│ varchar │
├─────────┤
│ fitness │
└─────────┘

And you can see the available databases with

conn.sql("show databases")
┌───────────────┐
│ database_name │
│    varchar    │
├───────────────┤
│ test          │
└───────────────┘

You could have created the table in the database from the CSV file in a single step:

conn = duckdb.connect(database="ads5064.db")
conn.execute('CREATE TABLE fitness AS SELECT * FROM "../datasets/fitness.csv"')

Notice the use of single and double quotes to distinguish the string for the file name inside the SQL statement.
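Alternatively (a sketch), you can let DuckDB's read_csv_auto() table function do the reading inside the SQL statement; the file name is then a single-quoted SQL string, so the outer Python string uses double quotes:

conn.execute(
    "CREATE OR REPLACE TABLE fitness AS "
    "SELECT * FROM read_csv_auto('../datasets/fitness.csv')"
)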

To close the connection from the Python session to the database, use

conn.close()

11.3 Cloud Data Resources

Cloud resources are increasingly important for data storage and analytics. The growth trends in the cloud are staggering. Storing data in cloud file, block, or object storage is a $78.6 billion market with a compound annual growth rate (CAGR) of 18.5% (2022 numbers). Cloud databases in 2023 are a $21.3 billion market growing at 22% per year.

You will encounter cloud-based data in your data science work, and you will take advantage of cloud resources, especially for handling and analyzing large data sets.

The most important cloud storage solutions for data are

  • Object storage
  • Cloud databases
  • Cloud data warehouses (Snowflake, BigQuery)
  • Cloud lakehouse (Databricks)

Object storage

Object, file, and block storage work differently. In object storage, data is managed as generic objects. These can be files or parts of files and combine the data to be stored with metadata about the stored object. The metadata could include, for example, a unique identifier, the size, the creation date and time, and, for an image, the location where it was taken. Objects are stored in a flat structure; the object’s identifier is used to locate it in the storage system. Amazon Web Services (AWS) calls the container in which objects are stored a bucket (rather than a directory).

File storage is what you are used to from a directory-based file system on your computer’s operating system. Data is stored in files, organized in hierarchical folders. You look up data by traversing the directory until you have located the file with the correct name.

Block storage breaks data into fixed-size blocks and stores the blocks as units. The Hadoop Distributed File System (HDFS) is a block-oriented storage system in which data is split into large blocks (128 MB by default) and the blocks are replicated on multiple machines in a distributed system for durability and high availability.

Cloud object storage has nearly replaced other Big Data storage solutions such as Hadoop and HDFS. You can still find Hadoop environments in organizations; lots of money was invested in building them, and organizations are slow to migrate off Hadoop. But the writing is on the wall: HDFS is being replaced by cloud storage, in particular cloud object storage, because it is cheaper and more durable. Since object storage supports storing files as objects, is the most economical storage alternative, and can scale out without limits, it is the primary storage form for file-based data in the cloud. Even cloud data warehouses have been built with cloud object storage as the primary storage layer.

The object storage systems of the three major cloud providers are Amazon S3 (Simple Storage Service) from AWS, Azure Blob Storage, and Google Cloud Storage. Common file formats you will find in these storage systems are CSV, JSON, and Parquet.

Managed cloud database

TODO

MotherDuck serverless cloud database

The convenience of a serverless database is not having to worry about the cloud instances the database is running on. With a local, on-premises database, you manage the storage and compute infrastructure—the machines—the database is running on. With a managed service (DBaaS), this aspect is taken care of by the managed service provider. However, you still must manage your database instances. For example, during periods of low usage, you might want to scale down the database or even shut it down to reduce cloud expenses.

Example: Please Turn Off Your Instances

A common mistake by cloud newbies is to assume that just because you are not actively using the cloud, you are not incurring expenses. When you start an instance in the cloud you need to stop it when it is not in use. Otherwise, it keeps running. Even if there is no workload on the instance, it will cost you money.

A class at a major university—which shall not be named—used cloud resources and the instructors forgot to tell students to shut down their cloud instances. Instead, the students started new instances every time they needed to do cloud computing. The academic unit conducting the class had allocated a $40,000 cloud computing budget over the next three years. The class spent $140,000 in one month.

A serverless system automatically scales to zero when not in use for a while. With serverless systems you tend to pay only for what you use, when you use it. An exception is the charge for storing data; you cannot turn the storage off. The MotherDuck serverless database is a great complement to the DuckDB analytic database. In fact, MotherDuck is built on DuckDB, and you can think of it as the cloud version of DuckDB. That makes MotherDuck a good choice for analytical workloads, since DuckDB is a relational column store, an analytic database.

Another nice feature of MotherDuck is the ability to connect to local databases and to cloud databases at the same time. This hybrid mode is useful if you have large data sets in cloud storage and small-to-large data sets stored locally. Since we already know how to work with DuckDB from Python, working with MotherDuck is a snap. The following statements create a connection to a MotherDuck serverless database:

import duckdb
con = duckdb.connect('md:')

Notice that we use the same duckdb library as previously. The special string ‘md:’ or ‘motherduck:’ inside the connect() function triggers a connection attempt to MotherDuck. Whether the connection attempt succeeds depends on whether you can authenticate to MotherDuck. By default, the connect() function will open a browser through which you can authenticate to the MotherDuck service using a previously established account.

Alternatively, you can click on Settings in the UI of the MotherDuck console at https://app.motherduck.com/ and download your authentication token. Store this token in the environment variable motherduck_token on your system and you can authenticate and connect directly to MotherDuck with the previous duckdb.connect() statement. It is also possible to list the authentication token directly in the duckdb.connect() function,

con = duckdb.connect('md:?motherduck_token=<token>')

but this is discouraged. Access credentials should not be shown in clear text in a source file that can easily be read and that might be shared with others. That is a security nightmare waiting to happen.

DuckDB and MotherDuck are relational database management systems; they can manage more than one database. You can see the list of databases with the SHOW DATABASES command:

con.sql("SHOW DATABASES")

By default, MotherDuck creates a database called my_db and a shared database with sample data. One of the cool features of MotherDuck is the ability to attach local databases to the instance. The following statement attaches a local version of the database for this course to MotherDuck.

con.sql("ATTACH 'ads5064.ddb'")
con.sql("SHOW DATABASES")

Since MotherDuck manages multiple databases, you have to tell it which database to work with. This is called the current database; by default it is my_db when you first connect. You can see which database is current with

con.sql("SELECT current_database()").show()

You can set any database as the current one with the USE statement:

con.sql("USE ads5064")
con.sql("SHOW TABLES")

You can access a table in a database that is not the current database by using a two-level name (database.table). This even works if one table is local and the other is in a cloud database. Suppose that ads5064 is a local version of the database for the course and that ads5064_md is stored in the cloud.

con.sql("SELECT count(*) FROM ads5064.apples")
con.sql("SELECT count(*) FROM ads5064_md.glaucoma")

This comes in very handy if you want to process local and cloud data in the same query, for example when joining tables.

MotherDuck also supports shared read-only databases, to which you connect through a URL. The URL is created by an administrator of the database with the CREATE SHARE command. You can then attach the database by using the URL:

con.sql("ATTACH 'md:_share/ads5064_md/e00063c9-568c-45ec-a372-dca85f6915fd' as course_db")

The URL is a link to a point-in-time snapshot of the database at the time it was shared. The administrator can update the share of the database; the link will remain valid.