import pandas as pd
import polars as pl
import duckdb

con = duckdb.connect(database="../ads.ddb", read_only=True)

pd_df = con.sql('select * from iris').df()
pd_dl = con.sql('select * from iris').pl()
17 Summarization
We encountered quantitative data summaries in the discussion of data profiling; describe()
in pandas or the sweetviz
visualization produced a number of quantities that summarize properties of the variables in the data set. In this chapter we take a closer look at location and dispersion statistics, important sufficient statistics, and how to summarize categorical data and associations.
The focus of summarization in this chapter is to produce numerical quantities and tabular output. Graphical summaries are covered in Chapter 18.
Definition: Data Summaries
Data summarization is the numerical, tabular, and graphical distillation of the essence of a data set and the relationships between its variables through aggregation.
Summarization reduces \(n\) data points to quantities that describe aspects of the variability in the data. These quantities are called descriptive statistics. For quantitative variables, the statistics measure aspects of the location and dispersion of the data. For qualitative variables, summaries describe unique values and frequencies.
Data summarization serves many purposes. For example:
Profiling. The “first date with the data” according to Borne (2021). See Section 16.2.
Description. What are the central tendencies and the dispersion of the variables? For example, what does the comparison of statistics measuring the central tendency tell us about the distribution of the data and the presence of outliers? What distributional assumptions are reasonable for the data? Are transformations in a feature processing step called for?
Aggregation. Suppose you could not store the raw data but you need to retain information for future statistical processing. What kind of information would you compute and squirrel away? A sufficient statistic is a function of the data that contains all information toward estimating a parameter of the data distribution. For example, the sample mean \(\frac{1}{n}\sum_{i=1}^n Y_i\) is sufficient to estimate the mean of a set of \(n\) independent and identically distributed random variables. If \(Y\) is uniform on \([0,\theta]\), then \(\max\{Y_i\}\) is sufficient for \(\theta\). The quantities we compute during summarization are frequently sufficient statistics; examples are sums, sums of squares, sums of cross-products.
Relationships. Summarization is not only about individual variables, but also about their relationship. The correlation matrix of \(p\) numerical variables is a frequent summary that describes the pairwise linear relationships in the data.
Dimension Reduction. In high-dimensional statistical problems the number of potential input variables is larger than what we can handle (on computational grounds) or should handle (on statistical grounds). Summarization can reduce a \(p\)-dimensional problem to an \(m\)-dimensional problem where \(m \ll p\). Principal component analysis (PCA) relies on matrix factorization (eigenvalue or singular value decomposition) of a crossproduct matrix to find a set of \(m\) linear combinations of the \(p\) input variables that explain a substantial amount of variability in the data. The \(m\) linear combinations summarize the essence of the relationships in the data.
17.1 Location and Dispersion
Table 17.1 shows important location statistics and Table 17.2 important dispersion statistics.
Sample Statistic | Symbol | Computation | Notes |
---|---|---|---|
Min | \(Y_{(1)}\) | The smallest non-missing value | |
Max | \(Y_{(n)}\) | The largest value | |
Mean | \(\overline{Y}\) | \(\frac{1}{n}\sum_{i=1}^n Y_i\) | Most important location measure, but can be affected by outliers |
Median | Med | \(\left \{ \begin{array}{cc} Y_{\left(\frac{n+1}{2}\right)} & n \text{ is odd} \\ \frac{1}{2} \left( Y_{\left(\frac{n}{2} \right)} + Y_{\left(\frac{n}{2}+1\right)} \right) & n\text{ is even} \end{array}\right .\) | Half of the observations are smaller than the median; robust against outliers |
Mode | Mode | The most frequent value; not useful when real numbers are unique | |
1st Quartile | \(Q_1\) | \(Y_{\left(\frac{1}{4}(n+1) \right)}\) | 25% of the observations are smaller than \(Q_1\) |
2nd Quartile | \(Q_2\) = Med | See Median | 50% of the observations are smaller than \(Q_2\). This is the median |
3rd Quartile | \(Q_3\) | \(Y_{\left(\frac{3}{4}(n+1) \right)}\) | 75% of the observations are smaller than \(Q_3\) |
X% Percentile | \(P_X\) | \(Y_{\left(\frac{X}{100}(n+1) \right)}\) | For example, 5% of the observations are larger than \(P_{95}\), the 95th percentile |
The most important location measures are the sample mean \(\overline{Y}\) and the median. The sample mean is the arithmetic average of the sample values. Because the sample mean can be affected by outliers—large values are pulling the sample mean up—the median is preferred in those situations. For example, in reporting central tendency of annual income, the median income is chosen over the arithmetic average.
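A quick numerical illustration of this point, as a minimal Python sketch with made-up income values:

import numpy as np

income = np.array([30000, 35000, 40000, 45000, 500000])  # one extreme earner

print(np.mean(income))    # 130000.0 -- pulled up by the single large value
print(np.median(income))  # 40000.0  -- essentially unaffected by the outlier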
Table 17.2 lists important statistics measuring variability (dispersion) of data.
Sample Statistic | Symbol | Computation | Notes |
---|---|---|---|
Range | \(R\) | \(Y_{(n)} - Y_{(1)}\) | \(n\)th order statistic (largest) minus first (smallest) order statistic |
Inter-quartile Range | IQR | \(Q_3 - Q_1\) | Used in constructing box plots; covers the central 50% of the data |
Standard Deviation | \(S\) | \(\sqrt{\frac{1}{n-1}\sum_{i=1}^n\left( Y_i - \overline{Y}\right)^2}\) | Most important dispersion measure; in the same units as the sample mean (the units of \(Y\)) |
Variance | \(S^2\) | \(\frac{1}{n-1}\sum_{i=1}^n \left( Y_i - \overline{Y} \right)^2\) | Important statistical measure of dispersion; in squared units of \(Y\) |
Skewness | \(g_1\) | \(\frac{\frac{1}{n}\sum_{i=1}^n(Y_i - \overline{Y})^3}{S^3}\) | Measures the symmetry of a distribution |
Kurtosis | \(k\) | \(\frac{\frac{1}{n}\sum_{i=1}^n(Y_i - \overline{Y})^4}{S^4}\) | Measures the peakedness of a distribution |
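The dispersion measures can be computed directly from their definitions. Here is a minimal Python sketch (numpy assumed; the sample values are made up); note that library routines such as pandas' skew() apply slightly different bias corrections than the simple moment ratios in the table.

import numpy as np

y = np.array([4.5, 5.1, 5.8, 6.4, 7.9, 5.0, 6.1, 5.5])   # made-up sample

ybar = y.mean()
s = y.std(ddof=1)                # sample standard deviation (n-1 denominator)

print("range    :", y.max() - y.min())
print("IQR      :", np.quantile(y, 0.75) - np.quantile(y, 0.25))
print("variance :", y.var(ddof=1))
print("std dev  :", s)
print("skewness :", np.mean((y - ybar) ** 3) / s ** 3)   # g1 as defined above
print("kurtosis :", np.mean((y - ybar) ** 4) / s ** 4)   # k as defined above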
In the parlance of databases, the descriptive statistics in the previous tables are also called aggregates. Aggregation in a database management system (DBMS) is the process of combining records into meaningful quantities called aggregates.
We use the term aggregates in a narrower sense: sufficient statistics that are used in the calculation of other quantities. If you know what you will use the data for, you can compute the necessary aggregates to perform the work later on. You do not need to access the raw data anymore. An application is the summarization of online or streaming data. Suppose you cannot rewind the data set and run through it again. If every data point can be seen just once, what aggregates do you need to compute to calculate means, variances, and correlation coefficients?
Iris Data
We now calculate basic summary statistics with pandas, polars, and SQL for the famous Iris data set, a staple of data science and machine learning instruction. The data was used by R.A. Fisher, a pioneer and founder of modern statistics, in a 1936 paper, "The use of multiple measurements in taxonomic problems." The data set contains fifty measurements of four flower characteristics, the length and width of sepals and petals, for each of three iris species: Iris setosa, Iris versicolor, and Iris virginica. In the iris flower the sepals are the larger, outer leaves and the petals the smaller, inner leaves. You will likely see the iris data again; it is used to teach visualization, regression, classification, clustering, etc.
We can load the data from our DuckDB database into pandas and polars DataFrames with the statements shown at the beginning of this chapter.
Alternatively, you can access the iris data through the scikit-learn
library:
from sklearn.datasets import load_iris

iris = load_iris()
iris.data[1:10,]
array([[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
A basic set of statistical measures is computed with the describe()
methods of these libraries:
pd_df.describe()
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
pd_dl.describe()
statistic | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
str | f64 | f64 | f64 | f64 | str |
"count" | 150.0 | 150.0 | 150.0 | 150.0 | "150" |
"null_count" | 0.0 | 0.0 | 0.0 | 0.0 | "0" |
"mean" | 5.843333 | 3.054 | 3.758667 | 1.198667 | null |
"std" | 0.828066 | 0.433594 | 1.76442 | 0.763161 | null |
"min" | 4.3 | 2.0 | 1.0 | 0.1 | "setosa" |
"25%" | 5.1 | 2.8 | 1.6 | 0.3 | null |
"50%" | 5.8 | 3.0 | 4.4 | 1.3 | null |
"75%" | 6.4 | 3.3 | 5.1 | 1.8 | null |
"max" | 7.9 | 4.4 | 6.9 | 2.5 | "virginica" |
Pandas and polars produce very similar output statistics for the numeric variables. Polars adds a row for the number of missing observations (null_count
) and some basic summaries for the qualitative species
variable. Apart from formatting and small differences in how the quartiles are interpolated, the results agree. The output column std
represents the sample standard deviation, 25%
, 50%
, and 75%
are the first, second, and third quartile (the 25th, 50th, and 75th percentiles), respectively.
Producing this table of summary statistics with SQL requires a bit more work. To compute the statistics for a particular column, say sepal_length:
"select count(sepal_length) as count, \
con.sql( mean(sepal_length) as mean, \
stddev(sepal_length) as std,\
min(sepal_length) as min, \
quantile(sepal_length,.25) as q1, \
quantile(sepal_length,.50) as q2, \
quantile(sepal_length,.75) as q3, \
max(sepal_length) as max from iris")
┌───────┬───────────────────┬────────────────────┬────────┬────────┬────────┬────────┬────────┐
│ count │ mean │ std │ min │ q1 │ q2 │ q3 │ max │
│ int64 │ double │ double │ double │ double │ double │ double │ double │
├───────┼───────────────────┼────────────────────┼────────┼────────┼────────┼────────┼────────┤
│ 150 │ 5.843333333333335 │ 0.8280661279778637 │ 4.3 │ 5.1 │ 5.8 │ 6.4 │ 7.9 │
└───────┴───────────────────┴────────────────────┴────────┴────────┴────────┴────────┴────────┘
And now we have to repeat this for other columns. In this case, SQL is much more verbose and clunky compared to the simple describe()
call. However, the power of SQL becomes clear when you further refine the analysis. Suppose you want to compute the previous result separately for each species in the data set. This is easily done by adding a GROUP BY clause to the SQL statement (group by species
) and listing species
in the SELECT:
"select species, count(sepal_length) as count, \
con.sql( mean(sepal_length) as mean, \
stddev(sepal_length) as std,\
min(sepal_length) as min, \
quantile(sepal_length,.25) as q1, \
quantile(sepal_length,.50) as q2, \
quantile(sepal_length,.75) as q3, \
max(sepal_length) as max from iris group by species")
┌────────────┬───────┬───────────────────┬────────────────────┬────────┬────────┬────────┬────────┬────────┐
│ species │ count │ mean │ std │ min │ q1 │ q2 │ q3 │ max │
│ varchar │ int64 │ double │ double │ double │ double │ double │ double │ double │
├────────────┼───────┼───────────────────┼────────────────────┼────────┼────────┼────────┼────────┼────────┤
│ virginica │ 50 │ 6.587999999999998 │ 0.635879593274432 │ 4.9 │ 6.2 │ 6.5 │ 6.9 │ 7.9 │
│ versicolor │ 50 │ 5.936 │ 0.5161711470638635 │ 4.9 │ 5.6 │ 5.9 │ 6.3 │ 7.0 │
│ setosa │ 50 │ 5.005999999999999 │ 0.3524896872134513 │ 4.3 │ 4.8 │ 5.0 │ 5.2 │ 5.8 │
└────────────┴───────┴───────────────────┴────────────────────┴────────┴────────┴────────┴────────┴────────┘
Interpretation: The average sepal length increases from I. setosa (5.006) to I. versicolor (5.936) to I. virginica (6.588). The variability of the sepal length measurements also increases in that order. You can find I. versicolor plants whose sepal length exceeds that of I. virginica specimens at the lower end of that species' range. This can be gleaned from the proximity of the sample means relative to the size of the standard deviations, and it is confirmed by comparing the min and max values across the species.
Using SQL here has another important advantage, which we'll explore further in the section on in-database analytics (Section 17.7). Briefly, rather than moving the raw data from the database to the Python session and performing the summarization there, the summaries are calculated by the database engine and only the results are moved to the Python session.
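As a small illustration of the difference, here is a sketch that reuses the con connection opened above, comparing aggregation in the database with aggregation after transferring the raw rows:

# Aggregation in the database: only three rows (one per species) cross into Python
means_in_db = con.sql("select species, avg(sepal_length) as mean_sl from iris group by species").df()

# Aggregation in Python: all 150 rows are transferred first, then summarized
means_in_py = con.sql("select * from iris").df().groupby("species")["sepal_length"].mean()

Both produce the same means; for a table with millions of rows, the first version moves far less data.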
Group-by Processing
A group-by analysis computes analytic results separately for each group of observations. It is a powerful tool to gain insight on how data varies within a group and between groups. Groups are often defined by the unique values of qualitative variables, but they can also be constructed by, for example, binning numeric variables. In the Iris example, the obvious grouping variable is species
.
Going from an ungrouped to a grouped analysis is easy with SQL—just add a GROUP BY clause; we gave an example of this in the previous subsection. With pandas and polars we need to do a bit more work. Suppose we want to compute the min, max, mean, and standard deviation for the sepal_width
of I. setosa and I. virginica. This requires filtering records (excluding I. versicolor) and calculating summary statistics separately for the two remaining species.
The syntax for aggregations with group-by is different in pandas and polars. With pandas, you can use a statement such as this:
"species"] != "versicolor"][["species","sepal_width"]] \
pd_df[pd_df["species").agg(['min','max','mean','std']) .groupby(
sepal_width
min max mean std
species
setosa 2.3 4.4 3.418 0.381024
virginica 2.2 3.8 2.974 0.322497
The first set of brackets applies a filter (selects rows), the second set of brackets selects the species
and sepal_width columns
. The resulting DataFrame is then grouped by species
and the groups are aggregated (summarized); four statistics are requested in the aggregation.
Eager and lazy evaluation
The pandas execution model is called eager evaluation. As soon as the request is made, the calculations take place. This is in contrast to lazy evaluation, where the execution of operations is delayed until the results are needed. For example, lazy evaluation of a file read delays the retrieval of data until records need to be processed.
Lazy evaluation has many advantages:
The overall query can be optimized.
Filters (called predicates in data processing parlance) can be pushed down as soon as possible, eliminating loading data into memory for subsequent steps.
Column selections (called projections) can also be done early to reduce the amount of data passed to the following step.
A lazily evaluated query can be *analyzed* prior to execution; restructuring the query can lead to further optimization.
Polars supports both eager and lazy evaluation. You can find the list of lazy query optimizations in the Polars documentation.
The Polars syntax to execute the group-by aggregation eagerly is:
filter(pl.col("species") != "versicolor").group_by("species").agg(
pd_dl."sepal_width").min().alias('min'),
pl.col("sepal_width").max().alias('max'),
pl.col("sepal_width").mean().alias('mean'),
pl.col("sepal_width").std().alias('std')
pl.col( )
species | min | max | mean | std |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"virginica" | 2.2 | 3.8 | 2.974 | 0.322497 |
"setosa" | 2.3 | 4.4 | 3.418 | 0.381024 |
The Polars syntax should be familiar to users of the dplyr package in R
. Operations on the DataFrame are piped from step to step. The .filter()
method selects rows, the .group_by()
method groups the filtered data, the .agg()
method aggregates the resulting rows. Since variable sepal_width
is used in multiple aggregates, we are adding an .alias()
to give the calculated statistic a unique name in the output.
The following statements perform lazy evaluation for the same query. The result of the operation is what Polars calls a LazyFrame, rather than a DataFrame. A LazyFrame is a promise on computation. The promise is fulfilled—the query is executed—with q.collect()
. Notice that we call the .lazy()
method on the DataFrame pd_dl
. Because of this, the object q
is a LazyFrame. If we had not specified .lazy()
, q
would be a DataFrame.
q = (
    pd_dl.lazy()
    .filter(pl.col("species") != "versicolor")
    .group_by("species")
    .agg(
        pl.col("sepal_width").min().alias('min'),
        pl.col("sepal_width").max().alias('max'),
        pl.col("sepal_width").mean().alias('mean'),
        pl.col("sepal_width").std().alias('std'),
    )
)
q.collect()
species | min | max | mean | std |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"setosa" | 2.3 | 2.3 | 3.418 | 2.3 |
"virginica" | 2.2 | 2.2 | 2.974 | 2.2 |
The results of eager and lazy evaluation are identical. The performance and memory requirements can be different, especially for large data sets. Fortunately, the eager API in Polars calls the lazy API under the covers in many cases and collects the results immediately.
You can see the optimized query with
q.explain(optimized=True)
'AGGREGATE[maintain_order: false]\n [col("sepal_width").min().alias("min"), col("sepal_width").max().alias("max"), col("sepal_width").mean().alias("mean"), col("sepal_width").std().alias("std")] BY [col("species")]\n FROM\n FILTER [(col("species")) != ("versicolor")]\n FROM\n DF ["sepal_length", "sepal_width", "petal_length", "petal_width", ...]; PROJECT["sepal_width", "species"] 2/5 COLUMNS'
Streaming execution
Another advantage of lazy evaluation in Polars is the support for streaming execution—where possible. Rather than loading the selected rows and columns into memory, Polars processes the data in batches allowing you to process large data volumes that exceed the memory capacity. To invoke streaming on a LazyFrame, simply add engine="streaming"
to the collection:
="streaming") q.collect(engine
species | min | max | mean | std |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"setosa" | 2.3 | 2.3 | 3.418 | 2.3 |
"virginica" | 2.2 | 2.2 | 2.974 | 2.2 |
Streaming capabilities in Polars are still under development and not supported for all operations. However, the operations for which streaming is supported include many of the important data manipulations, including group-by aggregation:
- filter, select,
- slice, head, tail
- with_columns
- group_by
- join
- sort
- scan_csv, scan_parquet
In the following example we are calculating the sample mean and standard deviation for three variables (num_7
, num_8
, num_9
) grouped by variable cat_1
for a data set with 30 million rows and 18 columns, stored in a Parquet file.
import polars as pl

nums = ['num_7', 'num_8', 'num_9']

q3 = (
    pl.scan_parquet('../datasets/Parquet/train.parquet')
    .group_by('cat_1')
    .agg(
        pl.col(nums).mean().name.suffix('_mean'),
        pl.col(nums).std().name.suffix('_std'),
    )
)
q3.collect(engine="streaming")
cat_1 | num_7_mean | num_8_mean | num_9_mean | num_7_std | num_8_std | num_9_std |
---|---|---|---|---|---|---|
i64 | f64 | f64 | f64 | f64 | f64 | f64 |
2 | 0.500114 | 0.500171 | 0.499994 | 0.288637 | 0.288664 | 0.288766 |
3 | 0.499942 | 0.499939 | 0.500096 | 0.288645 | 0.288696 | 0.288626 |
4 | 0.499922 | 0.500028 | 0.500063 | 0.288682 | 0.288736 | 0.288723 |
1 | 0.499961 | 0.499809 | 0.499931 | 0.288738 | 0.288651 | 0.2887 |
The scan_parquet() function reads lazily from the file train.parquet; the .group_by() and .agg() methods request the statistics for a list of columns. The .name.suffix() expression adds a string to the variable names to distinguish the results in the output.
The streaming polars code executes on my machine in 0.3 seconds. The pandas equivalent produces identical results in about 4 seconds—an order of magnitude slower with much higher memory consumption.
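For reference, here is a sketch of what a pandas equivalent might look like (assuming the same train.parquet file); pandas materializes the selected columns in memory before aggregating, which explains the longer run time and larger memory footprint.

import pandas as pd

nums = ['num_7', 'num_8', 'num_9']

# Read only the columns needed for the aggregation, then group and summarize in memory
train = pd.read_parquet('../datasets/Parquet/train.parquet', columns=['cat_1'] + nums)
train.groupby('cat_1')[nums].agg(['mean', 'std'])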
You can use lazy execution with data stored in other formats. Suppose we want to read data from a DuckDB database and analyze it lazily with mini-batch streaming. The following statements do exactly that for the Iris data stored in DuckDB:
qduck = (
    con.sql('select * from iris').pl().lazy()
    .select(["species", "sepal_width"])
    .filter(pl.col("species") != "versicolor")
    .group_by("species")
    .agg(
        pl.col("sepal_width").min().alias('min'),
        pl.col("sepal_width").max().alias('max'),
        pl.col("sepal_width").mean().alias('mean'),
        pl.col("sepal_width").std().alias('std'),
    )
)
qduck.collect(engine="streaming")
species | min | max | mean | std |
---|---|---|---|---|
str | f64 | f64 | f64 | f64 |
"setosa" | 2.3 | 2.3 | 3.418 | 2.3 |
"virginica" | 2.2 | 2.2 | 2.974 | 2.2 |
con.close()
17.2 Categorical Data
Summarizing categorical data numerically is a simple matter of calculating absolute frequencies (counts) or relative frequencies (percentages or proportions) of the observations in each category.
We demonstrate with a messy data set from an online manager salary survey from 2021. These data present a number of interesting data management and data quality problems that we will revisit several times. At this time, let’s summarize some of the categorical variables in the data.
Example: Manager Salary Survey
library(duckdb)

con <- dbConnect(duckdb(), dbdir="../ads.ddb", read_only=TRUE)
survey_df <- dbGetQuery(con, "SELECT * FROM SalarySurvey;")
dbDisconnect(con)

head(survey_df)
Timestamp AgeGroup Industry
1 2021-04-27 11:02:09 25-34 Education (Higher Education)
2 2021-04-27 11:02:21 25-34 Computing or Tech
3 2021-04-27 11:02:38 25-34 Accounting, Banking & Finance
4 2021-04-27 11:02:40 25-34 Nonprofits
5 2021-04-27 11:02:41 25-34 Accounting, Banking & Finance
6 2021-04-27 11:02:45 25-34 Education (Higher Education)
JobTitle AddlContextTitle Salary AddlSalary
1 Research and Instruction Librarian <NA> 55000 0.0
2 Change & Internal Communications Manager <NA> 54600 4000.0
3 Marketing Specialist <NA> 34000 <NA>
4 Program Manager <NA> 62000 3000.0
5 Accounting Manager <NA> 60000 7000.0
6 Scholarly Publishing Librarian <NA> 62000 <NA>
Currency OtherCurrency AddlContextSalary Country StateUS
1 USD <NA> <NA> United States Massachusetts
2 GBP <NA> <NA> United Kingdom <NA>
3 USD <NA> <NA> US Tennessee
4 USD <NA> <NA> USA Wisconsin
5 USD <NA> <NA> US South Carolina
6 USD <NA> <NA> USA New Hampshire
City ExperienceOverall ExperienceField EducationLevel Gender
1 Boston 5-7 years 5-7 years Master's degree Woman
2 Cambridge 8 - 10 years 5-7 years College degree Non-binary
3 Chattanooga 2 - 4 years 2 - 4 years College degree Woman
4 Milwaukee 8 - 10 years 5-7 years College degree Woman
5 Greenville 8 - 10 years 5-7 years College degree Woman
6 Hanover 8 - 10 years 2 - 4 years Master's degree Man
Race
1 White
2 White
3 White
4 White
5 White
6 White
When summarizing the AgeGroup
variable in the survey data, one is interested in the number of levels of the variable and in the frequency distribution. The unique values of the variable are obtained with the unique
function and the table
function produces the frequency counts. The useNA=
argument requests that observations for which the variable contains a missing value are also counted, if present.
unique
returns the values in the order in which they are encountered in the data frame. table
returns the values in sorted order (same as sort(unique())).
unique(survey_df$AgeGroup)
[1] "25-34" "45-54" "35-44" "18-24" "65 or over"
[6] "55-64" "under 18"
table(survey_df$AgeGroup, useNA="ifany")
18-24 25-34 35-44 45-54 55-64 65 or over under 18
1218 12657 9899 3188 992 94 14
We see that there are no missing values in AgeGroup
, otherwise an NA column would appear in the table. We also see that the sorted order of the levels is not how we would order the categories. under 18 should be the first category; it appears last because the levels are sorted as character strings rather than by age. The level 18-24 also appears out of order if the groups are arranged by increasing age. To fix this, AgeGroup
is converted into a factor and its levels are rearranged.
AgeGroupLevels <- levels(factor(survey_df$AgeGroup))
survey_df$AgeGroup <- factor(survey_df$AgeGroup,
                             c(AgeGroupLevels[7],
                               AgeGroupLevels[1],
                               AgeGroupLevels[2],
                               AgeGroupLevels[3],
                               AgeGroupLevels[4],
                               AgeGroupLevels[5],
                               AgeGroupLevels[6]))
levels(survey_df$AgeGroup)
[1] "under 18" "18-24" "25-34" "35-44" "45-54"
[6] "55-64" "65 or over"
table(survey_df$AgeGroup)
under 18 18-24 25-34 35-44 45-54 55-64 65 or over
14 1218 12657 9899 3188 992 94
To calculate proportions, wrap the table in the proportions
function.
round(proportions(table(survey_df$AgeGroup)),3)
under 18 18-24 25-34 35-44 45-54 55-64 65 or over
0.000 0.043 0.451 0.353 0.114 0.035 0.003
An example of a variable with missing values is EducationLevel
:
table(survey_df$EducationLevel, useNA="ifany")
College degree High School
13519 639
Master's degree PhD
8865 1426
Professional degree (MD, JD, etc.) Some college
1324 2067
<NA>
222
import pandas as pd
import duckdb

con = duckdb.connect(database="../ads.ddb", read_only=True)
survey_df = con.sql('SELECT * FROM SalarySurvey;').df()
con.close()

survey_df.head()
Timestamp AgeGroup ... Gender Race
0 2021-04-27 11:02:09.743 25-34 ... Woman White
1 2021-04-27 11:02:21.562 25-34 ... Non-binary White
2 2021-04-27 11:02:38.125 25-34 ... Woman White
3 2021-04-27 11:02:40.643 25-34 ... Woman White
4 2021-04-27 11:02:41.793 25-34 ... Woman White
[5 rows x 18 columns]
When summarizing the AgeGroup
variable in the survey data, one is interested in the number of levels of the variable and in the frequency distribution. The unique
method of pandas DataFrames returns the levels, and value_counts
returns the counts.
"AgeGroup"].unique() survey_df[
array(['25-34', '45-54', '35-44', '18-24', '65 or over', '55-64',
'under 18'], dtype=object)
"AgeGroup"].value_counts() survey_df[
AgeGroup
25-34 12657
35-44 9899
45-54 3188
18-24 1218
55-64 992
65 or over 94
under 18 14
Name: count, dtype: int64
To return proportions rather than counts, use normalize=True
:
"AgeGroup"].value_counts(normalize=True) survey_df[
AgeGroup
25-34 0.451037
35-44 0.352755
45-54 0.113606
18-24 0.043404
55-64 0.035350
65 or over 0.003350
under 18 0.000499
Name: proportion, dtype: float64
By default, value_counts sorts the levels in descending order of their counts. That sort order is not how we would order the categories: under 18 should be the first category, followed by 18-24. It appears last because it has the smallest count, and the level 18-24 also appears out of order if the groups are arranged by increasing age. To fix this, you can create a different index for the levels and call reindex():
new_order = ['under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or over']
survey_df["AgeGroup"].value_counts().reindex(new_order)
AgeGroup
under 18 14
18-24 1218
25-34 12657
35-44 9899
45-54 3188
55-64 992
65 or over 94
Name: count, dtype: int64
An example of a variable with missing values is EducationLevel
. Set the dropna=
option to False
to count missing values. The default is dropna=True
.
"EducationLevel"].value_counts(dropna=False) survey_df[
EducationLevel
College degree 13519
Master's degree 8865
Some college 2067
PhD 1426
Professional degree (MD, JD, etc.) 1324
High School 639
None 222
Name: count, dtype: int64
17.3 Distribution
Summarizing the distribution of data typically uses graphics such as density plots, histograms, and box plots. The box plot is one of the most useful visual summaries of the distribution; it combines information about central tendency (location), dispersion, and extreme observations (Figure 17.2).

Five Number Summary
We cover this technique here, because a box plot is based on five summary statistics, sometimes called Tukey’s five-number summary. The five numbers are
- the lower whisker
- the lower hinge of the box (\(Q_1\))
- the median (\(Q_2\))
- the upper hinge of the box (\(Q_3\))
- the upper whisker
The lower and upper whiskers extend to the most extreme observations that lie within \(1.5\times \text{IQR}\) of the lower and upper hinges, where IQR is the inter-quartile range, IQR = \(Q_3 - Q_1\).
Example: Manager Salary Survey (Cont’d)
The five-number summary can be obtained with the boxplot.stats
function in R
. The hinges in the five-number summary are versions of the first and third quartile. There are multiple ways in which percentiles (and thus quartiles) are calculated, depending on how ties are resolved and whether neighboring values are interpolated.
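The effect of the calculation rule is easy to see with a small numpy example (a sketch; the method= argument of np.quantile requires NumPy 1.22 or later):

import numpy as np

y = np.arange(1, 11)   # 1, 2, ..., 10
for m in ["linear", "lower", "higher", "midpoint", "nearest"]:
    print(m, np.quantile(y, [0.25, 0.75], method=m))

Each rule returns slightly different first and third quartiles for the same ten values.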
For the Salary
column in the survey data we obtain the following:
bps <- boxplot.stats(survey_df$Salary)
bps$stats
[1] 0 54000 75000 110000 194000
The lower whisker extends to 0, the smallest value, and the upper whisker extends to 194000. The median salary is 75000.
# Number of outliers (total, below, above)
length(bps$out)
[1] 1186
sum(bps$out < bps$stats[1])
[1] 0
sum(bps$out > bps$stats[5])
[1] 1186
# quantile(survey_df$Salary,c(1,3)/4)
There are 1186 outliers, all of which lie beyond the upper whisker.
You can calculate the five-number summary in Python by either writing a custom function or by extracting the results from the matplotlib
function that produces box plots.
Here is the first approach:
import numpy as np

def boxplot_stats_1(data):
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1

    # Calculate fence boundaries
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Find whisker extents (furthest points within the fences)
    lower_whisker = min([x for x in data if x >= lower_fence])
    upper_whisker = max([x for x in data if x <= upper_fence])

    return [lower_whisker, q1, median, q3, upper_whisker]

five_num = boxplot_stats_1(survey_df['Salary'])
print(five_num)
[0.0, 54000.0, 75000.0, 110000.0, 194000.0]
Here is the version that extracts the five-number summary from the boxplot
function in matplotlib
.
import matplotlib.pyplot as plt
import numpy as np

def boxplot_stats_2(data):
    fig, ax = plt.subplots()
    bp = ax.boxplot(data, patch_artist=True)
    plt.close(fig)

    # Extract whisker extents
    whisker_bottom = bp['whiskers'][0].get_ydata()[1]
    whisker_top = bp['whiskers'][1].get_ydata()[1]

    # Extract quartiles and median
    q1, q3 = bp['boxes'][0].get_path().vertices[0:5, 1][[1, 2]]
    median = bp['medians'][0].get_ydata()[0]

    return [whisker_bottom, q1, median, q3, whisker_top]

five_num = boxplot_stats_2(survey_df['Salary'])
print(five_num)
[0.0, 54000.0, 75000.0, 110000.0, 194000.0]
Stem and Leaf Plot
Another summary of the distribution that is both numerical and graphical, is the stem-and-leaf plot, introduced by Tukey (1977). The stem-and-leaf plot conveys information about the shape of the distribution while retaining the numerical information.
The stem
function in the default graphics library produces the stem-and-leaf plot. The following code generates the display for the sepal length variable in the iris data set. Note that the sepal lengths range from 4.3 to 7.9 in the data.
summary(iris$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
stem(iris$Sepal.Length)
The decimal point is 1 digit(s) to the left of the |
42 | 0
44 | 0000
46 | 000000
48 | 00000000000
50 | 0000000000000000000
52 | 00000
54 | 0000000000000
56 | 00000000000000
58 | 0000000000
60 | 000000000000
62 | 0000000000000
64 | 000000000000
66 | 0000000000
68 | 0000000
70 | 00
72 | 0000
74 | 0
76 | 00000
78 | 0
The values, up to the first decimal point, to the left of the "|" marker are the stem. The leaves on each stem represent a histogram of the values. The values on the stem are binned somewhat, since the smallest value is 4.3 and the largest is 7.9.
Changing the scale of the plot stretches (scale > 1) or compresses (scale < 1) the histogram.
stem(iris$Sepal.Length, scale=0.5)
The decimal point is at the |
4 | 3444
4 | 566667788888999999
5 | 000000000011111111122223444444
5 | 5555555666666777777778888888999
6 | 00000011111122223333333334444444
6 | 5555566777777778889999
7 | 0122234
7 | 677779
With a scale of 0.5, the stem and leaf values are separated at the decimal point. The first row of the display represents the values 4.3, 4.4, 4.4, and 4.4. The last row represents the values 7.6, 7.7, 7.7, 7.7, 7.7, and 7.9.
The stemgraphic
library produces stem-and-leaf plots in Python. The values are arranged in descending order. The left-most column is a cumulative observation count.
import stemgraphic
from sklearn.datasets import load_iris

iris = load_iris()

stemgraphic.stem_graphic(iris.data[:, 0], scale=1)
Applying a scaling factor changes the values displayed on the stem, in contrast to the R
implementation.
stemgraphic.stem_graphic(iris.data[:, 0], scale=0.5)
Notice that there is also a stem
plotting function in matplotlib.pyplot, but that is a different plot type.
The stem-and-leaf plot is useful to get a quick view of the shape of the distribution as well as the values. With large data sets the display is less helpful, because it might not be possible to display all the values on the stems. Here is the default stem-and-leaf plot for log salary in the survey data.
The decimal point is at the |
0 | 0
1 | 46
2 | 79
3 | 3456666666677777788999999
4 | 00000000000000111112222222223334444455566667777899
5 | 000122569
6 | 124899
7 | 1344578888
8 | 01333455566677888899
9 | 00111222222333333333444444444444444445555555555555555555555556666666+173
10 | 00000000000000000000000000000000000000000000000000000000000000000000+7437
11 | 00000000000000000000000000000000000000000000000000000000000000000000+17383
12 | 00000000000000000000000000000000000000000000000000000000000000000000+2397
13 | 00000000000000000000111111111111111111111112222222222222223333333333+65
14 | 000000000111111222345566678899
15 | 00112223334445666999
16 | 00139
17 | 1456
18 | 46
19 | 0
20 | 6
21 |
22 | 5
17.4 Relationships
Continuous Data
The usual statistic to express the relationship between two continuous variables is the Pearson product-moment correlation coefficient, or correlation coefficient for short: \[ r_{xy} = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i - \overline{y})} {\sqrt{\sum_{i=1}^n (x_i-\overline{x})^2\,\sum_{i=1}^n(y_i-\overline{y})^2}} \]
The correlation coefficient \(r_{xy}\) is an estimator of the correlation, a property of the joint distribution \(f(x,y)\) of two random variables: \[ \text{Corr}[X,Y] = \frac{\text{E}\left[(X-\text{E}[X])(Y-\text{E}[Y])\right]}{\sqrt{\text{Var}[X]\text{Var}[Y]}} \] The numerator of the correlation is the covariance \[ \text{Cov}[X,Y] = \text{E}\left[(X-\text{E}[X])(Y-\text{E}[Y])\right] = \text{E}[XY] - \text{E}[X]\,\text{E}[Y] \]
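To make the estimator concrete, here is a minimal Python sketch (numpy assumed; the simulated x and y are only illustrative) that computes \(r_{xy}\) directly from the formula and checks it against numpy's built-in function:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# Pearson correlation coefficient computed from its definition
r_xy = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

print(round(r_xy, 6), round(np.corrcoef(x, y)[0, 1], 6))  # the two values agree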
Example: Fitness Data
The cor
function in R
computes pairwise correlations between all numerical columns of a matrix (or data frame) or between the columns of two matrices (or data frames) if you specify the x=
and y=
arguments.
con <- dbConnect(duckdb(), dbdir="../ads.ddb", read_only=TRUE)
fitness <- dbGetQuery(con, "SELECT * FROM fitness;")
dbDisconnect(con)

round(cor(fitness), 6)
Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse
Age 1.000000 -0.233539 -0.304592 0.188745 -0.164100 -0.337870 -0.432916
Weight -0.233539 1.000000 -0.162753 0.143508 0.043974 0.181516 0.249381
Oxygen -0.304592 -0.162753 1.000000 -0.862195 -0.399356 -0.397974 -0.236740
RunTime 0.188745 0.143508 -0.862195 1.000000 0.450383 0.313648 0.226103
RestPulse -0.164100 0.043974 -0.399356 0.450383 1.000000 0.352461 0.305124
RunPulse -0.337870 0.181516 -0.397974 0.313648 0.352461 1.000000 0.929754
MaxPulse -0.432916 0.249381 -0.236740 0.226103 0.305124 0.929754 1.000000
round(cor(x=fitness,y=fitness$Oxygen),6)
[,1]
Age -0.304592
Weight -0.162753
Oxygen 1.000000
RunTime -0.862195
RestPulse -0.399356
RunPulse -0.397974
MaxPulse -0.236740
To compute the matrix of all pairwise correlations between the numeric columns in a pandas DataFrame, you can use the pandas corr
method. To compute the correlations between columns of one data frame and columns of another use the corrwith
method. The latter is also convenient to compute correlations of one variable with all others in a DataFrame.
import pandas as pd
import duckdb

con = duckdb.connect(database="../ads.ddb", read_only=True)
fitness = con.sql('SELECT * FROM Fitness;').df()
con.close()

pd.set_option('display.max_columns', None)
fitness.corr()
Age Weight Oxygen RunTime RestPulse RunPulse \
Age 1.000000 -0.233539 -0.304592 0.188745 -0.164100 -0.337870
Weight -0.233539 1.000000 -0.162753 0.143508 0.043974 0.181516
Oxygen -0.304592 -0.162753 1.000000 -0.862195 -0.399356 -0.397974
RunTime 0.188745 0.143508 -0.862195 1.000000 0.450383 0.313648
RestPulse -0.164100 0.043974 -0.399356 0.450383 1.000000 0.352461
RunPulse -0.337870 0.181516 -0.397974 0.313648 0.352461 1.000000
MaxPulse -0.432916 0.249381 -0.236740 0.226103 0.305124 0.929754
MaxPulse
Age -0.432916
Weight 0.249381
Oxygen -0.236740
RunTime 0.226103
RestPulse 0.305124
RunPulse 0.929754
MaxPulse 1.000000
fitness.corrwith(fitness['Oxygen'])
Age -0.304592
Weight -0.162753
Oxygen 1.000000
RunTime -0.862195
RestPulse -0.399356
RunPulse -0.397974
MaxPulse -0.236740
dtype: float64
The correlation matrix is a symmetric matrix since \(r_{xy} = r_{yx}\) and the values of the diagonal are 1.0 since a variable is perfectly correlated with itself, \(r_{xx} = 1.\)
Categorical Data
Summarizing the relationship between two categorical variables uses cross-tabulations of counts and proportions.
Example: Manager Salary Survey
The table
function in base
R
computes single tables as well as more complex cross-tabulations. The following statement requests a two-way table of overall professional experience level and AgeGroup
in the survey data.
table(survey_df$ExperienceOverall,survey_df$AgeGroup)
under 18 18-24 25-34 35-44 45-54 55-64 65 or over
1 year or less 4 314 185 16 3 1 0
11 - 20 years 0 3 1898 6989 668 61 5
2 - 4 years 5 746 2168 83 18 6 0
21 - 30 years 2 1 6 1206 2107 302 13
31 - 40 years 1 1 2 4 303 523 35
41 years or more 1 1 1 1 0 81 39
5-7 years 1 134 4307 404 28 7 1
8 - 10 years 0 18 4090 1196 61 11 1
The levels of AgeGroup
had been arranged previously to increase monotonically. The levels of ExperienceOverall
have not been adjusted, they are listed in the order in which they appear in the data. The following code turns ExperienceOverall
into a factor and rearranges the levels.
ExperienceLevels <- levels(factor(survey_df$ExperienceOverall))
ExperienceLevels
[1] "1 year or less" "11 - 20 years" "2 - 4 years" "21 - 30 years"
[5] "31 - 40 years" "41 years or more" "5-7 years" "8 - 10 years"
survey_df$ExperienceOverall <- factor(survey_df$ExperienceOverall,
                                      c(ExperienceLevels[1],
                                        ExperienceLevels[3],
                                        ExperienceLevels[7],
                                        ExperienceLevels[8],
                                        ExperienceLevels[2],
                                        ExperienceLevels[4],
                                        ExperienceLevels[5],
                                        ExperienceLevels[6]))
levels(survey_df$ExperienceOverall)
[1] "1 year or less" "2 - 4 years" "5-7 years" "8 - 10 years"
[5] "11 - 20 years" "21 - 30 years" "31 - 40 years" "41 years or more"
table(survey_df$ExperienceOverall,survey_df$AgeGroup)
under 18 18-24 25-34 35-44 45-54 55-64 65 or over
1 year or less 4 314 185 16 3 1 0
2 - 4 years 5 746 2168 83 18 6 0
5-7 years 1 134 4307 404 28 7 1
8 - 10 years 0 18 4090 1196 61 11 1
11 - 20 years 0 3 1898 6989 668 61 5
21 - 30 years 2 1 6 1206 2107 302 13
31 - 40 years 1 1 2 4 303 523 35
41 years or more 1 1 1 1 0 81 39
When calculating the cross-tabulation of AgeGroup
and overall experience level with the pandas crosstab
function, the rows and columns are arranged in the order in which they are encountered in the data.
ct = pd.crosstab(survey_df['ExperienceOverall'], survey_df['AgeGroup'])
ct
AgeGroup 18-24 25-34 35-44 45-54 55-64 65 or over under 18
ExperienceOverall
1 year or less 314 185 16 3 1 0 4
11 - 20 years 3 1898 6989 668 61 5 0
2 - 4 years 746 2168 83 18 6 0 5
21 - 30 years 1 6 1206 2107 302 13 2
31 - 40 years 1 2 4 303 523 35 1
41 years or more 1 1 1 0 81 39 1
5-7 years 134 4307 404 28 7 1 1
8 - 10 years 18 4090 1196 61 11 1 0
This is not optimal and makes it difficult to read the information and to spot possible data problems. To change the order of the rows or columns, we can define the appropriate order and apply the reindex
method to the cross-tabulation. To reindex the rows, use axis='index'
, to reindex the columns use axis='columns'
(or use axis=0
and axis=1
).
age_order = ['under 18', '18-24', '25-34', '35-44', '45-54', '55-64', '65 or over']
exp_order = ['1 year or less', '2 - 4 years', '5-7 years', '8 - 10 years',
             '11 - 20 years', '21 - 30 years', '31 - 40 years', '41 years or more']
ct = ct.reindex(age_order, axis='columns').reindex(exp_order, axis="index")
ct
AgeGroup under 18 18-24 25-34 35-44 45-54 55-64 65 or over
ExperienceOverall
1 year or less 4 314 185 16 3 1 0
2 - 4 years 5 746 2168 83 18 6 0
5-7 years 1 134 4307 404 28 7 1
8 - 10 years 0 18 4090 1196 61 11 1
11 - 20 years 0 3 1898 6989 668 61 5
21 - 30 years 2 1 6 1206 2107 302 13
31 - 40 years 1 1 2 4 303 523 35
41 years or more 1 1 1 1 0 81 39
After rearranging the rows and columns we see immediately a data quality problem with the survey data. How can employees have more overall professional experience than they are years old? How does someone under 18 years of age have 21–30 years of experience? Employees age 55–64 with less than one year of professional experience also raises questions about the quality of the data.
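The tables above show counts; the proportions mentioned at the beginning of this subsection are just as easy to obtain. Here is a sketch using the normalize= argument of pd.crosstab (row proportions shown; 'columns' and 'all' are also accepted):

# Row proportions: each experience level sums to 1 across the age groups
row_props = pd.crosstab(survey_df['ExperienceOverall'], survey_df['AgeGroup'],
                        normalize='index')
row_props = row_props.reindex(exp_order, axis='index').reindex(age_order, axis='columns')
row_props.round(3)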
17.5 Aggregates
Important aggregates that allow us to estimate other quantities are based on sums and sums of squares. Specifically,
Aggregate | Formula | Symbol | Used For |
---|---|---|---|
Sum | \(\sum_{i=1}^n y_i\) | \(S_y\) | Mean, variance, std. deviation, … |
Sum of squares | \(\sum_{i=1}^n y_i^2\) | \(S_{yy}\) | Variance, std. deviation |
Sum of cross-products | \(\sum_{i=1}^n x_i y_i\) | \(S_{xy}\) | Covariance, correlation |
Number of observations | \(n = \sum_{i=1}^n 1\) | \(n\) | |
These quantities are called the raw sum, sum of squares, and sum of cross-products.
As mentioned previously, aggregates are handy summaries to pre-compute quantities based on which you can later compute other quantities of interest. Suppose you are processing data that is streamed to an application. Every record can be seen just once, you cannot rewind the stream and run through the data again. The stream consists of three variables, \(X\), \(Y\), and \(Z\). What are the quantities you need to aggregate while the stream is processed so that you can calculate the stream means \(\overline{x}\), \(\overline{y}\), \(\overline{z}\), the standard deviations \(s_x\), \(s_y\), \(s_z\), and the correlations \(r_{xy}\), \(r_{xz}\), \(r_{yz}\)?
If we were to calculate the sample standard deviation based on the formula \[ s_y = \sqrt{\frac{\sum_{i=1}^n \left(y_i - \overline{y}\right)^2}{n-1}} \tag{17.1}\]
we need two passes through the data. A pass to calculate \(\overline{y}\) (and \(n\)) and a pass to calculate the sum of the squared differences \((y_i-\overline{y})^2\). If you rewrite the formula as \[ s_y = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - \left(\sum_{i=1}^n y_i\right)^2/n\right)} \tag{17.2}\]
then we can calculate the aggregates \(S_y\), \(S_{yy}\), and \(n\), on a single pass through the data and assemble the standard deviation afterwards as \[ s_y = \sqrt{\frac{1}{n-1}\left(S_{yy}-S_y^2/n\right)} \]
Let’s simulate 1,000 observations from three independent Gaussian(0,4) distributions and compute the means \(\overline{x}\), \(\overline{y}\), \(\overline{z}\), the standard deviations \(s_x\), \(s_y\), \(s_z\), and the correlation coefficients \(r_{xy}\), \(r_{xz}\), \(r_{yz}\) in a single pass through the data.
library(Matrix)

set.seed(234)
dat <- matrix(rnorm(3000, mean=0, sd=2), ncol=3, nrow=1000)
colnames(dat) <- c("X","Y","Z")

# Compute the sufficient statistics in a single pass
# through the data
single_pass <- function(X) {
    n <- dim(X)[1]
    p <- dim(X)[2]
    S <- rep(0,p)
    S2 <- rep(0,p)
    Sxy <- rep(0,p*(p-1)/2)
    for (i in 1:n) {
        S <- S + X[i,]
        S2 <- S2 + X[i,]^2
        # Extracting the lower triangular values of the cross-product
        # matrix to compute Sxy.
        cp <- tcrossprod(X[i,])
        cp[upper.tri(cp,diag=T)] <- NA
        Sxy <- Sxy + na.omit(as.vector(cp))
    }
    return(list(n=n, S=S, S2=S2, Sxy=Sxy))
}

s_st <- single_pass(dat)
s_st
$n
[1] 1000
$S
X Y Z
-43.22597 12.23111 -39.06777
$S2
X Y Z
3878.278 3965.424 3676.894
$Sxy
[1] -98.12442 70.45634 125.67237
attr(,"na.action")
[1] 1 4 5 7 8 9
attr(,"class")
[1] "omit"
# Compute the statistics
means <- s_st$S/s_st$n
vars <- s_st$S2 - ((s_st$S)^2)/s_st$n
vars <- vars/(s_st$n-1)
sds <- sqrt(vars)
cat("Sample means from sufficient statistics" , round(means,4),"\n")
Sample means from sufficient statistics -0.0432 0.0122 -0.0391
cat("Sample variances from sufficient statistics", round(vars,4) ,"\n")
Sample variances from sufficient statistics 3.8803 3.9692 3.679
cat("Sample std. devs from sufficient statistics", round(sds,4) ,"\n")
Sample std. devs from sufficient statistics 1.9698 1.9923 1.9181
#Verification
apply(dat,2,mean)
X Y Z
-0.04322597 0.01223111 -0.03906777
apply(dat,2,var)
X Y Z
3.880290 3.969244 3.679047
apply(dat,2,sd)
X Y Z
1.969845 1.992296 1.918084
corxy <- (s_st$Sxy[1] - s_st$S[1]*s_st$S[2]/s_st$n) /
    sqrt((s_st$S2[1]-s_st$S[1]^2/s_st$n)*(s_st$S2[2]-s_st$S[2]^2/s_st$n))
corxz <- (s_st$Sxy[2] - s_st$S[1]*s_st$S[3]/s_st$n) /
    sqrt((s_st$S2[1]-s_st$S[1]^2/s_st$n)*(s_st$S2[3]-s_st$S[3]^2/s_st$n))
coryz <- (s_st$Sxy[3] - s_st$S[2]*s_st$S[3]/s_st$n) /
    sqrt((s_st$S2[2]-s_st$S[2]^2/s_st$n)*(s_st$S2[3]-s_st$S[3]^2/s_st$n))
cat("Corr(X,Y)", round(corxy,4), "\n")
Corr(X,Y) -0.0249
cat("Corr(X,Z)", round(corxz,4), "\n")
Corr(X,Z) 0.0182
cat("Corr(Y,Z)", round(coryz,4), "\n")
Corr(Y,Z) 0.033
# Verification
round(cor(dat),4)
X Y Z
X 1.0000 -0.0249 0.0182
Y -0.0249 1.0000 0.0330
Z 0.0182 0.0330 1.0000
import numpy as np
import pandas as pd

np.random.seed(234)
np.set_printoptions(suppress=True)

dat = np.random.normal(loc=0, scale=2, size=(1000, 3))
column_names = ["X", "Y", "Z"]
dat_df = pd.DataFrame(dat, columns=column_names)
def single_pass(X):
    """
    Compute the sufficient statistics in a single pass through the data
    """
    n, p = X.shape
    S = np.zeros(p)
    S2 = np.zeros(p)
    Sxy = np.zeros(p * (p - 1) // 2)

    for i in range(n):
        row = X[i, :]
        S += row
        S2 += row**2

        # Compute cross products for lower triangle
        cp = np.outer(row, row)
        # Get lower triangle indices (excluding diagonal)
        lower_tri_indices = np.tril_indices(p, k=-1)
        cross_products = cp[lower_tri_indices]
        Sxy += cross_products

    return {
        'n': n,
        'S': S,
        'S2': S2,
        'Sxy': Sxy
    }
# Compute sufficient statistics
s_st = single_pass(dat)
print("Sufficient statistics:")
Sufficient statistics:
print(f"n: {s_st['n']}")
n: 1000
print(f"S: {s_st['S']}")
S: [ 49.39470037 10.38558094 -39.33627905]
print(f"S2: {s_st['S2']}")
S2: [4295.34809822 4079.78232515 3701.10077859]
print(f"Sxy: {s_st['Sxy']}")
Sxy: [12.84973181 38.51039105 -1.4595958 ]
# Compute the statistics
means = s_st['S'] / s_st['n']
vars = s_st['S2'] - (s_st['S']**2) / s_st['n']
vars = vars / (s_st['n'] - 1)
sds = np.sqrt(vars)
print(f"\nSample means from sufficient statistics: {np.round(means, 4)}")
Sample means from sufficient statistics: [ 0.0494 0.0104 -0.0393]
print(f"Sample variances from sufficient statistics: {np.round(vars, 4)}")
Sample variances from sufficient statistics: [4.2972 4.0838 3.7033]
print(f"Sample std. devs from sufficient statistics: {np.round(sds, 4)}")
Sample std. devs from sufficient statistics: [2.073 2.0208 1.9244]
# Verification using numpy/pandas built-in functions
print(f"\nVerification:")
Verification:
print(f"NumPy means: {np.round(np.mean(dat, axis=0), 4)}")
NumPy means: [ 0.0494 0.0104 -0.0393]
print(f"NumPy variances: {np.round(np.var(dat, axis=0, ddof=1), 4)}")
NumPy variances: [4.2972 4.0838 3.7033]
print(f"NumPy std devs: {np.round(np.std(dat, axis=0, ddof=1), 4)}")
NumPy std devs: [2.073 2.0208 1.9244]
# Compute correlations from sufficient statistics
n = s_st['n']
S = s_st['S']
S2 = s_st['S2']
Sxy = s_st['Sxy']

corxy = (Sxy[0] - S[0] * S[1] / n) / np.sqrt((S2[0] - S[0]**2 / n) * (S2[1] - S[1]**2 / n))
corxz = (Sxy[1] - S[0] * S[2] / n) / np.sqrt((S2[0] - S[0]**2 / n) * (S2[2] - S[2]**2 / n))
coryz = (Sxy[2] - S[1] * S[2] / n) / np.sqrt((S2[1] - S[1]**2 / n) * (S2[2] - S[2]**2 / n))
print(f"\nCorr(X,Y): {round(corxy, 4)}")
Corr(X,Y): 0.0029
print(f"Corr(X,Z): {round(corxz, 4)}")
Corr(X,Z): 0.0102
print(f"Corr(Y,Z): {round(coryz, 4)}")
Corr(Y,Z): -0.0003
# Verification using numpy correlation
print(f"\nVerification - Correlation matrix:")
Verification - Correlation matrix:
corr_matrix = np.corrcoef(dat.T)
print(np.round(corr_matrix, 4))
[[ 1. 0.0029 0.0102]
[ 0.0029 1. -0.0003]
[ 0.0102 -0.0003 1. ]]
17.6 Dimension Reduction
Tabular (row-column) data can be high-dimensional in two ways: many rows and/or many columns. The data summaries and aggregates discussed above essentially perform dimension reduction across rows of data. If you want to calculate the sample mean you do not need to keep \(Y_1, \cdots, Y_n\); two aggregates suffice: \(n\) and \(S_y\).
More often, dimension reduction refers to the elimination or aggregation of columns of data. Suppose we have \(p\) input columns in the data, and that \(p\) is large. Do we need all of the input variables? Can we reduce their number without sacrificing (too much) information?
Two general approaches to reducing the number of input variables are feature selection and component analysis. Feature selection uses analytical techniques to identify which features are promising to be used in the analysis and eliminates the others. Principal component analysis uses information from all \(p\) inputs to construct a smaller set of \(m\) inputs that explain a sufficiently large amount of variability among the inputs.
Principal Component Analysis (PCA)
PCA is one of the oldest statistical techniques; it was developed by Karl Pearson in 1901. It still plays an important role in machine learning and data science today. Dimension reduction is just one of the applications of PCA; it is also used in visualizing high-dimensional data and in data imputation.
Suppose we have \(n\) observations on \(p\) inputs. Denote as \(x_{ij}\) the \(i\)th observation on the \(j\)th input variable. A principal component is a linear combination of the \(p\) inputs. The elements of the \(j\)th component score vector are calculated as follows:
\[ z_{ij} = \phi_{1j} x_{i1} + \phi_{2j} x_{i2} + \cdots + \phi_{pj} x_{ip} \quad i=1,\cdots,n; j=1,\cdots,p \] The coefficients \(\phi_{kj}\) are called the rotations or loadings of the principal components. The \(z_{ij}\) are called the PCA scores.
In constructing the linear combination, the \(x\)s are always centered to have mean zero and are usually scaled to have standard deviation one.
The scores \(z_{ij}\) are then collected into a score vector \(\textbf{z}_j = [z_{1j}, \cdots, z_{nj}]\) for the \(j\)th component. Note that for each observation \(i\) there are \(p\) inputs and \(p\) scores. So what have we gained? Instead of the \(p\) vectors \(\textbf{x}_1, \cdots, \textbf{x}_p\) we now have the \(p\) vectors \(\textbf{z}_1, \cdots, \textbf{z}_p\).
Because of the way in which the \(\phi_{kj}\) are found, the PCA scores have very special properties:
They have zero mean (if the data were centered):\(\sum_{i=1}^n z_{ij} = 0\)
They are uncorrelated: \(\text{Corr}[z_j, z_k] = 0, \forall j \ne k\)
\(\text{Var}\left[\sum_{j=1}^p z_j\right] = \sum_{j=1}^p \text{Var}[z_j] = p\) (if data was scaled)
The components can be ordered in terms of their variance \(\Rightarrow \text{Var}[z_1] > \text{Var}[z_2] > \cdots > \text{Var}[z_p]\)
This is useful in many ways:
The components provide a decomposition of the variability in the data. The first component explains the greatest proportion of the variability, the second component explains the next largest proportion of variability, and so forth.
The contributions of the components are independent, each component is orthogonal to the other components. For example the second component is a linear combination of the \(X\)s that models a relationship not captured by the first (or any other) component.
The loadings for each component inform about the particular linear relationship. The magnitude of the loading coefficients reflects which inputs are driving the linear relationship. For example, when all coefficients are of similar magnitude, the component expresses some overall, average trend. When a few coefficients stand out it suggests that the associated attributes explain most of the variability captured by the component.
To summarize high-dimensional data, we can ignore those components that explain a small amount of variability and turn a high-dimensional problem into a lower-dimensional one. Often, only the first two principal components are used to visualize the data and for further analysis.
Principal component analysis (PCA) finds linear combinations of \(p\) inputs that explain decreasing amounts of variability among the \(x\)'s. Not any linear combination will do; the principal components are orthogonal to each other and project in the directions in which the inputs are most variable. That means they decompose the variability in the inputs into non-overlapping chunks.
Consider the following data set with \(n=10\) and \(p=4\).
Obs | \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) |
---|---|---|---|---|
1 | 0.344 | 0.364 | 0.806 | 0.160 |
2 | 0.363 | 0.354 | 0.696 | 0.249 |
3 | 0.196 | 0.189 | 0.437 | 0.248 |
4 | 0.200 | 0.212 | 0.590 | 0.160 |
5 | 0.227 | 0.229 | 0.437 | 0.187 |
6 | 0.204 | 0.233 | 0.518 | 0.090 |
7 | 0.197 | 0.209 | 0.499 | 0.169 |
8 | 0.165 | 0.162 | 0.536 | 0.267 |
9 | 0.138 | 0.116 | 0.434 | 0.362 |
10 | 0.151 | 0.151 | 0.483 | 0.223 |
\(\overline{x}_j\) | 0.218 | 0.222 | 0.544 | 0.211 |
\(s_j\) | 0.076 | 0.081 | 0.122 | 0.075 |
When the \(x\)s are centered with their means and scaled by their standard deviations, and the principal components are computed, the matrix of loadings is \[ \boldsymbol{\Phi} = \left [ \begin{array}{r r r r } -0.555 & 0.235 & 0.460 & -0.652\\ -0.574 &0.052 &0.336 & 0.745\\ -0.530 &0.208 &-0.821 &-0.053\\ 0.286 &0.948 &0.047 & 0.132\\ \end{array} \right] \tag{17.3}\]
This matrix can now be used, along with the data in Table 17.3 to compute the scores \(Z_{im}\). For example,
\[ Z_{11} = -0.555 \frac{0.344-0.218}{0.076} -0.574\frac{0.364-0.222}{0.081} - 0.530\frac{0.806-0.544}{0.122} + 0.286\frac{0.16-0.211}{0.075} = -3.248 \]
\[ Z_{32} = 0.235 \frac{0.196-0.218}{0.076} +0.052\frac{0.189-0.222}{0.081} + 0.208\frac{0.437-0.544}{0.122} +0.948\frac{0.248-0.211}{0.075} = 0.194 \]
Table 17.4 displays the four inputs and the four scores \(Z_1, \cdots Z_4\) from the PCA.
Obs | \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) | \(Z_1\) | \(Z_2\) | \(Z_3\) | \(Z_4\) |
---|---|---|---|---|---|---|---|---|
1 | 0.344 | 0.364 | 0.806 | 0.160 | -3.248 | 0.282 | -0.440 | 0.029 |
2 | 0.363 | 0.354 | 0.696 | 0.249 | -2.510 | 1.257 | 0.425 | -0.025 |
3 | 0.196 | 0.189 | 0.437 | 0.248 | 0.993 | 0.194 | 0.465 | 0.001 |
4 | 0.200 | 0.212 | 0.590 | 0.160 | -0.191 | -0.629 | -0.498 | -0.041 |
5 | 0.227 | 0.229 | 0.437 | 0.187 | 0.255 | -0.459 | 0.780 | -0.001 |
6 | 0.204 | 0.233 | 0.518 | 0.090 | -0.329 | -1.614 | 0.055 | 0.023 |
7 | 0.197 | 0.209 | 0.499 | 0.169 | 0.286 | -0.691 | 0.087 | 0.012 |
8 | 0.165 | 0.162 | 0.536 | 0.267 | 1.058 | 0.479 | -0.485 | 0.009 |
9 | 0.138 | 0.116 | 0.434 | 0.362 | 2.380 | 1.391 | -0.098 | 0.023 |
10 | 0.151 | 0.151 | 0.483 | 0.223 | 1.306 | -0.210 | -0.291 | -0.028 |
For each observation there is a corresponding component score and there are as many components as there are input variables. Although there is a 1:1 correspondence at the row level, there is no such correspondence at the column level. Instead, each PCA score \(Z_j\) is a linear combination of all \(p\) input variables. Even if we were to proceed with using only \(Z_1\) in a linear model \[ Y_i = \theta_0 + \theta_1 Z_{i1} + \epsilon_i \] the model contains information from \(X_1\) through \(X_4\) because \[ Z_{i1} = \sum_{j=1}^p \phi_{j1}X_{ij} \]
Table 17.5 shows sample statistics for the raw data and the principal component scores. The sample variances of the scores decrease from \(Z_1\) through \(Z_4\). The first principal component, \(Z_1\), explains \(2.957/4 \times 100\% = 73.9\%\) of the variability in the input variables. The sum of the sample variances of the \(Z_j\) in Table 17.5 is (within rounding error) \[ 2.957 + 0.844 + 0.199 + 0.0006 = 4 \]
Statistic | \(X_1\) | \(X_2\) | \(X_3\) | \(X_4\) | \(Z_1\) | \(Z_2\) | \(Z_3\) | \(Z_4\) |
---|---|---|---|---|---|---|---|---|
Sample Mean | 0.218 | 0.222 | 0.544 | 0.211 | 0 | 0 | 0 | 0 |
Sample Sd | 0.076 | 0.081 | 0.122 | 0.075 | 1.719 | 0.918 | 0.445 | 0.024 |
Sample Var | | | | | 2.957 | 0.844 | 0.199 | 0.0006 |
% Variance | | | | | 73.9% | 21% | 5% | \(<\) 1% |
If explaining 90% of the variability in the inputs is sufficient, then we would use only \(Z_1\) and \(Z_2\) in subsequent analyses, a reduction in dimension of 50%. If 2/3 of the variability is deemed sufficient, then we would use only \(Z_1\), a reduction in dimension of 75%.
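The choice of the number of components can be automated from the eigenvalues alone. Below is a minimal sketch using the rounded variances from Table 17.5; any vector of eigenvalues works the same way.

```python
import numpy as np

# Sample variances of the principal component scores (eigenvalues)
eigenvalues = np.array([2.957, 0.844, 0.199, 0.0006])

prop_var = eigenvalues / eigenvalues.sum()   # proportion of variance per component
cum_var = np.cumsum(prop_var)                # cumulative proportion

print(np.round(prop_var, 3))   # approximately [0.739 0.211 0.05  0.   ]
print(np.round(cum_var, 3))    # approximately [0.739 0.95  1.    1.   ]

# Smallest number of components whose cumulative share reaches 90%
m = int(np.argmax(cum_var >= 0.90)) + 1
print(m)                       # 2
```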
Eigenvalue Decomposition
How did we derive the matrix of PCA loadings in Equation 17.3? Mathematically, this is accomplished with a singular value decomposition (SVD) of the matrix \(\textbf{X}\) or an eigenvalue decomposition of the cross-product matrix \(\textbf{X}^\prime\textbf{X}\) or the correlation matrix of \(\textbf{X}\). Here, \(\textbf{X}\) is the \((n \times p)\) matrix with typical element \(x_{ij}\). Prior to SVD or eigenvalue decomposition the matrix is often centered and scaled.
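Here is a minimal sketch of the SVD route, assuming the centering and scaling uses the sample standard deviation (\(n-1\) divisor): the right singular vectors of the centered-and-scaled data matrix are the PCA loadings, and the squared singular values divided by \(n-1\) are the eigenvalues of the correlation matrix. The eigendecomposition route is worked through in detail in the example that follows.

```python
import numpy as np

# The 10 x 4 data matrix from Table 17.3
X = np.array([
    [0.344167, 0.363625, 0.805833, 0.160446],
    [0.363478, 0.353739, 0.696087, 0.248539],
    [0.196364, 0.189405, 0.437273, 0.248309],
    [0.200000, 0.212122, 0.590435, 0.160296],
    [0.226957, 0.229270, 0.436957, 0.186900],
    [0.204348, 0.233209, 0.518261, 0.0895652],
    [0.196522, 0.208839, 0.498696, 0.168726],
    [0.165000, 0.162254, 0.535833, 0.266804],
    [0.138333, 0.116175, 0.434167, 0.361950],
    [0.150833, 0.150888, 0.482917, 0.223267],
])
n = X.shape[0]

# Center and scale each column by its sample standard deviation (n-1 divisor)
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Singular value decomposition Xs = U D V'
U, d, Vt = np.linalg.svd(Xs, full_matrices=False)

# Columns of V are the loadings (up to sign); d^2/(n-1) are the eigenvalues
print(np.round(Vt.T, 4))
print(np.round(d**2 / (n - 1), 4))   # approximately [2.9571 0.8438 0.1985 0.0006]
```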
Let’s apply that to the data in Table 17.3 to derive the PCA scores and loadings.
Example: Eigenanalysis of Correlation Matrix
You can obtain the eigenvectors and eigenvalues of a square matrix with the `eigen` function. Principal components can be computed with the `prcomp` or the `princomp` functions, both included in the base `stats` package. `prcomp` uses the singular-value decomposition of \(\textbf{X}\); `princomp` relies on the eigendecomposition of the covariance or correlation matrix.
First we create a data frame with the data in Table 17.3. There are ten observations of four variables.
data <- data.frame("x1"=c(0.344167, 0.363478, 0.196364, 0.2     , 0.226957,
                          0.204348, 0.196522, 0.165   , 0.138333, 0.150833),
                   "x2"=c(0.363625, 0.353739, 0.189405, 0.212122, 0.22927 ,
                          0.233209, 0.208839, 0.162254, 0.116175, 0.150888),
                   "x3"=c(0.805833, 0.696087, 0.437273, 0.590435, 0.436957,
                          0.518261, 0.498696, 0.535833, 0.434167, 0.482917),
                   "x4"=c(0.160446, 0.248539, 0.248309, 0.160296, 0.1869  ,
                          0.0895652,0.168726, 0.266804, 0.36195 , 0.223267)
                   )
data
x1 x2 x3 x4
1 0.344167 0.363625 0.805833 0.1604460
2 0.363478 0.353739 0.696087 0.2485390
3 0.196364 0.189405 0.437273 0.2483090
4 0.200000 0.212122 0.590435 0.1602960
5 0.226957 0.229270 0.436957 0.1869000
6 0.204348 0.233209 0.518261 0.0895652
7 0.196522 0.208839 0.498696 0.1687260
8 0.165000 0.162254 0.535833 0.2668040
9 0.138333 0.116175 0.434167 0.3619500
10 0.150833 0.150888 0.482917 0.2232670
Let’s take as input for the eigendecomposition the correlation matrix of the data:
c <- cor(data)
c
x1 x2 x3 x4
x1 1.0000000 0.9832580 0.8359297 -0.2761887
x2 0.9832580 1.0000000 0.8539085 -0.4397088
x3 0.8359297 0.8539085 1.0000000 -0.2890121
x4 -0.2761887 -0.4397088 -0.2890121 1.0000000
e <- eigen(c)
e$values
[1] 2.9570820972 0.8438197077 0.1985106437 0.0005875513
e$vectors
[,1] [,2] [,3] [,4]
[1,] -0.5550620 0.23539598 -0.45961271 0.65211273
[2,] -0.5741828 0.05247306 -0.33623520 -0.74465196
[3,] -0.5297817 0.20752636 0.82067111 0.05256496
[4,] 0.2855723 0.94803382 -0.04691458 -0.13220959
The `eigen` function returns the eigenvalues in `e$values` and the matrix of eigenvectors in `e$vectors`. Up to the signs of the last two columns (eigenvectors are determined only up to sign), the matrix of eigenvectors matches the matrix of PCA loadings in Equation 17.3.
The input matrix of the decomposition can be reconstructed from those:
A <- e$vectors %*% diag(e$values) %*% t(e$vectors)
round((c - A),4)
x1 x2 x3 x4
x1 0 0 0 0
x2 0 0 0 0
x3 0 0 0 0
x4 0 0 0 0
What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. The result is a \(10 \times 4\) matrix of linear transformations. These are the scores of the PCA.
data_c <- scale(data, center=TRUE, scale=TRUE)
scores <- data_c %*% e$vectors
scores
[,1] [,2] [,3] [,4]
[1,] -3.2478623 0.2815532 0.43998889 -0.0292581389
[2,] -2.5101127 1.2574623 -0.42484650 0.0254701453
[3,] 0.9925370 0.1935176 -0.46544557 -0.0005376776
[4,] -0.1908526 -0.6286757 0.49817675 0.0413147515
[5,] 0.2550067 -0.4593212 -0.77973866 0.0014573403
[6,] -0.3285776 -1.6137133 -0.05490332 -0.0226596733
[7,] 0.2862026 -0.6907719 -0.08654684 -0.0123166161
[8,] 1.0581238 0.4785600 0.48496730 -0.0088830663
[9,] 2.3797940 1.3913628 0.09775326 -0.0229367993
[10,] 1.3057412 -0.2099739 0.29059469 0.0283497342
The scores have mean 0, their variances equal the eigenvalues of the correlation matrix, their variances sum to \(p=4\), and they are uncorrelated.
round(apply(scores,2,mean),4)
[1] 0 0 0 0
round(apply(scores,2,var),4)
[1] 2.9571 0.8438 0.1985 0.0006
sum(apply(scores,2,var))
[1] 4
round(cor(scores),4)
[,1] [,2] [,3] [,4]
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
We use the `np.linalg.eig` function in NumPy to compute the eigenvalue decomposition of the correlation matrix in Python. A similar function that gives you more control over the computation is `scipy.linalg.eig`. To compute the full PCA you can use `PCA` from `sklearn.decomposition`; a short sketch using it follows the worked example below.
import numpy as np
import pandas as pd
data = pd.DataFrame({
    "x1": [0.344167, 0.363478, 0.196364, 0.2, 0.226957,
           0.204348, 0.196522, 0.165, 0.138333, 0.150833],
    "x2": [0.363625, 0.353739, 0.189405, 0.212122, 0.22927,
           0.233209, 0.208839, 0.162254, 0.116175, 0.150888],
    "x3": [0.805833, 0.696087, 0.437273, 0.590435, 0.436957,
           0.518261, 0.498696, 0.535833, 0.434167, 0.482917],
    "x4": [0.160446, 0.248539, 0.248309, 0.160296, 0.1869,
           0.0895652, 0.168726, 0.266804, 0.36195, 0.223267]
})
print(data)
x1 x2 x3 x4
0 0.344167 0.363625 0.805833 0.160446
1 0.363478 0.353739 0.696087 0.248539
2 0.196364 0.189405 0.437273 0.248309
3 0.200000 0.212122 0.590435 0.160296
4 0.226957 0.229270 0.436957 0.186900
5 0.204348 0.233209 0.518261 0.089565
6 0.196522 0.208839 0.498696 0.168726
7 0.165000 0.162254 0.535833 0.266804
8 0.138333 0.116175 0.434167 0.361950
9 0.150833 0.150888 0.482917 0.223267
Let’s take as input for the eigendecomposition the correlation matrix of the data:
# Compute correlation matrix
c = data.corr()
print("\nCorrelation matrix:")
Correlation matrix:
print(c)
x1 x2 x3 x4
x1 1.000000 0.983258 0.835930 -0.276189
x2 0.983258 1.000000 0.853909 -0.439709
x3 0.835930 0.853909 1.000000 -0.289012
x4 -0.276189 -0.439709 -0.289012 1.000000
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(c)
print(eigenvalues)
[2.9570821 0.84381971 0.00058755 0.19851064]
print(eigenvectors)
[[ 0.55506204 0.23539598 0.65211273 -0.45961271]
[ 0.57418284 0.05247306 -0.74465196 -0.3362352 ]
[ 0.52978172 0.20752636 0.05256496 0.82067111]
[-0.28557228 0.94803382 -0.13220959 -0.04691458]]
The `np.linalg.eig` function returns the eigenvalues and the matrix of eigenvectors. Comparing these results to Equation 17.3 we notice a change in sign and a change in the order. The change in order stems from the fact that `np.linalg.eig` does not return the eigenvalues ordered from largest to smallest. The change in sign is due to the fact that the eigenvectors are unique only up to sign.
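If you prefer the components in the conventional order, you can reorder the eigenpairs by decreasing eigenvalue. The short sketch below uses the `eigenvalues` and `eigenvectors` arrays from the previous chunk and assigns new names so the original arrays remain intact for the steps that follow.

```python
# Indices that sort the eigenvalues from largest to smallest (here: [0, 1, 3, 2])
order = np.argsort(eigenvalues)[::-1]
eval_sorted = eigenvalues[order]
evec_sorted = eigenvectors[:, order]

print(eval_sorted)   # approximately [2.9571 0.8438 0.1985 0.0006]
```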
The input matrix of the decomposition can be reconstructed from those:
A = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

print(f"\nDifference between original and reconstructed correlation matrix:")
Difference between original and reconstructed correlation matrix:
print(np.round(c - A, 4))
x1 x2 x3 x4
x1 0.0 0.0 0.0 -0.0
x2 0.0 0.0 0.0 -0.0
x3 0.0 0.0 0.0 -0.0
x4 -0.0 -0.0 -0.0 0.0
What happens if we multiply the \(10 \times 4\) data matrix with the \(4\times 4\) matrix of eigenvectors of the correlation matrix? First, we center and scale the data. Because `StandardScaler` uses the biased estimator of the sample variance (dividing by \(n\) rather than \(n-1\)), we further adjust the scaled data to have sample standard deviation 1. The result is a \(10 \times 4\) matrix of linear transformations. These are the scores of the PCA.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_c = scaler.fit_transform(data)
n = data_c.shape[0]
data_c = data_c * np.sqrt((n-1)/n)

# Compute principal component scores
scores = data_c @ eigenvectors
print(f"\nPrincipal component scores:")
Principal component scores:
print(scores)
[[ 3.24786231 0.28155323 -0.02925814 0.43998889]
[ 2.51011272 1.2574623 0.02547015 -0.4248465 ]
[-0.992537 0.19351765 -0.00053768 -0.46544557]
[ 0.19085256 -0.62867574 0.04131475 0.49817675]
[-0.2550067 -0.45932117 0.00145734 -0.77973866]
[ 0.32857764 -1.61371327 -0.02265967 -0.05490332]
[-0.28620258 -0.69077193 -0.01231662 -0.08654684]
[-1.05812379 0.47856002 -0.00888307 0.4849673 ]
[-2.379794 1.39136279 -0.0229368 0.09775326]
[-1.30574116 -0.20997387 0.02834973 0.29059469]]
The scores have mean 0, their variances equal the eigenvalues of the correlation matrix, their variances sum to \(p=4\), and they are uncorrelated.
scores_means = np.mean(scores, axis=0)
print(scores_means)
[ 0. -0. 0. -0.]
scores_vars = np.var(scores, axis=0, ddof=1)
print(scores_vars)
[2.9570821 0.84381971 0.00058755 0.19851064]
sum_vars = np.sum(scores_vars)
print(sum_vars)
4.0
# Correlation matrix of scores (should be identity matrix)
scores_corr = np.corrcoef(scores.T)
print(np.round(scores_corr, 4))
[[ 1. -0. -0. 0.]
[-0. 1. -0. 0.]
[-0. -0. 1. 0.]
[ 0. 0. 0. 1.]]
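As noted above, the same results can be obtained with `PCA` from `sklearn.decomposition`. The sketch below reuses the `data` DataFrame from this example and standardizes with pandas, whose `std()` uses the \(n-1\) divisor by default, so no further adjustment is needed. The explained variances and loadings should match the eigendecomposition results up to sign and rounding.

```python
from sklearn.decomposition import PCA

# Standardize with the sample standard deviation (n-1 divisor)
data_std = (data - data.mean()) / data.std()

pca = PCA(n_components=4)
scores_pca = pca.fit_transform(data_std)

print(np.round(pca.explained_variance_, 4))   # approximately [2.9571 0.8438 0.1985 0.0006]
print(np.round(pca.components_.T, 4))         # loadings, matching Equation 17.3 up to sign
```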
17.7 In-database Analytics
The data processing mechanism encountered in most examples so far can be described as follows:
- the data reside in a file or database external to the Python or R session
- the data are loaded from the file into a DataFrame in Python or into an R data frame
- the data loaded into the R or Python session are summarized
This seems obvious, and you might wonder whether there is any other way. If you start with a CSV file, how else could the data be processed? For a CSV file that is indeed the only option: neither the file itself nor the operating system accessing it can perform statistical calculations.
This is different, however, if the data are stored in an analytic database. For example, check out the page of statistical aggregation functions in the DuckDB documentation.
Suppose you want to calculate the 75th percentile of a variable or the correlation between two variables. Instead of bringing the data from DuckDB into a DataFrame and calling `quantile()` and `cor()`, you could ask the database to compute the quantile and the correlation. This is termed in-database analytics.
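For instance, the 75th percentile can be requested directly from DuckDB with its `quantile_cont` aggregate. The minimal sketch below uses the `fitness` table that also appears in the next example; only the single aggregated value is returned to the Python session.

```python
import duckdb

con = duckdb.connect(database="../ads.ddb", read_only=True)
# The aggregation runs inside DuckDB; .df() fetches just the one-row result
q75 = con.sql("select quantile_cont(Oxygen, 0.75) as q75 from fitness;").df()
print(q75)
con.close()
```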
As an example, the following code uses the `fitness` table in the `ads.ddb` database to compute correlations between `Oxygen` and `Age`, `Weight`, and `RunTime` for study subjects 40 years or older.
import duckdb

con = duckdb.connect(database="../ads.ddb", read_only=True)
query_string = '''select corr(Oxygen,Age) as cOxAge, \
                  corr(Oxygen,Weight) as cOxWgt, \
                  corr(Oxygen,RunTime) as cOxRT from fitness where age >= 40;'''
con.sql(query_string)
┌──────────────────────┬─────────────────────┬───────────────────┐
│ cOxAge │ cOxWgt │ cOxRT │
│ double │ double │ double │
├──────────────────────┼─────────────────────┼───────────────────┤
│ -0.14994998303951435 │ -0.2721520742498899 │ -0.86233530848729 │
└──────────────────────┴─────────────────────┴───────────────────┘
We could have accomplished the same result in the traditional way, pulling the `fitness` table from the database into a DataFrame and using the `corr` or `corrwith` method in pandas:
fit_df = con.sql("SELECT * from fitness;").df()
con.close()

subset = fit_df[fit_df['Age'] >= 40]
cols = ['Age','Weight','RunTime','Oxygen']
subset[cols].corrwith(subset['Oxygen'])
Age -0.149950
Weight -0.272152
RunTime -0.862335
Oxygen 1.000000
dtype: float64
In the second code chunk the entire `fitness` table is loaded into the DataFrame, consuming memory. For a data set with 31 observations and 7 variables that is not a big deal. For a data set with 31 million observations and 700 variables that can be a deal breaker.
The filter is not applied when the data are pulled into the DataFrame because another calculation might want a different set of rows; we need to have access to all the observations in the Python session. The in-database approach applies the filter as the query is executed inside the database; no raw data are moved.
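As a sketch of how the filter travels with the query, DuckDB also supports parameter binding, so the cutoff can be supplied from Python rather than pasted into the SQL string. A minimal illustration:

```python
import duckdb

age_cutoff = 40   # the filter value lives in Python; the filtering happens in DuckDB

con = duckdb.connect(database="../ads.ddb", read_only=True)
res = con.execute(
    "select corr(Oxygen, RunTime) as cOxRT from fitness where Age >= ?;",
    [age_cutoff],
).df()
print(res)
con.close()
```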
What are the advantages and disadvantages of this approach?
Advantages
In-database analytics is one example of pushing analytical calculations down to the data provider, rather than extracting rows of data and passing them to an analytic function in a different process (here, the Python session). Hadoop and Spark are other examples where computations are pushed to the data provider. Google’s BigQuery cloud data warehouse is an example where even complex machine learning algorithms can be pushed into a distributed data framework. The advantages listed here apply to all forms of push-down analytics.
Analytic databases such as DuckDB are designed for analytical work, that is, deep scans of tables that process a limited number of columns, filter data, and compute aggregations.
Analytic databases are optimized for these calculations and their performance is typically excellent; they take advantage of parallel processing where possible.
The data are not moved from the database into the Python session. This is a form of client–server processing where the database is the server and the Python session is the client. Only the results are sent to the client, which is a much smaller set of numbers than the raw data. This is a very important feature of in-database analytics. Many organizations resist moving data out of an operational system and making data copies on other devices. It presents a major data governance and security challenge. With in-database analytics, the rows of data remain in the database, only aggregations are sent to the calling client.
The same functionality can be accessed from multiple clients, as long as the clients are able to send SQL queries to the database.
Very large volumes of data can be processed since the data are already held in the database and do not need to be moved to a laptop or PC that would not have sufficient memory to hold the data frame. You do not need two copies of the data before making any calculations.
Disadvantages
There are a few downsides when you rely on the data provider to perform the statistical computations.
Limited support for statistical functionality. In most cases you are limited to simple aggregations and summarization. More complex statistical analyses, for example fitting models, running simulations, and performing optimizations, are beyond the scope of most data engines.
You have to express the analytics using the language or interface of the data provider. In most cases that is some form of SQL. You have to figure out the syntax and it can be more cumbersome than using R or Python functions.

Statistical calculations require memory and CPU/GPU resources. Database administrators (DBAs) might frown upon using resources for statistical work when the primary reason for the database is different—for example, reporting company metrics on dashboards. Even if the database could perform more advanced statistical computing such as modeling, optimizations, and simulations, DBAs do not want an operational database to be hammered by those.
Example
Below is an example of an R function that performs summarization of data stored in a DuckDB database using in-database analytics. The SELECT statement computes several statistics for a single column of a database table, specified in the `columnName` argument. The function supports a WHERE clause and a single GROUP BY variable. As an exercise, you can enhance the function to accept a list of analysis variables and a list of GROUP BY variables.
ducksummary <- function(tableName,
                        columnName,
                        whereClause=NULL,
                        groupBy=NULL,
                        dbName="../ads.ddb") {
    if (!is.null(tableName) & !is.null(columnName)) {
        if (!("duckdb" %in% (.packages()))) {
            suppressWarnings(library("duckdb"))
            message("duckdb library was loaded to execute ducksummary().")
        }
        con <- dbConnect(duckdb(), dbdir=dbName, read_only=TRUE)
        if (!is.null(groupBy)) {
            query_string <- paste("SELECT ", groupBy, ",")
        } else {
            query_string <- paste("SELECT ")
        }
        query_string <- paste(query_string,
                              "count("        , columnName, ") as N,"       ,
                              "min("          , columnName, ") as Min,"     ,
                              "quantile_cont(", columnName, ",0.25) as Q1," ,
                              "avg("          , columnName, ") as Mean,"    ,
                              "median("       , columnName, ") as Median,"  ,
                              "quantile_cont(", columnName, ",0.75) as Q3," ,
                              "max("          , columnName, ") as Max,"     ,
                              "stddev_samp("  , columnName, ") as StdDev"   ,
                              " from ", tableName)
        if (!is.null(whereClause)) {
            query_string <- paste(query_string, " WHERE ", whereClause)
        }
        if (!is.null(groupBy)) {
            query_string <- paste(query_string, " GROUP BY ", groupBy)
        }
        query_string <- paste(query_string, ";")
        #print(query_string)
        df_ <- dbGetQuery(con, query_string)
        dbDisconnect(con)
        return (df_)
    } else {
        return (NULL)
    }
}
The next statement requests a summary of the `Oxygen` variable in the `fitness` table for records where `Age` is at least 40.
ducksummary("fitness","Oxygen","Age >= 40")
N Min Q1 Mean Median Q3 Max StdDev
1 29 37.388 44.811 46.85245 46.672 49.156 59.571 4.91512
The next statement analyzes `Ripeness` of the bananas in the training data set by banana quality.
ducksummary("banana_train","Ripeness",groupBy="Quality")
Quality N Min Q1 Mean Median Q3 Max
1 Bad 2000 -7.423155 -1.6555529 0.06011369 0.1125737 1.734655 5.911077
2 Good 2000 -4.239602 0.4206124 1.52396766 1.4982581 2.641176 7.249034
StdDev
1 2.276053
2 1.644378
Note that the only data frames created in this approach are the data frames that hold the analysis results. The data remain where they are, in the database.
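For Python users, a sketch of an analogous helper is shown below. The name `ducksummary_py` is made up for this illustration; it builds the same kind of summary query and lets DuckDB perform the aggregation, returning only the result as a small DataFrame.

```python
import duckdb

def ducksummary_py(table_name, column_name, where_clause=None,
                   group_by=None, db_name="../ads.ddb"):
    """In-database summary of one column, optionally filtered and grouped."""
    select_cols = f"{group_by}, " if group_by else ""
    query = (
        f"SELECT {select_cols}"
        f"count({column_name}) AS N, "
        f"min({column_name}) AS Min, "
        f"quantile_cont({column_name}, 0.25) AS Q1, "
        f"avg({column_name}) AS Mean, "
        f"median({column_name}) AS Median, "
        f"quantile_cont({column_name}, 0.75) AS Q3, "
        f"max({column_name}) AS Max, "
        f"stddev_samp({column_name}) AS StdDev "
        f"FROM {table_name}"
    )
    if where_clause:
        query += f" WHERE {where_clause}"
    if group_by:
        query += f" GROUP BY {group_by}"
    con = duckdb.connect(database=db_name, read_only=True)
    try:
        # The aggregation runs inside DuckDB; only the summary rows are returned
        return con.sql(query).df()
    finally:
        con.close()

# Usage mirrors the R examples above
print(ducksummary_py("fitness", "Oxygen", "Age >= 40"))
print(ducksummary_py("banana_train", "Ripeness", group_by="Quality"))
```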