Data summarization is the numerical, tabular, and graphical distillation of the essence of a data set and the relationships between its variables through aggregation.
A single statistic such as the sample standard deviation is a summary, as is a cross-tabulation of the levels of two categorical variables, as is a series of box plots. The purposes of data summarization are many:
Profiling. Borne (2021) refers to it as “having that first date with your data.” We want to know what we are dealing with.
Description. What are the central tendencies and the dispersion of the variables? For example, what does a comparison of statistics measuring the central tendency tell us about the distribution of the data and the presence of outliers? What distributional assumptions are reasonable for the data? Are transformations in a feature processing step called for?
Aggregation. Suppose you could not store the raw data but you need to retain information for future statistical processing. What kind of information would you compute and squirrel away? A sufficient statistic is a function of the data that contains all information toward estimating a parameter of the data distribution. For example, the sample mean \(\frac{1}{n}\sum_{i=1}^n Y_i\) is sufficient to estimate the mean of a set of \(n\) independent and identically distributed random variables. If \(Y\) is uniform on \([0,\theta]\), then \(\max\{Y_i\}\) is sufficient for \(\theta\). The quantities we compute during summarization are frequently sufficient statistics; examples are sums, sums of squares, sums of crossproducts.
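As a minimal sketch of the idea (the data values below are made up for the illustration), a running count, sum, and sum of squares are enough to recover the sample mean and variance after the raw data are gone:

```r
y <- c(4.2, 5.1, 3.8, 6.0, 4.9)  # stand-in data for the illustration

# Squirrel away only three numbers
n      <- length(y)
sum_y  <- sum(y)     # sufficient for the mean
sum_y2 <- sum(y^2)   # together with sum_y, sufficient for the variance

# The raw data can now be discarded
mean_y <- sum_y / n
var_y  <- (sum_y2 - sum_y^2 / n) / (n - 1)
c(mean_y - mean(y), var_y - var(y))  # both differences are 0
```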
Relationships. Summarization is not only about individual variables, but also about their relationships. The correlation matrix of \(p\) numerical variables is a common summary that describes the pairwise linear relationships in the data.
Dimension Reduction. In high-dimensional statistical problems the number of potential input variables is larger than what we can handle (on computational grounds) or should handle (on statistical grounds). Summarization can reduce a \(p\)-dimensional problem to an \(m\)-dimensional problem where \(m \ll p\). Principal component analysis (PCA) relies on matrix factorization (eigenvalue or singular value decomposition) of a crossproduct matrix to find a set of \(m\) linear combinations of the \(p\) input variables that explain a substantial amount of variability in the data. The \(m\) linear combinations summarize the essence of the relationships in the data.
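As a brief sketch of the idea, R's built-in prcomp performs this factorization; the choice of the iris columns and of \(m=2\) here is just for illustration:

```r
# PCA of the four numeric iris variables, scaled to unit variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)            # proportion of variance explained by each component
scores <- pca$x[, 1:2]  # m = 2 linear combinations replacing the p = 4 inputs
```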
6.2 Location and Dispersion Statistics
Common location and dispersion measures for quantitative variables are shown in Table 6.1 and Table 6.2.
Table 6.1: Important statistics measuring location attributes of a variable in a sample of size \(n\). Sample mean, sample median, and sample mode are measures of the central tendency of a variable. \(Y^*\) denotes the ordered sequence of observations and \(Y^*[k]\) the value at the \(k\)th position in the ordered sequence. The min is defined as the smallest non-missing value because NaNs often sort as the smallest values in software packages.
| Sample Statistic | Symbol | Computation | Notes |
|:-----------------|:------:|:------------|:------|
| Min | | \(Y^*[1]\) | The smallest non-missing value |
| Max | | \(Y^*[n]\) | The largest value |
| Mean | \(\overline{Y}\) | \(\frac{1}{n}\sum_{i=1}^n Y_i\) | Most important location measure, but can be affected by outliers |
| Median | Med | \(Y^*\left[\frac{n+1}{2}\right]\) if \(n\) is odd; \(\frac{1}{2}\left(Y^*\left[\frac{n}{2}\right] + Y^*\left[\frac{n}{2}+1\right]\right)\) if \(n\) is even | Half of the observations are smaller than the median; robust against outliers |
| Mode | Mode | | The most frequent value; not useful when real numbers are unique |
| 1st Quartile | \(Q_1\) | \(Y^*\left[\frac{1}{4}(n+1)\right]\) | 25% of the observations are smaller than \(Q_1\) |
| 2nd Quartile | \(Q_2\) | See Median | 50% of the observations are smaller than \(Q_2\); this is the median |
| 3rd Quartile | \(Q_3\) | \(Y^*\left[\frac{3}{4}(n+1)\right]\) | 75% of the observations are smaller than \(Q_3\) |
| X% Percentile | \(P_X\) | \(Y^*\left[\frac{X}{100}(n+1)\right]\) | For example, 5% of the observations are larger than \(P_{95}\), the 95th percentile |
Table 6.2: Important statistics measuring dispersion (variability) of a variable in a sample of size \(n\).
| Sample Statistic | Symbol | Computation | Notes |
|:-----------------|:------:|:------------|:------|
| Range | \(R\) | \(Y^*[n] - Y^*[1]\) | Simply the largest minus the smallest value |
| Inter-quartile Range | IQR | \(Q_3 - Q_1\) | Used in constructing box plots; covers the central 50% of the data |
| Sample Variance | \(s^2\) | \(\frac{1}{n-1}\sum_{i=1}^n \left(Y_i - \overline{Y}\right)^2\) | Important statistical measure of dispersion; in squared units of \(Y\) |
stats Package
The following list of functions provides basic location and dispersion statistics in R (stats package). The min and max functions are provided by the base package.
Table 6.3: R summarization functions in stats.
| Function | Description | Notes |
|:---------|:------------|:------|
| `mean` | Sample (arithmetic) mean | Can be trimmed (`trim=`) |
| `median` | Sample median | |
| `sd` | Standard deviation | Uses \(n-1\) as denominator |
| `var` | Sample variance | Uses \(n-1\) as denominator; can also calculate the variance-covariance matrix of vectors |
| `range` | Minimum and maximum | Returns a vector with two values |
| `IQR` | Interquartile range | |
| `mad` | Median absolute deviation | Default metric for the center is the median |
| `quantile` | Sample quantiles | Give a list of probabilities in the `probs=` argument; the default is minimum, \(Q_1\), \(Q_2\), \(Q_3\), and maximum |
| `fivenum` | Tukey's five number summary | Min, lower hinge, median, upper hinge, maximum |
| `boxplot.stats` | Statistics to construct a box plot | The `stats` value contains the extreme of the lower whisker, lower hinge, median, upper hinge, and extreme of the upper whisker |
| `cov` | Sample covariance | Single statistic or covariance matrix |
| `cor` | Sample correlation | Single statistic or correlation matrix |
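For example, a quick sketch applying a few of these functions to the sepal length in the iris data:

```r
x <- iris$Sepal.Length
mean(x, trim = 0.1)                  # 10% trimmed mean
quantile(x, probs = c(0.05, 0.95))   # 5th and 95th percentiles
fivenum(x)                           # Tukey's five number summary
mad(x)                               # median absolute deviation
```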
The summary function in R is a generic function that produces summaries of an R object. The return value depends on the type of its argument. When summary is called on a data frame, it returns basic location statistics for the numeric variables and the counts-per-level for factors. Interestingly, it does not compute any measures of dispersion.
```r
summary(iris)
```

```
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
```
To compute some of the functions in Table 6.3 for multiple columns in a matrix or data frame, the apply function is very helpful. The following function call requests the mean for the numeric columns (the first four columns) of the iris data set.
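```r
apply(iris[, 1:4], 2, mean)
```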
The second argument of apply specifies the margin over which the function is applied: 1 implies calculations for rows, 2 implies calculations for columns. The following statements compute column and row sums for the 50 I. setosa observations in the iris data set.
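```r
# Column sums (margin = 2) for the I. setosa observations
apply(iris[iris$Species == "setosa", 1:4], 2, sum)

# Row sums (margin = 1) for the same observations
apply(iris[iris$Species == "setosa", 1:4], 1, sum)
```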
The dplyr package is part of the tidyverse, an “opinionated” collection of R packages for data science. Packages in the tidyverse include dplyr, ggplot2, tibble, tidyr, purrr, stringr, and readr. dplyr, tidyr, and ggplot2 are arguably the most important packages in the collection (that is my “opinionated” view).
The tidyverse packages share a common philosophy; that makes it easy to use code across multiple packages and to combine them. An example of this common philosophy is piping.
Suppose we use dplyr to compute the mean and standard deviation for the petal width of I. virginica observations in the iris data set. You can put this together in a single function call to the dplyr::summarize function.
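```r
library(dplyr)
summarize(filter(iris, Species == "virginica"),
          mean   = mean(Petal.Width),
          stdDev = sd(Petal.Width))
```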
A more elegant way of processing the data is with a series of steps, where the result of one step is passed automatically to the next step. In tidyverse syntax, such a step is called a pipe and indicated with %>%. Rewriting the previous summarize statement using pipes leads to
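```r
iris %>%
    filter(Species == "virginica") %>%
    summarize(mean = mean(Petal.Width), stdDev = sd(Petal.Width))
```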
The pipeline starts with the data frame iris. Since a pipe operation passes its input as the first argument to the next operation, the filter statement is really filter(iris, Species=="virginica").
If you want to save the result of this piping operation, simply assign it to an object:
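```r
# The object name virginica_summary is arbitrary
virginica_summary <- iris %>%
    filter(Species == "virginica") %>%
    summarize(mean = mean(Petal.Width), stdDev = sd(Petal.Width))
```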
In the previous examples we filtered and summarized on a single variable. If you want to calculate statistics for multiple variables, you can either repeat the statements or indicate that the operation should apply across multiple columns. In early releases of dplyr, the summarize_each function applied summary calculations to more than one column. That function has been deprecated in favor of across, which gathers multiple columns and can be used in other dplyr statements as well (filter, group_by, etc.).
The following statement computes the mean for all numeric columns of the iris data set.
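```r
iris %>% summarize(across(where(is.numeric), mean))
```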
The order in which the data manipulations occur matters greatly for the result. In the following statement, the observations are filtered for which Sepal.Length exceeds the average Sepal.Length by 10%.
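```r
iris %>% filter(Sepal.Length > 1.1 * mean(Sepal.Length))
```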
The average is computed across all 150 observations in the data set (50 observations for each species). By adding a group_by statement prior to the filter, the sepal lengths are compared to the species-specific means. The presence of the group_by statement conditions the subsequent filter to be applied separately for each of the three species.
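```r
iris %>%
    group_by(Species) %>%
    filter(Sepal.Length > 1.1 * mean(Sepal.Length))
```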
None of the I. setosa observations exceeded the overall average sepal length by 10%. But three of the I. setosa observations exceeded the average sepal length of that species by 10%.
6.3 Group-by Summarization
The preceding dplyr pipeline is an example of group-by processing: computing summaries separately for the levels of a qualitative variable (Species).
Using stats::aggregate
Group-by summarization is also possible with functions in the stats package. The aggregate function allows the use of the formula syntax that is common in many stats functions. In addition to the “model”, you need to specify the aggregation function you wish to apply. For example, to compute the means for petal length by species and the standard deviations for petal length and petal width by species, use the following syntax:
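```r
# Mean petal length by species
aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

# Standard deviations of petal length and petal width by species
aggregate(cbind(Petal.Length, Petal.Width) ~ Species, data = iris, FUN = sd)
```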
The group_by statement in dplyr is a simple technique for applying group-specific operations; it is thus one of the first statements you see in a summarization pipeline, right after the data frame:
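```r
iris %>%
    group_by(Species) %>%
    summarize(mean = mean(Petal.Width), stdDev = sd(Petal.Width))
```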
You can combine group-by execution with summarization across multiple columns. The following pipeline computes the arithmetic mean of all numeric columns in the iris data set and arranges the result by descending value of the average sepal width:
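```r
iris %>%
    group_by(Species) %>%
    summarize(across(where(is.numeric), mean)) %>%
    arrange(desc(Sepal.Width))
```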
Group-by computations can also be used if the grouping variable is continuous. We first create bins of the continuous variable, assign observations to the bins, and then group the summaries by the bins. The following code assigns observations to four bins based on the quartiles of Sepal.Length and then computes the average Sepal.Width for each level of Sepal.Length.
```r
library(dplyr)
iris_data <- iris

qs <- quantile(iris$Sepal.Length, probs=c(0, 0.25, 0.5, 0.75, 1))  # <1>
qs
iris_data$sep.len.cut <- cut(x=iris$Sepal.Length, breaks=qs)       # <2>

# Fix the assignment of the minimum value to the first category
iris_data$sep.len.cut[which.min(iris$Sepal.Length)] <-
    attr(iris_data$sep.len.cut, "levels")[1]

iris_data %>%
    group_by(sep.len.cut) %>%
    summarize(count=n(), mean=mean(Sepal.Width))
```

1. The quantile function computes the sample quantiles for the requested vector of probabilities; 0 and 1 are included to capture the minimum and the maximum value.
2. The cut function applies the computed quantiles as break points to bin the values of Sepal.Length.
The data processing mechanism encountered so far can be described as follows:

- the data reside in a file or database external to the R session
- the data are loaded from the file as a data frame into the R session
- the data frame is summarized
This seems obvious, and you might wonder whether there is any other way. If you start with a CSV file, how else could the data be processed? For CSV files that is indeed the case: the file itself, and the operating system accessing it, have no ability to perform statistical calculations.
This is different, however, if the data are stored in an analytic database. For example, check out the list of statistical aggregation functions available in DuckDB.
Suppose you want to calculate the 75th percentile of a variable or the correlation between two variables. Instead of bringing the data from DuckDB into an R data frame and calling quantile() and cor(), you could ask the database to compute the quantile and the correlation. This is termed in-database analytics.
As an example, the following code uses the fitness table in the ads.ddb database to compute correlations between Oxygen and Age, Weight, and RunTime for study subjects 40 years or older.
```r
library(duckdb)
con <- dbConnect(duckdb(), dbdir="ads.ddb", read_only=TRUE)

query_string <- "select corr(Oxygen,Age)     as cOxAge,
                        corr(Oxygen,Weight)  as cOxWgt,
                        corr(Oxygen,RunTime) as cOxRT
                 from fitness where age >= 40;"
dbGetQuery(con, query_string)
```
We could have accomplished the same result in the traditional way, pulling the fitness table from the database into an R data frame and using the cor function in R:
```r
con <- dbConnect(duckdb(), dbdir="ads.ddb", read_only=TRUE)
fit_df <- dbGetQuery(con, "select * from fitness;")
dbDisconnect(con)

set <- fit_df$Age >= 40
cor(fit_df[set, c("Age","Weight","RunTime")], fit_df$Oxygen[set])
```

```
              [,1]
Age     -0.1499500
Weight  -0.2721521
RunTime -0.8623353
```
In the second code chunk the entire fitness table is loaded into the R session, consuming memory. For a data set with 31 observations and 7 variables that is not a big deal. For a data set with 31 million observations and 700 variables that can be a deal breaker.
In the traditional approach the filter is applied as part of the correlation operation in R, after all the data have been loaded. Another calculation might apply a different filter, so we need access to all observations. In the in-database approach the filter is part of the query, and no data are moved.
What are the advantages and disadvantages of this approach?
Advantages
In-database analytics is one example of pushing analytical calculations down to the data provider, rather than extracting rows of data into an analytical engine such as R. Hadoop and Spark are other examples where computations are pushed into the data provider itself. Google’s BigQuery cloud data warehouse is an example where even complex machine learning algorithms can be pushed into a distributed data framework. The advantages listed below apply to all forms of push-down analytics.
- Analytic databases such as DuckDB are designed for analytical work, that is, deep scans of tables that process a limited number of columns, filter data, and compute aggregations.
- Analytic databases are optimized for these calculations and their performance is typically excellent; they take advantage of parallel processing where possible.
- The data are not moved from the database into the R session; only the results move between the two. This is a very important attribute of in-database analytics. You will find that many organizations resist moving data out of an operational system and making data copies on other devices, because it presents a major data governance and security challenge. With in-database analytics, the rows of data remain in the database; only aggregations are sent to the calling session.
- The same functionality can be accessed from multiple compute clients, as long as the clients are able to send SQL queries to the database.
- Very large volumes of data can be processed, since the data are already held in the database and do not need to be moved to a laptop or PC that might not have sufficient memory to hold the data frame. You do not need two copies of the data before making any calculations.
Disadvantages
There are a few downsides when you rely on the data provider to perform the statistical computations.
- Limited support for statistical functionality. In most cases you are limited to simple aggregations and summarizations; more complex statistical analyses, for example fitting models, simulations, and optimizations, are beyond the scope of most data engines.
- You have to express the analytics using the language or interface of the data provider, in most cases some form of SQL. You have to figure out the syntax, and it can be more cumbersome than the R or Python functions you are used to.
- Statistical calculations require memory and CPU/GPU resources. Database administrators (DBAs) might frown upon using those resources for statistical work when the primary purpose of the database is different, for example, reporting company metrics on dashboards. Even if the database could perform more advanced statistical computing such as modeling, optimization, and simulation, DBAs do not want an operational database to be hammered by those workloads.
Example
Below is an example of an R function that performs summarization of data stored in a DuckDB database using in-database analytics. The SELECT statement computes several statistics for a single column of the database, specified in the columnName argument. The function supports a WHERE clause and a single GROUP BY variable. As an exercise, you can enhance the function to accept a list of GROUP BY variables.
```r
ducksummary <- function(tableName, columnName,
                        whereClause=NULL, groupBy=NULL,
                        dbName="ads.ddb") {
    if (!is.null(tableName) & !is.null(columnName)) {
        if (!("duckdb" %in% (.packages()))) {
            suppressWarnings(library("duckdb"))
            message("duckdb library was loaded to execute ducksummary().")
        }
        con <- dbConnect(duckdb(), dbdir=dbName, read_only=TRUE)
        if (!is.null(groupBy)) {
            query_string <- paste("SELECT ", groupBy, ",")
        } else {
            query_string <- paste("SELECT ")
        }
        query_string <- paste(query_string,
            "count("        , columnName, ") as N,",
            "min("          , columnName, ") as Min,",
            "quantile_cont(", columnName, ",0.25) as Q1,",
            "avg("          , columnName, ") as Mean,",
            "median("       , columnName, ") as Median,",
            "quantile_cont(", columnName, ",0.75) as Q3,",
            "max("          , columnName, ") as Max,",
            "stddev_samp("  , columnName, ") as StdDev",
            " from ", tableName)
        if (!is.null(whereClause)) {
            query_string <- paste(query_string, " WHERE ", whereClause)
        }
        if (!is.null(groupBy)) {
            query_string <- paste(query_string, " GROUP BY ", groupBy)
        }
        # Adding the semicolon at the end is correct syntax, but not
        # necessary if you submit with dbGetQuery. It will add a
        # missing semicolon.
        query_string <- paste(query_string, ";")
        #print(query_string)
        df_ <- dbGetQuery(con, query_string)
        dbDisconnect(con)
        return (df_)
    } else {
        return (NULL)
    }
}
```
The next statement requests a summary of the Oxygen variable in the fitness table for records where Age is at least 40.
ducksummary("fitness","Oxygen","Age >= 40")
N Min Q1 Mean Median Q3 Max StdDev
1 29 37.388 44.811 46.85245 46.672 49.156 59.571 4.91512
The next statement analyzes Ripeness of the bananas in the training data set by banana quality.
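The table name is not shown in the output below; assuming the banana training data are stored in a table called banana_train, the call would look as follows:

```r
# "banana_train" is an assumed name for the table holding the training data
ducksummary("banana_train", "Ripeness", groupBy = "Quality")
```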
```
  Quality    N       Min         Q1       Mean    Median       Q3      Max   StdDev
1     Bad 2000 -7.423155 -1.6555529 0.06011369 0.1125737 1.734655 5.911077 2.276053
2    Good 2000 -4.239602  0.4206124 1.52396766 1.4982581 2.641176 7.249034 1.644378
```
Note that the only data frames created in this approach are the data frames that hold the analysis results. The data remain where they are: in the database.