30  Messy Data

When you google “messy data”, you mostly find data quality issues such as duplicate records, inconsistent formats, missing values, invalid entries, outliers, and so on. We discussed those data quality issues in Chapter 16. There is a much more fundamental level of messiness, beyond data cleanup, that makes modeling data more challenging. Here are examples of messy data causing messy modeling:


This is not an exhaustive list, and you will add to it through your own data science experiences. A first course on data science cannot address remedies for all of them, but we discuss some of the topics in this chapter.

30.1 Wrangling into Tabular Form

One interpretation of data messiness comes from the folks who brought us the tidyverse. Data are untidy (messy), in their opinion, if they violate the tidy data rules: each variable is in its own column and each observation is in its own row. Data often do not start out that way.

Wrangling data refers to the steps that organize data in a structured format suitable for analytic processing. The outcome of a wrangled data source is typically a table, a two-dimensional array of observations in rows and variables (features) in columns. There need not be any data quality issues; the data are simply not in the correct form for analytic processing.

Consider the classic airline passenger time series data from Box, Jenkins, and Reinsel (1976), showing monthly totals of international airline passengers between 1949 and 1960 (in thousands).

AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432

The data are arranged neatly, each year in a separate row, each month in a separate column. This is a great layout for time series analysis but maybe not the best layout for general analytic processing. The analysis variables are year, month, and the passenger count. Restructuring the AirPassengers data into a data frame with variables for year and count is easy because this data set is stored as an R time series.

str(AirPassengers)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...
dat <- data.frame(count=as.matrix(AirPassengers), 
                  time=time(AirPassengers),
                  year=round(time(AirPassengers),0))
head(dat)
  count     time year
1   112 1949.000 1949
2   118 1949.083 1949
3   132 1949.167 1949
4   129 1949.250 1949
5   121 1949.333 1949
6   135 1949.417 1949
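
A comparable restructuring can be sketched in pandas. The snippet below is illustrative only: pandas has no built-in AirPassengers object, so it uses the first year of counts copied from the table above, and the column names and the period_range construction are our own choices.

import pandas as pd

# First year (1949) of the AirPassengers series, copied from the table above
counts = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

# Build a monthly PeriodIndex and a long data frame with explicit year/month columns
idx = pd.period_range("1949-01", periods=len(counts), freq="M")
dat = pd.DataFrame({"count": counts,
                    "year": idx.year,
                    "month": idx.month})
print(dat.head())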

Gapminder Data

In general, the structure of a non-tabular format is not known to the software and must be wrangled by the user. We do that here for the Gapminder dataset. The Gapminder Foundation is a Swedish non-profit that promotes the United Nations sustainability goals through use of statistics. The Gapminder data set tracks economic and social indicators like life expectancy and the GDP per capita of countries over time.

A wide format of the Gapminder data is available on GitHub here. We follow the material in this excellent online resource to wrangle the wide-format data.

gap_wide <- read.csv(file="../datasets/gapminder_wide.csv")
str(gap_wide)
'data.frame':   142 obs. of  38 variables:
 $ continent     : chr  "Africa" "Africa" "Africa" "Africa" ...
 $ country       : chr  "Algeria" "Angola" "Benin" "Botswana" ...
 $ gdpPercap_1952: num  2449 3521 1063 851 543 ...
 $ gdpPercap_1957: num  3014 3828 960 918 617 ...
 $ gdpPercap_1962: num  2551 4269 949 984 723 ...
 $ gdpPercap_1967: num  3247 5523 1036 1215 795 ...
 $ gdpPercap_1972: num  4183 5473 1086 2264 855 ...
 $ gdpPercap_1977: num  4910 3009 1029 3215 743 ...
 $ gdpPercap_1982: num  5745 2757 1278 4551 807 ...
 $ gdpPercap_1987: num  5681 2430 1226 6206 912 ...
 $ gdpPercap_1992: num  5023 2628 1191 7954 932 ...
 $ gdpPercap_1997: num  4797 2277 1233 8647 946 ...
 $ gdpPercap_2002: num  5288 2773 1373 11004 1038 ...
 $ gdpPercap_2007: num  6223 4797 1441 12570 1217 ...
 $ lifeExp_1952  : num  43.1 30 38.2 47.6 32 ...
 $ lifeExp_1957  : num  45.7 32 40.4 49.6 34.9 ...
 $ lifeExp_1962  : num  48.3 34 42.6 51.5 37.8 ...
 $ lifeExp_1967  : num  51.4 36 44.9 53.3 40.7 ...
 $ lifeExp_1972  : num  54.5 37.9 47 56 43.6 ...
 $ lifeExp_1977  : num  58 39.5 49.2 59.3 46.1 ...
 $ lifeExp_1982  : num  61.4 39.9 50.9 61.5 48.1 ...
 $ lifeExp_1987  : num  65.8 39.9 52.3 63.6 49.6 ...
 $ lifeExp_1992  : num  67.7 40.6 53.9 62.7 50.3 ...
 $ lifeExp_1997  : num  69.2 41 54.8 52.6 50.3 ...
 $ lifeExp_2002  : num  71 41 54.4 46.6 50.6 ...
 $ lifeExp_2007  : num  72.3 42.7 56.7 50.7 52.3 ...
 $ pop_1952      : num  9279525 4232095 1738315 442308 4469979 ...
 $ pop_1957      : num  10270856 4561361 1925173 474639 4713416 ...
 $ pop_1962      : num  11000948 4826015 2151895 512764 4919632 ...
 $ pop_1967      : num  12760499 5247469 2427334 553541 5127935 ...
 $ pop_1972      : num  14760787 5894858 2761407 619351 5433886 ...
 $ pop_1977      : num  17152804 6162675 3168267 781472 5889574 ...
 $ pop_1982      : num  20033753 7016384 3641603 970347 6634596 ...
 $ pop_1987      : num  23254956 7874230 4243788 1151184 7586551 ...
 $ pop_1992      : num  26298373 8735988 4981671 1342614 8878303 ...
 $ pop_1997      : num  29072015 9875024 6066080 1536536 10352843 ...
 $ pop_2002      : int  31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
 $ pop_2007      : int  33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
import pandas as pd

gap_wide = pd.read_csv("../datasets/gapminder_wide.csv")
gap_wide.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142 entries, 0 to 141
Data columns (total 38 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   continent       142 non-null    object 
 1   country         142 non-null    object 
 2   gdpPercap_1952  142 non-null    float64
 3   gdpPercap_1957  142 non-null    float64
 4   gdpPercap_1962  142 non-null    float64
 5   gdpPercap_1967  142 non-null    float64
 6   gdpPercap_1972  142 non-null    float64
 7   gdpPercap_1977  142 non-null    float64
 8   gdpPercap_1982  142 non-null    float64
 9   gdpPercap_1987  142 non-null    float64
 10  gdpPercap_1992  142 non-null    float64
 11  gdpPercap_1997  142 non-null    float64
 12  gdpPercap_2002  142 non-null    float64
 13  gdpPercap_2007  142 non-null    float64
 14  lifeExp_1952    142 non-null    float64
 15  lifeExp_1957    142 non-null    float64
 16  lifeExp_1962    142 non-null    float64
 17  lifeExp_1967    142 non-null    float64
 18  lifeExp_1972    142 non-null    float64
 19  lifeExp_1977    142 non-null    float64
 20  lifeExp_1982    142 non-null    float64
 21  lifeExp_1987    142 non-null    float64
 22  lifeExp_1992    142 non-null    float64
 23  lifeExp_1997    142 non-null    float64
 24  lifeExp_2002    142 non-null    float64
 25  lifeExp_2007    142 non-null    float64
 26  pop_1952        142 non-null    float64
 27  pop_1957        142 non-null    float64
 28  pop_1962        142 non-null    float64
 29  pop_1967        142 non-null    float64
 30  pop_1972        142 non-null    float64
 31  pop_1977        142 non-null    float64
 32  pop_1982        142 non-null    float64
 33  pop_1987        142 non-null    float64
 34  pop_1992        142 non-null    float64
 35  pop_1997        142 non-null    float64
 36  pop_2002        142 non-null    int64  
 37  pop_2007        142 non-null    int64  
dtypes: float64(34), int64(2), object(2)
memory usage: 42.3+ KB

The data are stored in wide format: annual values for the variables GDP, life expectancy, and population appear in separate columns. This is not unlike how the AirPassengers data is displayed, but gap_wide is not a time series object. The desired (“tidy”) way of structuring the data, where each variable is a column and each observation is a row, is a tabular structure with the variables

  • Continent
  • Country
  • Year
  • GDP
  • Life Expectancy
  • Population

For each combination of continent and country there will be 12 observations, one for each year.

Step 1: gather (melt) columns

To restructure the Gapminder data set from wide format into the desired format takes several steps.

To move from wide to the desired long format, we use the tidyr::gather function. To do the opposite, moving from long to wide format, use the tidyr::spread function.

The first step is to gather the columns that contain the values for GDP, life expectancy, and population into separate rows. Those are all columns in gap_wide except continent and country, so we can exclude those from the gathering operation with -c(continent, country).

library(tidyverse)
gap_step1 <- gap_wide %>% 
    gather(key=vartype_year,
          value=values,
          -c(continent, country))
head(gap_step1)
  continent      country   vartype_year    values
1    Africa      Algeria gdpPercap_1952 2449.0082
2    Africa       Angola gdpPercap_1952 3520.6103
3    Africa        Benin gdpPercap_1952 1062.7522
4    Africa     Botswana gdpPercap_1952  851.2411
5    Africa Burkina Faso gdpPercap_1952  543.2552
6    Africa      Burundi gdpPercap_1952  339.2965
tail(gap_step1)
     continent        country vartype_year   values
5107    Europe         Sweden     pop_2007  9031088
5108    Europe    Switzerland     pop_2007  7554661
5109    Europe         Turkey     pop_2007 71158647
5110    Europe United Kingdom     pop_2007 60776238
5111   Oceania      Australia     pop_2007 20434176
5112   Oceania    New Zealand     pop_2007  4115771

The resulting data frame has 5,112 rows and 4 columns. How does the number of rows come about?

To move from wide to the desired long format, we use the melt method of the pandas DataFrame. To do the opposite, moving from long to wide format, use the pivot method.

The first step is to gather the columns that contain the values for GDP, life expectancy, and population into separate rows. Those are all columns in gap_wide except continent and country, so we can exclude those from the melting operation by designating them as identifiers id_vars=['continent', 'country'].

gap_step1 = gap_wide.melt(
    id_vars=['continent', 'country'],
    var_name='vartype_year',
    value_name='values'
)

gap_step1.shape
(5112, 4)
print(gap_step1.head())
  continent       country    vartype_year       values
0    Africa       Algeria  gdpPercap_1952  2449.008185
1    Africa        Angola  gdpPercap_1952  3520.610273
2    Africa         Benin  gdpPercap_1952  1062.752200
3    Africa      Botswana  gdpPercap_1952   851.241141
4    Africa  Burkina Faso  gdpPercap_1952   543.255241
print(gap_step1.tail())
     continent         country vartype_year      values
5107    Europe     Switzerland     pop_2007   7554661.0
5108    Europe          Turkey     pop_2007  71158647.0
5109    Europe  United Kingdom     pop_2007  60776238.0
5110   Oceania       Australia     pop_2007  20434176.0
5111   Oceania     New Zealand     pop_2007   4115771.0

The resulting data frame has 5,112 rows and 4 columns. How does the number of rows come about?

  • There are 142 combinations of continent and country (the number of observations in gap_wide).
  • Three variables (gdpPercap, lifeExp, pop) are measured for 12 years, that is, 36 value columns.
  • Gathering the 36 value columns for each of the 142 rows yields 142 × 36 = 5,112 rows.

Step 2: separate character variables

The gather function turns all variables into key-value pairs; the key is the name of the variable. The column that contains the names of the variables after the gathering is called vartype_year. In the next step tidyr::separate is called to split the character column vartype_year into two columns, one for the variable type and one for the year. The convert=TRUE option attempts to convert the data type of the new columns from character to numeric data; this fails for the vartype column but succeeds for the year column.

gap_step2 <- gap_wide %>% 
    gather(key =vartype_year,
          value=values,
          -c(continent, country)) %>%
    separate(vartype_year,
           into   =c('vartype','year'),
           sep    ="_",
           convert=TRUE)
str(gap_step2)
'data.frame':   5112 obs. of  5 variables:
 $ continent: chr  "Africa" "Africa" "Africa" "Africa" ...
 $ country  : chr  "Algeria" "Angola" "Benin" "Botswana" ...
 $ vartype  : chr  "gdpPercap" "gdpPercap" "gdpPercap" "gdpPercap" ...
 $ year     : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
 $ values   : num  2449 3521 1063 851 543 ...
head(gap_step2)
  continent      country   vartype year    values
1    Africa      Algeria gdpPercap 1952 2449.0082
2    Africa       Angola gdpPercap 1952 3520.6103
3    Africa        Benin gdpPercap 1952 1062.7522
4    Africa     Botswana gdpPercap 1952  851.2411
5    Africa Burkina Faso gdpPercap 1952  543.2552
6    Africa      Burundi gdpPercap 1952  339.2965
tail(gap_step2)
     continent        country vartype year   values
5107    Europe         Sweden     pop 2007  9031088
5108    Europe    Switzerland     pop 2007  7554661
5109    Europe         Turkey     pop 2007 71158647
5110    Europe United Kingdom     pop 2007 60776238
5111   Oceania      Australia     pop 2007 20434176
5112   Oceania    New Zealand     pop 2007  4115771

The melt method turns all variables into key-value pairs; the key is the name of the variable. The column that contains the names of the variables after the gathering is called vartype_year. In the next step str.split() is used to split the character column vartype_year into two columns, one for the variable type and one for the year.

gap_step2 = gap_wide.melt(
    id_vars=['continent', 'country'],
    var_name='vartype_year', 
    value_name='values'
)

gap_step2[['vartype', 'year']] = gap_step2['vartype_year'].str.split('_', expand=True)
gap_step2['year'] = gap_step2['year'].astype(int)  # equivalent to convert=TRUE
gap_step2 = gap_step2.drop(columns=['vartype_year'])

print(gap_step2.dtypes)
continent     object
country       object
values       float64
vartype       object
year           int64
dtype: object
print(gap_step2.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5112 entries, 0 to 5111
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   continent  5112 non-null   object 
 1   country    5112 non-null   object 
 2   values     5112 non-null   float64
 3   vartype    5112 non-null   object 
 4   year       5112 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 199.8+ KB
None

Step 3: spread (pivot) rows into columns

We are almost there, but not quite. We now have columns for continent, country, and year, but the values column contains values of different types: 1,704 observations for GDP, 1,704 observations for life expectancy, and 1,704 observations for population.

To create separate columns from the rows we can reverse the gather operation with the spread function—it splits key-value pairs across multiple columns. The entries in the vartype column are used by the spreading operation as the names of the new columns.

gapminder <- 
    gap_wide %>% 
    gather(key =vartype_year,
          value=values,
          -c(continent, country)) %>%
    separate(vartype_year,
           into   =c('vartype','year'),
           sep    ="_",
           convert=TRUE) %>%
    spread(vartype, values)

head(gapminder)
  continent country year gdpPercap lifeExp      pop
1    Africa Algeria 1952  2449.008  43.077  9279525
2    Africa Algeria 1957  3013.976  45.685 10270856
3    Africa Algeria 1962  2550.817  48.303 11000948
4    Africa Algeria 1967  3246.992  51.407 12760499
5    Africa Algeria 1972  4182.664  54.518 14760787
6    Africa Algeria 1977  4910.417  58.014 17152804
tail(gapminder)
     continent     country year gdpPercap lifeExp     pop
1699   Oceania New Zealand 1982  17632.41  73.840 3210650
1700   Oceania New Zealand 1987  19007.19  74.320 3317166
1701   Oceania New Zealand 1992  18363.32  76.330 3437674
1702   Oceania New Zealand 1997  21050.41  77.550 3676187
1703   Oceania New Zealand 2002  23189.80  79.110 3908037
1704   Oceania New Zealand 2007  25185.01  80.204 4115771

In summary, we used gather to create one key-value pair of the variable_year columns and values in step 1, separated out the year in step 2, and spread the remaining variable types back out into columns in step 3.

To create separate columns from the rows we can reverse the melt operation with the pivot method—it splits key-value pairs across multiple columns. The entries in the vartype column are used by the spreading operation as the names of the new columns.

gapminder = gap_step2.pivot(
    index=['continent', 'country', 'year'],
    columns='vartype',
    values='values'
).reset_index()

print(gapminder.head())
vartype continent  country  year    gdpPercap  lifeExp         pop
0          Africa  Algeria  1952  2449.008185   43.077   9279525.0
1          Africa  Algeria  1957  3013.976023   45.685  10270856.0
2          Africa  Algeria  1962  2550.816880   48.303  11000948.0
3          Africa  Algeria  1967  3246.991771   51.407  12760499.0
4          Africa  Algeria  1972  4182.663766   54.518  14760787.0
print(gapminder.tail())
vartype continent      country  year    gdpPercap  lifeExp        pop
1699      Oceania  New Zealand  1987  19007.19129   74.320  3317166.0
1700      Oceania  New Zealand  1992  18363.32494   76.330  3437674.0
1701      Oceania  New Zealand  1997  21050.41377   77.550  3676187.0
1702      Oceania  New Zealand  2002  23189.80135   79.110  3908037.0
1703      Oceania  New Zealand  2007  25185.00911   80.204  4115771.0

In summary, we used melt to create one key-value pair of the variable_year columns and values in step 1, separated out the year in step 2, and spread the remaining variable types back out into columns in step 3 (using pivot).

30.2 Low Signal-to-Noise Ratio

Low signal-to-noise ratio (SNR) means the true underlying patterns you want to detect are weak relative to the random variation in your data. The reason could be a weak (or no) signal, lots of variability in the data, or both. We saw in Figure 23.2 that what appears to be a signal can be all random noise.
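
A small simulated example illustrates the instability; this is a sketch with arbitrary numbers, unrelated to the case studies below. The true slope is weak relative to the noise, so the estimated slopes vary widely from sample to sample and can even change sign.

import numpy as np

rng = np.random.default_rng(1)
slopes = []
for _ in range(5):
    x = rng.normal(size=100)
    # true slope 0.05 with noise sd 1: the signal explains well under 1% of the variance
    y = 0.05 * x + rng.normal(scale=1.0, size=100)
    slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of a straight line
    slopes.append(slope)
print(np.round(slopes, 3))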

Low SNR creates several fundamental challenges with statistical power and model stability:

  • More difficult to distinguish real patterns from random fluctuations
  • Increased risk of false positives (finding patterns that are not there)
  • Need for larger sample sizes to detect true effects
  • Model performance becomes highly sensitive to sampling variation
  • Small changes in training data can lead to dramatically different models
  • Overfitting becomes more likely as models chase the noise
  • Cross-validation results become unreliable and highly variable
  • Predictions have wide confidence intervals

Real-World Examples

Financial Markets
Stock price prediction often has extremely low SNR. Daily stock returns might have a signal (based on fundamentals, sentiment, etc.) that explains only 1-5% of the variance, with the rest being essentially random market noise. This is why most quantitative trading strategies have very modest information ratios.

Medical Diagnostics from Wearables
Detecting early signs of illness from smartphone accelerometer data has very low SNR. The subtle changes in gait or movement patterns that indicate health issues are tiny compared to normal variation in how people move, differences in phone placement, walking surfaces, etc.

Genomics
Identifying genetic variants associated with complex diseases often involves very small effect sizes. A genetic variant might increase disease risk by 1.1x, but this signal is buried in massive noise from environmental factors, other genes, and measurement error across thousands of genetic markers.

Social Media Sentiment Analysis
Predicting market movements or election outcomes from Twitter sentiment has notoriously low SNR. Real predictive signals are overwhelmed by bots, sarcasm, regional biases, and the general noise of social media discourse.

Industrial Internet of Things (IoT)
Predicting equipment failure from sensor data often has low SNR because early failure signals are subtle vibrations or temperature changes that are small relative to normal operational variation and sensor noise.

How to Address Low SNR in Machine Learning

  • Data Strategy
    • Collect more data: the most straightforward approach, since signal improves with sample size while noise averages out
    • Improve measurement quality: better sensors, more precise instruments,
      standardized collection protocols
    • Engineer better features: domain knowledge can help extract signal more effectively than raw measurements
    • Aggregate temporal data: smooth out noise by averaging over time windows
  • Modeling Approaches
    • Regularization: L1/L2 penalties, elastic net to prevent overfitting to noise
    • Ensemble methods: random forests, gradient boosting to average out noise across multiple models
    • Dimension reduction: PCA, factor analysis to focus on major sources of variation
    • Robust methods: techniques less sensitive to outliers and noise
  • Advanced Techniques
    • Denoising autoencoders: learn to reconstruct clean signals from noisy inputs
    • Multi-task learning: leverage related tasks to improve signal detection
    • Transfer learning: use pre-trained models that have learned general patterns
    • Bayesian approaches: explicitly model uncertainty and incorporate prior knowledge
  • Validation and Evaluation
    • Nested cross-validation: more robust performance estimates
    • Repeated experiments: run multiple random seeds to assess stability
    • Out-of-time validation: especially important when SNR varies over time
    • Test statistical significance: ensure results are not just due to chance
  • Domain-Specific Approaches
    • Signal processing: filtering, spectral analysis for time series data
    • Causal inference: instrumental variables, natural experiments to isolate true effects
    • Meta-analysis: combine results across multiple studies to boost effective sample size

The key insight is that with low SNR, you often need to fundamentally change your approach rather than just tweaking hyperparameters. The focus shifts from optimizing predictive accuracy to ensuring that you are detecting real patterns rather than fitting noise.

30.3 Unbalanced Data

The Issue

Suppose we have a two-category classification problem. If the two classes appear with roughly the same frequencies, we refer to this as a balanced situation. Imbalance exists when the two classes have very different sample frequencies. The following confusion matrix is an example of a highly imbalanced data set: (958 + 7)/1,000 = 0.965 of the observations fall into the majority (“No”) class and only 0.035 into the minority (“Yes”) class.

Example of a confusion matrix for 1,000 observations.

                         Observed Category
Predicted Category    Yes (Positive)    No (Negative)
Yes (Positive)                     9                7
No (Negative)                     26              958

The problem with highly unbalanced data sets is that they tend to favor the majority class in training. Even a completely uninformative model that would always predict the majority class (without looking at any data) would have a very high accuracy. Such a model would predict most observations correctly simply because almost all observations fall into that class. This is a problem because in most applications where data are imbalanced with respect to the target variable, the category of interest (the event category) is the minority category. When we investigate equipment failures, fraud cases, manufacturing defects, rare diseases, etc., we model rare events. Algorithms need to learn to find needles in a haystack. Furthermore, accuracy is often not the relevant model performance criterion. Sensitivity (recall), specificity, and precision are then more relevant.
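
As a quick check of these criteria on the confusion matrix above, the following snippet computes accuracy, sensitivity, specificity, and precision from the four cell counts (the variable names are ours).

# Cell counts from the confusion matrix above
TP, FP = 9, 7      # predicted Yes: observed Yes / observed No
FN, TN = 26, 958   # predicted No:  observed Yes / observed No
n = TP + FP + FN + TN

accuracy    = (TP + TN) / n       # (9 + 958)/1000 = 0.967, driven by the majority class
sensitivity = TP / (TP + FN)      # recall of the rare event: 9/35 = 0.257
specificity = TN / (TN + FP)      # 958/965 = 0.993
precision   = TP / (TP + FP)      # 9/16 = 0.5625
print(accuracy, sensitivity, specificity, precision)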

To help the algorithm better learn the minority class of interest, you can adjust its focus on the minority class by using a weighted analysis or by adjusting the proportion of events to non-events. You can think of this as a data pre-processing step and it involves reducing the number of cases in the majority class (downsampling) or increasing the number of cases in the minority class (upsampling).

Downsampling eliminates observations in the majority class at random until the ratio of events to non-events meets a target value. It is not necessary to downsample to a 50:50 ratio between events and non-events; you have to achieve an event ratio at which the algorithm can properly learn the minority class. You can monitor how balanced accuracy (or AUC) changes with the downsampling rate.

Problems with downsampling are the loss of information and the reduction in sample size. When working with qualitative input variables you need to ensure that downsampling does not drop levels of the input variable. For example, if you have three age groups in the data set and only two remain after downsampling, the model will not be able to make predictions for the eliminated age group. Downsampling reduces computational cost since it reduces the size of the training set and should be considered when you have large sample sizes to begin with.
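
A minimal sketch of random downsampling with NumPy is shown below; the function name and default target ratio are ours, and in practice you would typically rely on a library implementation such as imblearn's RandomUnderSampler used later in this section.

import numpy as np

def downsample_majority(X, y, ratio=0.5, seed=None):
    """Keep all minority (y == 1) rows; randomly keep majority (y == 0) rows
    so that the count of minority rows divided by majority rows equals `ratio`."""
    rng = np.random.default_rng(seed)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    n_keep = min(int(len(minority) / ratio), len(majority))
    keep = rng.choice(majority, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([minority, keep]))
    return X[idx], y[idx]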

Upsampling (also known as oversampling) the minority class avoids the problem of information loss. Here we add observations in the minority class to the training data until a target ratio of events and non-events (not necessarily 50:50) is achieved. While it preserves information in the data, simply randomly duplicating observations in the minority class can lead to overfitting. By duplicating data you are enticing the algorithm to learn the training data too closely.

To avoid the overfitting problem with random upsampling, methods have been developed that add not just duplicate data points. SMOTE (Synthetic Minority Oversampling Technique) is an interpolation technique that generates new data points in the minority class (Chawla et al. 2002). SMOTE works as follows:

  • Select an observation in the minority class at random.
  • Find its \(k\) nearest neighbors; this is often based on a \(k\)-NN analysis of the minority observations with \(k=5\).
  • Choose one of the \(k\) nearest neighbors at random.
  • Interpolate between the randomly chosen observation (in step 1) and the chosen nearest neighbor, taking a random step toward the neighbor in feature space.
  • Generate the new data point as a convex (weighted) combination of the observation and the neighbor.

You can think of SMOTE as a data augmentation technique.
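
To make the interpolation concrete, here is a minimal NumPy sketch that generates one synthetic observation; the function name is ours, and the imblearn implementation used below handles many more details (such as the target sampling ratio and generating as many points as needed).

import numpy as np

def smote_sample(X_minority, k=5, seed=None):
    """Generate one synthetic minority-class point by SMOTE-style interpolation."""
    rng = np.random.default_rng(seed)
    # Step 1: pick a minority observation at random
    x = X_minority[rng.integers(len(X_minority))]
    # Step 2: find its k nearest minority-class neighbors (excluding itself)
    dist = np.linalg.norm(X_minority - x, axis=1)
    neighbors = np.argsort(dist)[1:k + 1]
    # Step 3: choose one of the k neighbors at random
    z = X_minority[rng.choice(neighbors)]
    # Steps 4-5: take a random step toward the neighbor, i.e., a convex combination
    lam = rng.uniform()
    return x + lam * (z - x)

# toy example: six 2-D minority observations
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [0.2, 0.8]], dtype=float)
print(smote_sample(X_min, k=3, seed=42))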


If data sets are highly unbalanced, upsampling will increase the computational burden. This effect is exacerbated if the target event ratio is close to 50:50 or if the data set is large to begin with. To avoid this problem and the respective pitfalls of upsampling and downsampling, the methods can be combined: downsample some observations in the majority class (without sacrificing too much information) and upsample observations in the minority class using SMOTE. This is the procedure suggested in Chawla et al. (2002).

Application of SMOTE

This application draws on Brownlee (2021).

We begin by simulating a two-category classification data set with high imbalance between majority and minority class. 99% of the 10,000 observations fall into the majority class (Figure 30.1).

import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.datasets import make_classification

X, y = make_classification(
        n_samples=10000, 
        n_features=2, 
        n_redundant=0,
        n_clusters_per_class=1, 
        weights=[0.99], 
        flip_y=0, 
        random_state=34)

# Summarize class distribution
counter = Counter(y)
print(counter)
Counter({0: 9900, 1: 100})
plt.figure(figsize=(10, 6))
for label, _ in counter.items():
    row_ix = np.where(y == label)[0]
    plt.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label));
plt.legend();
plt.tight_layout();
plt.show();
Figure 30.1: Simulated two-category data with 99% majority class unbalance.

Next we oversample the minority class with SMOTE. Note that by default the SMOTE implementation in imblearn achieves a 50:50 ratio of events and non-events (Figure 30.2).

from imblearn.over_sampling import SMOTE

oversample = SMOTE()
Xover, yover = oversample.fit_resample(X, y)

counter = Counter(yover)
print(counter)
Counter({0: 9900, 1: 9900})
plt.figure(figsize=(10, 6))
for label, _ in counter.items():
    row_ix = np.where(yover == label)[0]
    plt.scatter(Xover[row_ix, 0], Xover[row_ix, 1], label=str(label));
plt.legend();
plt.tight_layout();
plt.show();
Figure 30.2: Oversampled data with SMOTE.

The next step uses a pipeline to combine oversampling of the minority class with undersampling of the majority class. The parameters of SMOTE and RandomUnderSampler are chosen so that SMOTE first grows the minority class to 10% of the size of the majority class (sampling_strategy=0.1), and RandomUnderSampler then shrinks the majority class to twice the size of the minority class (sampling_strategy=0.5).

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# define pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
Xoverunder, yoverunder = pipeline.fit_resample(X, y)

counter = Counter(yoverunder)
print(counter)
Counter({0: 1980, 1: 990})
plt.figure(figsize=(10, 6))
for label, _ in counter.items():
    row_ix = np.where(yoverunder == label)[0]
    plt.scatter(Xoverunder[row_ix, 0], 
                Xoverunder[row_ix, 1], label=str(label));
plt.legend();
plt.tight_layout();
plt.show();
Figure 30.3: Oversampled minority class and undersampled majority class.

How does changing the balance of the categories affect the performance of a classification model? We choose a decision tree here and estimate its performance with 10-fold cross-validation repeated 3 times, measuring performance by the average balanced accuracy.

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=565)
scores = cross_val_score(model, 
                         X, y, 
                         scoring='balanced_accuracy', 
                         cv=cv)
print('Mean Balanced Accuracy: %.3f' % np.mean(scores))
Mean Balanced Accuracy: 0.963

With three repeats of 10-fold cross-validation, the average balanced accuracy of 0.963 is computed across thirty decision trees.


Next we use the over-and-undersampled data and apply the same cross-validation approach.

model = DecisionTreeClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=565)
scores = cross_val_score(model, 
                         Xoverunder, yoverunder, 
                         scoring='balanced_accuracy', 
                         cv=cv)
print('Mean Balanced Accuracy: %.3f' % np.mean(scores))
Mean Balanced Accuracy: 0.997

This signals a big improvement in balanced accuracy, but it is an overestimate of the improvement. Since we oversampled and undersampled the training data, the cross-validation folds are selected from the modified data, not the original data.

To get a true estimate of model performance, where the test data for cross-validation has the same structure as the original, unmodified data, we can set up a pipeline that first transforms the data, then fits the model.

model = DecisionTreeClassifier()

steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, 
                         X, y, 
                         scoring='balanced_accuracy', 
                         cv=cv)
print('Mean Balanced Accuracy: %.3f' % np.mean(scores))
Mean Balanced Accuracy: 0.976

Over- and undersampling improved the accuracy of the model compared to the original, unbalanced data set, but not as much as the previous run appeared to indicate (which used the over/under-sampled data to form the validation folds).

An alternative way to properly use cross-validation while over-/under-sampling is to create the validation folds first (based on the original data) and then apply over-/under-sampling to the data after removing the \(k\)th fold. This ensures that the sampling is applied only to the training data but not to the test data.
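
A sketch of this manual approach is shown below, reusing the simulated X and y and the sampling strategy from above; the loop resamples only the training portion of each fold and evaluates on the untouched held-out fold, which keeps the original class balance.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import numpy as np

model = DecisionTreeClassifier()
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=565)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # over-/under-sample the training part of the fold only
    Xt, yt = SMOTE(sampling_strategy=0.1).fit_resample(X[train_idx], y[train_idx])
    Xt, yt = RandomUnderSampler(sampling_strategy=0.5).fit_resample(Xt, yt)
    model.fit(Xt, yt)
    # the held-out fold is evaluated as-is
    scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))
print('Mean Balanced Accuracy: %.3f' % np.mean(scores))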

30.4 Surrogacy

Surrogate, Indicator, Index

In Section 5.2 we introduced quantification as the process to represent the world in measurable quantities. We have seen that some attributes are difficult to quantify, and quantification relies on surrogates and indicators. A surrogate metric (proxy metric) is one that is used in the place of another. An indicator is a quantitative or qualitative factor or variable that offers a direct, simple, unique and reliable signal or means to measure achievements.

When multiple indicators are combined, we sometimes call it an index, although the distinction is not precise. Indicators can also be the result of aggregation so that the distinction between indicator and index becomes one of degree of combining information (with an index being more aggregated or combining more individual pieces of information). In some domains, the word index is simply used more frequently than indicators, and vice versa.

Economic indicators, for example, are used to describe aspects of the economy:

  • The consumer price index is a surrogate metric for cost of living and inflation.
  • The unemployment rate and inflation are surrogates for the strength of the economy.
  • The GDP is a measure for the economic health of a nation.
  • The net promoter score is a surrogate metric for customer loyalty.

When quantifying a variable of interest is difficult, time consuming, expensive, or destructive, and a surrogate variable is easy to quantify, the surrogate can take the place of the variable of interest. You will come across many surrogates during feature engineering. Here are some examples:

  • Healthcare
    • Cholesterol levels as a proxy for heart health.
    • BMI as a proxy for obesity-related risks.
  • Education
    • Standardized test scores as surrogates for student achievement.
    • Graduation rates as a proxy for work force preparation.
  • Manufacturing
    • Force to pull a nail through a gypsum board as a surrogate for dry wall quality.
  • Economy
    • Gross Domestic Product as a surrogate for societal well-being.
    • Consumer Price Index as a surrogate for purchasing power.
    • Unemployment rate as a surrogate for economic strength.
  • Environmental Science
    • CO2 levels as proxy for climate change.
    • Tree canopy cover as a proxy for biodiversity.
  • Human Resources
    • Measuring employee productivity by number of hours worked.
    • Employee attrition as a surrogate for employee satisfaction.
  • Technology
    • Number of downloads of an app as a measure of market share.
    • Likes and shares as a measure of social media engagement.
  • Academia
    • Citation count as a surrogate for research impact.

Surrogate measurements are very often necessary, but they are not without issues:

  • They might not capture all factors influencing the attribute of interest.
  • They can lead to oversimplification and misinterpretation.
  • The surrogate might not measure what is most relevant. For example, standardized test scores do not measure critical thinking.
  • They can lead to surrogation.

Body Mass Index (BMI), the Bogus Index

What is it?

A great example of a surrogate-gone-mad is the body mass index (BMI). It is a medical screening tool for certain health conditions; it has become the standard indicator for obesity. Colloquially, it is understood as a measure of “fatness”. That is not quite correct: in most people, BMI only correlates with body fat.

The popularity of BMI is related to its easy derivation: it involves only two variables, a person’s weight and height, both of which can be measured precisely. With weight measured in kilograms (kg) and height in meters (m), the formula is simply \[ \text{BMI} = \frac{\text{Weight in kg}}{(\text{Height in m})^2} \]

Notice that the height is squared in the denominator. If you prefer to work in U.S. pounds and inches, the calculation is \[ \text{BMI} = \frac{\text{Weight in lbs} \times 703}{(\text{Height in inches})^2} \]

In my case (6’3” tall, 208 lbs), the BMI is \((208 \times 703)/75^2 \approx 26\). According to BMI charts such as this one at the Cleveland Clinic, I am in the overweight range.
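
As a quick sanity check of the two formulas, here is a small sketch (the function names are ours):

def bmi_metric(weight_kg, height_m):
    return weight_kg / height_m**2

def bmi_us(weight_lbs, height_in):
    return weight_lbs * 703 / height_in**2

# 6'3" = 75 inches, 208 lbs
print(round(bmi_us(208, 75), 1))   # about 26, just inside the overweight range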


BMI is used widely by medical professionals. People with low BMI values might be at risk for developing anemia, osteoporosis, infertility, malnutrition, and a weakened immune system. High values can indicate a higher risk for heart disease, high blood pressure, type 2 diabetes, gallstones, osteoarthritis, sleep apnea, depression, and certain cancers.

Geez. It seems that unless you are in the optimal BMI range you are either bound for osteoporosis or osteoarthritis. No wonder folks are obsessing over their BMI.

The Cleveland Clinic is quick to point out:

It’s important to remember that you could have any of the above health conditions without having a high BMI. Similarly, you could have a high BMI without having any of these conditions.

It’s important to remember that body fatness isn’t the only determiner of overall health. Several other factors, such as genetics, activity level, smoking cigarettes or using tobacco, drinking alcohol and mental health conditions all affect your overall health and your likelihood of developing certain medical conditions.

And

The standard BMI chart has limitations for various reasons. Because of this, it’s important to not put too much emphasis on your BMI.

People who are muscular can have a high BMI and still have very low fat mass. The BMI does not distinguish between lean body mass and fat body mass. BMI charts do not distinguish between males and females, although females tend to have more body fat (says the Cleveland Clinic!). People today are taller than when the BMI was developed. The BMI charts do not apply to athletes, children, pregnant people, or the elderly.

Even though the BMI chart can be inaccurate for certain people, healthcare providers still use it because it’s the quickest tool for assessing a person’s estimated body fat amount.

Ah, so it is used because it is easy to calculate, not because it is particularly useful or accurate.

A January 2025 article in the medical journal The Lancet states (Rubino et al. 2025)

Current BMI-based measures of obesity can both underestimate and overestimate adiposity and provide inadequate information about health at the individual level, which undermines medically-sound approaches to health care and policy.

Based on this article, Business Insider went a step further, calling BMI bogus. Having “obesity” according to the BMI scale does not mean a person is unhealthy. In fact, BMI does not tell you anything about the health of a person.

Figure 30.4: Dwayne “The Rock” Johnson has a BMI of 34.3, which is classified as obese.

Origin of BMI

It is worthwhile to examine how BMI came about. The comment above about the increasing height of people suggests that BMI was developed some time ago. Indeed. It was invented for an entirely different purpose: to describe the average man of a population in 19th-century Western Europe.

To make this connection we need to introduce Adolphe Quetelet (1796–1874), who invented the BMI to quantify a population according to the weight of its persons. It was initially called the Quetelet index. Quetelet was a Belgian astronomer, statistician, and mathematician, not a medical professional. He studied the distribution of physical attributes in populations of French and Scottish people. Quetelet determined that the normal, the most representative value of an attribute, is its average. Prior to Quetelet, the idea of “norm” and “normality” was associated with carpentry and construction: the carpenter square is also called the norm, and in normal construction everything is at right angles. Up to this time, the classical notion of the ideal was an unattainable beauty, reflected in great works of art.

Quetelet focused on the middle of the distribution as the “new normal” and saw l’homme moyen, the average man, as the ideal (Grue and Heiberg 2006).

There is no association with health, and there is no association with the individual. The BMI as developed by Quetelet was supposed to describe the average in a population, not the obesity of an individual. The populations it was intended to describe are the French and Scottish of the 19th century. Leaping from that application and those data to a near-universal measure of obesity since the 1970s is quite the stretch.

Surrogation

Surrogation occurs when people or organizations substitute a metric for the underlying concept or goal that the metric is intended to represent. Instead of treating the metric as a proxy or tool to measure progress, they treat it as the ultimate goal itself. The metric has become the goal. Surrogates are good. Surrogation is bad.

When standardized test scores are a surrogate for student achievement, surrogation means to focus on driving up the test scores rather than focusing on driving actual student achievement.

Someone can obsess over BMI (body mass index), trying to improve their BMI but not getting any healthier. Well, we now know why.

Surrogation of net promoter scores occurs when companies focus on driving up NPS by only asking customers who repeatedly buy their product. The goal is to increase customer satisfaction and loyalty; the NPS should increase as a result. Surrogation focuses on the NPS, not the actual customer satisfaction. The first mistake in surrogation is to assume that the surrogate is exactly what it is a proxy for. The net promoter score is a proxy for customer satisfaction; it is not customer satisfaction.

Salespeople are often incentivized on the number of deals or deal volume (revenue). The underlying goal is to drive success for the company. Surrogation can lead to closing deals that are bad for the company (losing money on the deal) but work in favor of the salesperson’s metric to hit a revenue target.


When metrics are tied to performance evaluation, resource assignment, or advancement, surrogation is common. It leads to distorted priorities and bad behavior (gaming the system): people try to meet the metric without achieving the underlying goal.

Another negative aspect of surrogation is to prioritize the simple fixes and quick wins instead of the deeper but more expensive and difficult corrections. A hospital might focus on amenities or quick ER service rather than improving health outcomes for its patients. A social media content creator generates click bait in order to increase clicks rather than focusing on meaningful engagement.