12  Data Quality

12.1 Data Profiling

The First Date with your Data

Data profiling is one of the first steps to take when you encounter a data set for the first time. It is how you kick-start the exploratory data analysis (EDA). Borne (2021) refers to it as “having that first date with your data.” We are not looking to derive new insights from the data or to build amazing machine learning models at this stage; we want to create a high-level report of the data’s content and condition. We want to know what we are dealing with. Common questions and issues addressed during profiling are:

  • Which variables (attributes) are in the data?

  • How many rows and columns are there?

  • Which variables are quantitative (represent numbers), and which variables are qualitative (represent class memberships)?

  • Are qualitative variables coded as strings, objects, numbers?

  • Are there complex data types such as JSON documents, images, audio, video encoded in the data?

  • What are the ranges (min, max) of the variables? Are these reasonable, or do they suggest outliers or measurement errors?

  • What is the distribution of quantitative variables?

  • What is the mean, median, and standard deviation of quantitative variables?

  • What are the unique values of qualitative variables?

  • Do coded fields such as ZIP codes, account numbers, email addresses, state codes have the correct format?

  • Are there attributes that have only a single value?

  • Are there duplicate entries?

  • Are there missing values in one or more variables?

  • What are the strengths and direction of pairwise associations between the variables?

  • Are some attributes perfectly correlated, for example, birthdate and age, or temperatures in degrees Celsius and degrees Fahrenheit?

California Housing Prices

In this subsection we consider a data set on housing prices in California, based on the 1990 census and available on Kaggle. The data contain information about geographic location, housing, and population within blocks. California has over 8,000 census tracts and a tract can have multiple block groups. There are over 20,000 census block groups and over 700,000 census blocks in California.

The variables in the data are:

Variables in the California Housing Prices data set.
Variable Description
longitude A measure of how far west a house is; a higher value is farther west
latitude A measure of how far north a house is; a higher value is farther north
housing_median_age Median age of a house within a block; a lower number is a newer building
total_rooms Total number of rooms within a block
total_bedrooms Total number of bedrooms within a block
population Total number of people residing within a block
households Total number of households, a group of people residing within a home unit, for a block
median_income Median income for households within a block of houses (measured in tens of thousands of US Dollars)
median_house_value Median house value for households within a block (measured in US Dollars)
ocean_proximity Location of house with respect to ocean/sea

The variable description is important metadata to understand the data. A variable such as total_rooms could be understood as the number of rooms in a building, and median_house_value could then mean the median value of the houses that have that number of rooms. However, since the data are not for individual houses but for blocks, total_rooms represents the sum of all the rooms in all the houses in that block.

We start profiling by reading the data into a pandas DataFrame, getting basic information about it, and listing the first few observations.

import numpy as np
import pandas as pd
CA_houses = pd.read_csv("../datasets/CaliforniaHousing_1990.csv")

CA_houses.info()
CA_houses.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

We can immediately answer several questions about the data:

  • There are 20,640 rows and 10 columns

  • Except for ocean_proximity, which is of object type (string), all other variables are stored as 64-bit floats. Variables total_rooms, total_bedrooms, population, and households are naturally integers; since they appeared in the CSV file with a decimal point, they were assigned a floating point data type.

  • Only the total_bedrooms variable has missing values: 20,640 – 20,433 = 207 values for this variable are unobserved. This shows a high level of completeness of the data set. (More on missing values in the next section.)

  • The listing of the first five observations confirms that variables are counts or sums at the block-level, rather than data for individual houses.

  • The latitude and longitude values differ in the second decimal place, suggesting that blocks (=rows of the data set) have unique geographic locations, but we cannot be sure.

  • The ocean_proximity entries are in all caps. We want to see the other values in that column to make sure the entries are consistent. Knowing the format of strings is important for filtering (selecting) or grouping observations. (Two quick checks for the last two points are sketched right after this list.)
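The distinct values of ocean_proximity and the number of blocks sharing a longitude/latitude pair can be checked with one-liners, for example:

# Distinct values of ocean_proximity and their frequencies;
# dropna=False would also surface missing entries, if there were any.
CA_houses["ocean_proximity"].value_counts(dropna=False)

# Number of blocks that share the exact same longitude/latitude pair.
CA_houses.duplicated(subset=["longitude", "latitude"]).sum()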

The next step is to use the pandas describe() method to compute basic summaries of the variables.

CA_houses.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

For each of the numeric variables, describe() computes the number of non-missing values (count), the sample mean (mean), the sample standard deviation (std), the minimum (min), maximum (max) and three percentiles (25%, 50%, 75%).

The results confirm that all variables have complete data (no missing values) except for total_bedrooms. The min and max values are useful to see the range (range = max – min) of the variables and to spot outliers and unusual values. It is suspicious that there are one or more blocks with a single household. This is not necessarily the same record that has 2 total_rooms and a population of 3.

CA_houses[CA_houses["households"] == 1]
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
16171 -122.5 37.79 52.0 8.0 1.0 13.0 1.0 15.0001 500001.0 NEAR BAY

This is indeed a suspicious record. There is a single household in the block, but thirteen people are living in the block in a house with eight rooms and one bedroom. This unusual configuration should be examined for possible data entry errors.

Profiling Tools

You can accelerate the data profiling task by using packages such as ydata_profiling (formerly known as pandas_profiling), lux, or sweetviz. Sweetviz, for example, generates detailed interactive visualizations in a web browser or a notebook that help address some of the profiling questions raised at the beginning of this section.

To create a profile report for the housing prices data with sweetviz, use the following:

import sweetviz as sv
my_report = sv.analyze(CA_houses)
my_report.show_html(filepath='Profiling_CA_1.html', open_browser=False)
Report Profiling_CA_1.html was generated.

Figure 12.1 displays the main screen of the visualization. You can access the entire interactive html report here. For each numeric variable sweetviz reports the number of observed and missing values, the number of distinct values, and a series of summary statistics similar to the output from describe().  A histogram is also produced that gives an idea of the distribution of the variable in the data set. For example, housing_median_age has a fairly symmetric distribution, whereas total_rooms and total_bedrooms are highly concentrated despite a wide range.

Figure 12.2 shows the bottom of the main screen that displays information on the ocean_proximity variable. We now see that there are five unique values for the variable with the majority in the category <1H OCEAN.

Clicking on any variable brings up more details. For housing_median_age that detail is shown in Figure 12.3. It includes a detailed histogram of the distribution as well as the largest, smallest, and most frequent values.

The graphics are interactive; for example, the number of histogram columns can be changed to the desired resolution.

Sweetviz displays pairwise associations between variables. You can see those for housing_median_age in Figure 12.3 or for all pairs of variables by clicking on Associations (Figure 12.4).

Associations are calculated and displayed differently depending on whether the variables in a pair are quantitative or not. For pairs of quantitative variables, sweetviz computes the Pearson correlation coefficient. It ranges from –1 to +1; a coefficient of 0 indicates no (linear) relationship between the two variables, that is, they are uncorrelated. A coefficient of +1 indicates a perfect positive correlation: knowing one variable allows you to perfectly predict the other. Similarly, a coefficient of –1 indicates a perfect negative correlation: one variable decreases as the other increases.

Strong positive correlations are present between households and the variables total_rooms, total_bedrooms, and population. That is expected, as these variables are accumulated across all households in a block. There is a moderate positive association between median income and median house value; more expensive houses are associated with higher incomes, which is not surprising. A strong negative correlation exists between longitude and latitude, a consequence of the geography of California: as you move further north (south), the state stretches further west (east).
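These quantitative correlations can also be reproduced directly in pandas; a quick sketch (the numeric_only= argument assumes a reasonably recent pandas version):

# Pearson correlations among the numeric columns; ocean_proximity is excluded
# because it is not numeric.
CA_houses.corr(numeric_only=True).round(2)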

Associations between a quantitative and a qualitative variable are calculated as the correlation ratio, which ranges from 0 to 1 and is displayed as a square in the Associations matrix. The correlation ratio is based on the means of the quantitative variable within the categories of the qualitative variable. A ratio of 0 means that these means are identical for all categories. Since the data contain only one qualitative variable, ocean_proximity, squares appear only in the last row and column of the Associations matrix.
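Sweetviz computes the correlation ratio internally. For intuition, here is a minimal sketch of the calculation (correlation_ratio is my own helper function, not part of sweetviz):

import numpy as np
import pandas as pd

def correlation_ratio(categories, values):
    # eta^2 = between-group sum of squares / total sum of squares,
    # computed from the rows where both variables are observed.
    df = pd.DataFrame({"cat": categories, "y": values}).dropna()
    grand_mean = df["y"].mean()
    groups = df.groupby("cat")["y"]
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((df["y"] - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)

correlation_ratio(CA_houses["ocean_proximity"], CA_houses["median_house_value"])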

If the data contains an obvious target variable for modeling, you can indicate that when creating the profiling report. Sweetviz then adds information on that variable to the visualizations. Suppose that we are interested in modeling the median house value as a function of other attributes. The following statement requests a report with median_house_value as the target.

housevalue_report = sv.analyze(CA_houses, target_feat="median_house_value")
housevalue_report.show_html(filepath='Profiling_CA_2.html', open_browser=False)
Report Profiling_CA_2.html was generated.

Figure 12.5 shows the detail on ocean_proximity from this analysis; the complete report is here. The averages of the blocks’ median house values in the five ocean proximity groups are shown on top of the histogram. The highest average median house value is found in the island category, the lowest average in the inland category.

12.2 Missing Values

When observations do not have values assigned to them, we say that the value is missing. This is a fact of life in data analytics; whenever you work with a set of data you should expect values to be missing.

Definition: Missing Value

A missing value is an observation that has no value assigned to it.

Missingness is obvious when you see incomplete columns in the data. The problem can be inconspicuous when entire records are missing from the data. A survey that fails to include a key demographic misses the records of those who should have been sampled in the survey.

You should check how the software packages you use for data analysis handle missing values by default and how that behavior can be changed through options. In many cases the default behavior is casewise deletion, also known as complete-case analysis: any record that has a missing value in one or more of the analysis variables is excluded from the analysis. Pairwise deletion removes only those records that have missing values for a specific analysis. To see the difference, consider the data in Table 12.1 and suppose you want to compute the matrix of correlations among the variables.

Table 12.1: Three variables with different missing value patterns.
\(X_1\) \(X_2\) \(X_3\)
1.0 3.0 .
2.9 . 3.4
3.8 . 8.2
0.5 3.7 .

A complete-case analysis of \(X_1\), \(X_2\), and \(X_3\) would result in a data frame without observations, since each row of the table has a missing value in one column. Pairwise deletion computes the correlation between \(X_1\) and \(X_2\) based on the first and last observations, computes the correlation between \(X_1\) and \(X_3\) based on the second and third observations, and fails to compute a correlation between \(X_2\) and \(X_3\).
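The difference is easy to reproduce in pandas with the data from Table 12.1: DataFrame.corr() applies pairwise deletion by default, while dropping incomplete rows first corresponds to the complete-case analysis.

import numpy as np
import pandas as pd

# The data from Table 12.1; the dots become NaN.
tbl = pd.DataFrame({"X1": [1.0, 2.9, 3.8, 0.5],
                    "X2": [3.0, np.nan, np.nan, 3.7],
                    "X3": [np.nan, 3.4, 8.2, np.nan]})

tbl.dropna()   # complete-case deletion: no rows survive
tbl.corr()     # pairwise deletion: X1/X2 and X1/X3 are each based on two rows
               # (so those correlations are trivially +/-1); X2/X3 is NaN
               # because the two columns are never observed together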

Some possible causes of missing values are:

  • Members of the target population not included (missing records)
  • Data entry errors
  • Variable transformations that lead to invalid values: division by zero, logarithm of zeros or negative values
  • Measurement equipment malfunction
  • Measurement equipment limits exceeded
  • Attrition (drop-outs) of subjects in longitudinal studies (death, moving, refusal, changes in medical condition, …)
  • Nonresponse of subjects in surveys
  • Variables not measured
  • Not all combinations of factors are observable. For example, the data set of your Netflix movie ratings is extremely sparse, unless you “finished Netflix” and rated all movies.
  • Regulation requires removal of sensitive information

Data transformations can introduce missing values into data sets when mathematical operations are not valid. To accommodate nonlinear relationships between target and input variables, transformations of inputs such as ratios, square roots, and logarithms are common. These transformations are sometimes applied to change the distribution of data, for example, to create more symmetry by taking logarithms of right-skewed data (Figure 12.6).

The log-transformation is meaningful in the home values example; it is not unreasonable to proceed with an analysis that assumes log(value) is normally distributed. Suppose you are log-transforming another highly skewed variable, the amount of an individual’s annual medical out-of-pocket expenses. Most people have a moderate amount of out-of-pocket expenses, and a smaller percentage have very high annual expenses. However, many will have no out-of-pocket expenses at all. Taking the logarithm will invalidate the records of those individuals. To get around the numerical issue of taking logarithms of zeros, the transformation is sometimes changed to log(expenses + 1). This avoids missing values but fudges the data by pretending that everyone has at least some medical expenses.
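A small illustration with made-up expense values shows the effect; np.log1p computes log(x + 1):

import numpy as np
import pandas as pd

# Hypothetical annual out-of-pocket expenses; two people spent nothing.
expenses = pd.Series([0.0, 0.0, 120.0, 450.0, 3800.0, 25000.0])

np.log(expenses)    # the zeros become -inf (and negative values would become NaN),
                    # invalidating those records for most analyses
np.log1p(expenses)  # log(x + 1) keeps every record, at the cost of pretending
                    # everyone had at least some expenses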

Removing missing values from the analysis is appropriate only when the reason for the missingness is unrelated to any other information in the study. The relationship between absence of information and the study is known as the missing value process.

Missing Value Process

Important

Making the wrong assumption about the missing value process can bias the results. A complete-case analysis is guaranteed to be unbiased only if the data are missing completely at random. If the data are only missing at random, a complete-case analysis can be biased, but that bias can at least be corrected.

You need to be aware of three types of missing data based on that process:

  • MCAR: Data is said to be missing completely at random when the missingness is unrelated to any study variable, including the target variable. There are no systematic differences between the records with missing data and the records with complete data. MCAR is a very strong assumption, and it is the best you can hope for. If the data are MCAR, you can safely delete records with missing values because the complete cases are a representative sample of the whole. Case deletion reduces the size of the available data but does not introduce bias into the analysis.

  • MAR: Data is said to be missing at random when the pattern of missingness is related to the observed data but not to the unobserved data. Suppose you are conducting a survey regarding depression and mental health. If one group is less likely to provide information in the survey for reasons unrelated to their level of depression, then the group’s data is missing at random. Complete-case analysis of a data set that contains MAR data can result in bias.

  • MNAR: Data is said to be missing not at random if the absence of information is systematically related to the unobserved data. For example, employees do not report their salaries in a workplace survey, or a group is less likely to respond to a depression survey because of their level of depression.

A complete-case analysis does not necessarily bias the results if the data are MAR or MNAR. If the missingness is related to the primary target variable, then the results are biased. In the MAR case that bias can be corrected. As noted by the NIH in the context of patient studies,

The import of the MAR vs. MNAR distinction is therefore not to indicate that there definitively will or will not be bias in a complete case analysis, but instead to indicate – if the complete case analysis is biased – whether that bias can be fully removed in analysis.

Missing values are represented in data sets in different ways. The two basic methods are masks (extra bits that indicate whether a value is available) and sentinel values (special entries that indicate that a value is not available, that is, missing).

Definition: Sentinel Value

In programming, a sentinel value is a special placeholder that indicates a special condition in the data or the program. Applications of sentinel values are to indicate when to break out of loops or to indicate unobserved values.

Sentinel values such as –9999 to indicate a missing value are dangerous; they are too easily mistaken for valid numerical entries. The only sentinel value one should use is NaN (not-a-number), a specially defined IEEE floating-point value. Software implements special logic for handling NaNs. Unfortunately, NaN is available only for floating-point data types, so software uses different techniques to implement missing values across all data types. For example, in databases you find masking based on the concept of NULL values to indicate absence (=nullity) of a value.
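A small illustration with hypothetical sensor readings shows how easily a numeric sentinel contaminates results (the values and the –9999 convention are made up for the example):

import numpy as np
import pandas as pd

temps = pd.Series([21.5, 23.1, -9999, 22.4])   # -9999 marks a failed reading

temps.mean()                                   # roughly -2483, nonsense
temps.replace(-9999, np.nan).mean()            # roughly 22.3, because pandas
                                               # skips NaNs in reductions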

Caution

Do not use sentinel values that could be confused with real data values to indicate that a value is missing.

Missing Values in Pandas

Python has the singleton object None, which can be used to indicate missingness, and it supports the IEEE NaN (not-a-number) value for floating-point types. Pandas uses the sentinel value approach, based on NaN for floating-point data and None for all other data types. This choice has some side effects: None and NaN do not behave the same way.

import numpy as np
import pandas as pd

x1 = np.array([1, None, 3, 4])
x2 = np.array([1, 2, 3, 4])
display(x1)
display(x2)
array([1, None, 3, 4], dtype=object)
array([1, 2, 3, 4])

The array containing a None value is represented internally as an array of Python objects. Operating on objects is slower than operating on basic data types such as integers. Having missing values in non-floating-point columns thus incurs some drag on performance.

For floating point data use np.nan to indicate missingness.

f1 = np.array([1, np.nan, 3, 4])
f1
array([ 1., nan,  3.,  4.])

The behavior of None and NaN in operations is different. For example, arithmetic operations on NaNs result in NaNs, whereas arithmetic on None values results in errors.

f1.sum()
nan
x1.sum()
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

While None values result in errors and stop program execution, NaNs are contagious; they turn everything they come in touch with into NaNs—but the program keeps executing. Pandas mixes None and NaN values and follows casting rules when np.nan is stored.

x2 = pd.Series([1,2,3,4], dtype=int)
display(x2)
x2[2] = np.nan
display(x2)
0    1
1    2
2    3
3    4
dtype: int64
0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

The integer series is converted to a float series when a NaN is inserted. The same happens when you use None instead of NaN:

x3 = pd.Series([1,2,3,4], dtype=int)
x3[1] = None
x3
0    1.0
1    NaN
2    3.0
3    4.0
dtype: float64

Working with Missing Values in Data Sets

The following statements create a pandas DataFrame from a CSV file that contains information about 7,303 traffic collisions in New York City. You can use the info() method of the DataFrame for information about the columns, including missing value counts.

collisions = pd.read_csv("../datasets/nyc_collision_factors.csv")
collisions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7303 entries, 0 to 7302
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   DATE                           7303 non-null   object 
 1   TIME                           7303 non-null   object 
 2   BOROUGH                        6920 non-null   object 
 3   ZIP CODE                       6919 non-null   float64
 4   LATITUDE                       7303 non-null   float64
 5   LONGITUDE                      7303 non-null   float64
 6   LOCATION                       7303 non-null   object 
 7   ON STREET NAME                 6238 non-null   object 
 8   CROSS STREET NAME              6166 non-null   object 
 9   OFF STREET NAME                761 non-null    object 
 10  NUMBER OF PERSONS INJURED      7303 non-null   int64  
 11  NUMBER OF PERSONS KILLED       7303 non-null   int64  
 12  NUMBER OF PEDESTRIANS INJURED  7303 non-null   int64  
 13  NUMBER OF PEDESTRIANS KILLED   7303 non-null   int64  
 14  NUMBER OF CYCLISTS INJURED     0 non-null      float64
 15  NUMBER OF CYCLISTS KILLED      0 non-null      float64
 16  CONTRIBUTING FACTOR VEHICLE 1  7303 non-null   object 
 17  CONTRIBUTING FACTOR VEHICLE 2  6218 non-null   object 
 18  CONTRIBUTING FACTOR VEHICLE 3  303 non-null    object 
 19  CONTRIBUTING FACTOR VEHICLE 4  59 non-null     object 
 20  CONTRIBUTING FACTOR VEHICLE 5  14 non-null     object 
 21  VEHICLE TYPE CODE 1            7245 non-null   object 
 22  VEHICLE TYPE CODE 2            5783 non-null   object 
 23  VEHICLE TYPE CODE 3            284 non-null    object 
 24  VEHICLE TYPE CODE 4            54 non-null     object 
 25  VEHICLE TYPE CODE 5            12 non-null     object 
dtypes: float64(5), int64(4), object(17)
memory usage: 1.4+ MB

Columns DATE and TIME contain no null (missing) values; their count equals the number of observations (7,303). The numbers of cyclists injured or killed in the accidents in columns 14 and 15 contain only missing values. Is this a situation where the data was not entered, or should the entries be zeros? We should get back to the domain experts who put the data together to find out how to handle these columns.

Table 12.2 displays Pandas DataFrame methods to operate on missing values.

Table 12.2: Methods to operate on missing values in pandas DataFrames.
Method Description Notes
isnull() Returns a boolean same-sized object indicating whether values are missing isna() is an alias
notnull() Opposite of isnull(), indicating if values are not missing notna() is an alias
dropna() Remove missing values, dropping either rows (axis=0) or columns (axis=1) how={'any','all'} to determine when to drop a row or column
fillna() Replace missing values with designated values method= to propagate last valid value or backfill with next valid value
interpolate() Fill in missing values using an interpolation method

The isnull() method can be used to return a data frame of Boolean (True/False) values that indicate missingness. You can sum across rows or columns of the data frame to count the missing values:

collisions.isnull().sum()
DATE                                0
TIME                                0
BOROUGH                           383
ZIP CODE                          384
LATITUDE                            0
LONGITUDE                           0
LOCATION                            0
ON STREET NAME                   1065
CROSS STREET NAME                1137
OFF STREET NAME                  6542
NUMBER OF PERSONS INJURED           0
NUMBER OF PERSONS KILLED            0
NUMBER OF PEDESTRIANS INJURED       0
NUMBER OF PEDESTRIANS KILLED        0
NUMBER OF CYCLISTS INJURED       7303
NUMBER OF CYCLISTS KILLED        7303
CONTRIBUTING FACTOR VEHICLE 1       0
CONTRIBUTING FACTOR VEHICLE 2    1085
CONTRIBUTING FACTOR VEHICLE 3    7000
CONTRIBUTING FACTOR VEHICLE 4    7244
CONTRIBUTING FACTOR VEHICLE 5    7289
VEHICLE TYPE CODE 1                58
VEHICLE TYPE CODE 2              1520
VEHICLE TYPE CODE 3              7019
VEHICLE TYPE CODE 4              7249
VEHICLE TYPE CODE 5              7291
dtype: int64

If you choose to remove records with missing values, you can use the dropna() method. The how='any'|'all' option specifies whether to remove records if any variable is missing (complete-case analysis) or only if all variables are missing. Because the columns referring to cyclists contain only missing values, a complete-case analysis will result in an empty DataFrame.

coll_no_miss = collisions.dropna(how='any')
display(len(collisions))
display(len(coll_no_miss))
7303
0

Suppose we verified that the missing values in columns 14 & 15 were meant to indicate that no cyclists were injured or killed. Then we can replace the missing values with zeros using the fillna() method.

collisions["NUMBER OF CYCLISTS INJURED"].fillna(0,inplace=True)
collisions["NUMBER OF CYCLISTS KILLED"].fillna(0,inplace=True)
/var/folders/ff/6m670wks1776ck_gbfxynxrc0000gq/T/ipykernel_76737/612697346.py:1: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.



/var/folders/ff/6m670wks1776ck_gbfxynxrc0000gq/T/ipykernel_76737/612697346.py:2: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.
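As the warnings suggest, the chained inplace pattern will stop working in pandas 3.0. Assigning the result back, or passing a column-to-value mapping to DataFrame.fillna(), accomplishes the same thing without the warning:

# Re-assign the filled columns ...
collisions["NUMBER OF CYCLISTS INJURED"] = collisions["NUMBER OF CYCLISTS INJURED"].fillna(0)
collisions["NUMBER OF CYCLISTS KILLED"] = collisions["NUMBER OF CYCLISTS KILLED"].fillna(0)

# ... or fill both in a single call on the DataFrame:
collisions.fillna({"NUMBER OF CYCLISTS INJURED": 0,
                   "NUMBER OF CYCLISTS KILLED": 0}, inplace=True)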


Techniques for imputing missing values are discussed in more detail below.

The notnull() method is useful to select records without missing values. Since it returns a boolean same-sized object, you can use it to filter:

bool_series = pd.notnull(collisions["CONTRIBUTING FACTOR VEHICLE 5"]) 
collisions.loc[bool_series, ["DATE", "TIME", "BOROUGH", "ON STREET NAME"]] 
DATE TIME BOROUGH ON STREET NAME
217 04/11/2016 13:22:00 BROOKLYN STUYVESANT AVENUE
896 10/08/2016 14:10:00 QUEENS NaN
1112 10/18/2016 07:10:00 QUEENS 124 STREET
1348 11/23/2016 14:11:00 BROOKLYN SUTTER AVENUE
1551 11/16/2016 11:11:00 QUEENS 112 AVENUE
1554 11/29/2016 04:11:00 BROOKLYN SNYDER AVENUE
2794 11/15/2016 03:11:00 MANHATTAN NaN
2899 03/07/2016 08:45:00 QUEENS 105 AVENUE
3292 10/25/2016 18:10:00 BROOKLYN HIGHLAND BOULEVARD
3813 10/03/2016 17:10:00 STATEN ISLAND MANOR ROAD
4346 02/08/2016 12:08:00 BROOKLYN EAST 35 STREET
5286 01/14/2016 20:00:00 NaN NaN
5556 01/07/2016 18:50:00 QUEENS 31 AVENUE
6737 01/11/2016 09:03:00 BROOKLYN BROADWAY

Visualizing Missing Value Patterns

The Missingno Python package has some nice methods to inspect the missing value patterns in data. This is helpful to see the missing value distribution across multiple columns. The matrix() method displays the missing value pattern for the DataFrame.

import missingno as msno
msno.matrix(collisions)

Columns without missing values, such as DATE and TIME, are shown as solid gray bars. Missing values are displayed in white. The following graph shows the result of matrix() prior to filling in zeros in the cyclist columns. The sparkline at the right summarizes the general shape of the data completeness and points out the rows with the maximum and minimum number of missing values in the data set. In the most complete rows, 11 of the 26 columns have missing values; in the least complete rows, 23 of the 26 values are missing.

Table 12.3 contains data on property sales. The total value of the property is the sum of the first two columns; the last column is the ratio between the sale price and the total value. A missing value in one of the first two columns triggers a missing value in the Total column. If either Total or Sale is not present, the appraisal ratio in the last column must be missing.

Table 12.3: Data with column dependencies that propagate missing values.
Land Improvements Total Sale Appraisal Ratio
30000 64831 94831 118500 1.25
30000 50765 80765 93900 1.16
46651 18573 65224 . .
45990 91402 137392 184000 1.34
42394 . . 168000 .
. 133351 . 169000 .
63596 2182 65778 . .
56658 153806 210464 255000 1.21
51428 72451 123879 . .
93200 . . 422000 .
76125 78172 275297 290000 1.14
154360 61934 216294 237000 1.10
65376 . . 286500 .
42400 . . . .
40800 92606 133406 168000 1.26

The heatmap() method shows a matrix of nullity correlations between the columns of the data. Note that the CSV file contains dots (“.”) for the missing values. To make sure the data are correctly converted to numerical types and the dots are interpreted as missing values, the na_values= and skipinitialspace= options are added to pd.read_csv().

land = pd.read_csv("../datasets/landsales.csv", 
                   na_values=".",
                   skipinitialspace=True)

msno.heatmap(land)

The nullity correlations are Pearson correlation coefficients computed from the isnull() boolean object for the data, excluding columns that are completely observed or completely unobserved. A correlation of –1 means that the presence/absence of two variables is perfectly (negatively) correlated: if one variable appears, the other variable does not appear. A correlation of +1 similarly means that the presence of one variable goes together with the presence of the other variable. The Total column is not perfectly correlated with the Land or Improvements columns because a null value in either or both of these can cause a null value for Total. Similarly, the large correlations between the appraisal ratio & Total and the appraisal ratio & Sale indicate that their missing values are likely to occur together.
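The underlying numbers can be reproduced directly from the isnull() mask (a quick sketch; missingno additionally drops columns that are completely observed or completely missing before plotting):

# Pearson correlations between the missingness indicators of the columns.
land.isnull().astype(int).corr()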

Data Imputation

In a previous example we used the fillna() method to replace missing values with actual values: the unobserved values for the number of cyclists in the collisions data set were interpreted as meaning that no cyclists were injured, and the NaNs were replaced with zeros. This is an example of data imputation.

Definition: Data Imputation

Data imputation is the process of replacing unobserved (missing) values with usable values.

Imputation must be carried out with care. It is tempting to replace absent values with numbers and to complete the data: records are not removed from the analysis, the sample size is maintained, and calculations no longer fail. However, imputing values that are not representative introduces bias into the data.

Completing missing values based on information in other columns often seems simple on the surface, but it is fraught with difficulties—there be dragons! Suppose you are using address information to fill in missing ZIP codes. It is not sufficient to know that the city is Blacksburg. If we are talking about Blacksburg, SC, then we know the ZIP code is 29702. If it is Blacksburg, VA, however, then there are four possible ZIP codes; we need the street address to resolve to a unique value. Inferring a missing attribute such as gender should never be done. You cannot safely do it using names. Individuals might have chosen not to report gender information. You cannot afford to get it wrong.

If string-type data are missing and you want to include those records in the analysis, you can replace the missing values with an identifying string such as “Unobserved” or “Unknown”. That allows you to break out results for these observations in group-by operations, for example.

If you decide to proceed with imputation of missing values based on algorithms, here are some options:

  • Random replacement. Also called hot-deck imputation, the missing value is replaced with a randomly selected similar record in the same data set that has complete information.

  • LOCF. The missing value is replaced with the last complete observation preceding it: the last observation is carried forward. It is also called a forward fill. This method requires that the order of the observations in the data is somehow meaningful. If observations are grouped by city, it is probable that a missing value for the city column represents the same city as the previous record, unless the missing value falls on the record boundary between two cities.

  • Backfill. This is the opposite of LOCF; the next complete value following one or more missing values is propagated backwards.

  • Mean/Median imputation. This technique applies to numeric data; the missing value is replaced based on the sample mean, the sample median, or other statistical measures of location calculated from the non-missing values in a column. If the data consist of groups or classes, then group-specific means can be used. For example, if the data comprise age groups or genders, then missing values for a numeric variable can be replaced with averages of the variable by age group or gender (see the sketch following this list).

  • Interpolation methods. For numerical data, missing values can be interpolated from nearby values. Interpolation calculations are based on linear, polynomial, or spline methods. Using the interpolate() method in pandas, you can choose between interpolating across rows or columns of the DataFrame.

  • Regression imputation. The column with missing values is treated as the target variable of a regression model, using one or more other columns as input variables of the regression. The missing values are then treated as unobserved observations for which the target is predicted.

  • Matrix completion. Based on principal component analysis, missing values in an \(r \times c\) numerical array are replaced with a low-rank approximation based on the observed values.

  • Generative imputation. If the data consists of images or text and portions are unobserved, for example, parts of the image are obscured, then generative methods can be used to fill in missing pixels in the image or missing words in text.
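As an illustration of the group-specific means mentioned in the Mean/Median imputation item, here is a minimal sketch with made-up data (the column names are hypothetical):

import numpy as np
import pandas as pd

people = pd.DataFrame({
    "age_group": ["18-34", "18-34", "35-54", "35-54", "55+", "55+"],
    "income":    [38000, np.nan, 61000, 59000, np.nan, 52000],
})

# Mean income within each age group, aligned with the original rows.
group_means = people.groupby("age_group")["income"].transform("mean")

# Replace missing incomes with the mean of the person's own age group.
people["income_imputed"] = people["income"].fillna(group_means)
people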

To demonstrate some of these techniques, consider this simple 6 x 3 DataFrame.

np.random.seed(42)
df = pd.DataFrame(np.random.standard_normal((6, 3)))
df.iloc[:4,1] = np.nan
df.iloc[1:3,2] = np.nan
df
0 1 2
0 0.496714 NaN 0.647689
1 1.523030 NaN NaN
2 1.579213 NaN NaN
3 0.542560 NaN -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247

You can choose a common fill value for all columns or vary the value by column.

df.fillna({1:0.3, 2:0})
0 1 2
0 0.496714 0.300000 0.647689
1 1.523030 0.300000 0.000000
2 1.579213 0.300000 0.000000
3 0.542560 0.300000 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247

Missing values in the second column are replaced with 0.3, those in the last column with zero. Forward (LOCF) and backward imputation are available by setting the method= parameter of fillna():

df.fillna(method="ffill")
/var/folders/ff/6m670wks1776ck_gbfxynxrc0000gq/T/ipykernel_76737/3944122520.py:1: FutureWarning:

DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
0 1 2
0 0.496714 NaN 0.647689
1 1.523030 NaN 0.647689
2 1.579213 NaN 0.647689
3 0.542560 NaN -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247

Notice that the forward fill does not replace the NaNs in the second column as there is no non-missing last value to be carried forward. The second and third row in the third column are imputed by carrying forward 0.647689 from the first row. With a backfill imputation the DataFrame is fully completed, since both columns have an observed value after the last missing value.

df.fillna(method="bfill")
/var/folders/ff/6m670wks1776ck_gbfxynxrc0000gq/T/ipykernel_76737/3896554658.py:1: FutureWarning:

DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
0 1 2
0 0.496714 -1.913280 0.647689
1 1.523030 -1.913280 -0.465730
2 1.579213 -1.913280 -0.465730
3 0.542560 -1.913280 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247
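As the warnings note, the method= parameter of fillna() is deprecated; the dedicated methods give the same results:

df.ffill()   # equivalent to df.fillna(method="ffill")
df.bfill()   # equivalent to df.fillna(method="bfill")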

To replace the missing values with the column-specific sample means, simply use

df.fillna(df.mean())
0 1 2
0 0.496714 -1.463056 0.647689
1 1.523030 -1.463056 -0.307178
2 1.579213 -1.463056 -0.307178
3 0.542560 -1.463056 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247

The sample means of the non-missing observations in the second and third columns are -1.463056 and -0.307178, respectively. Imputing with the sample mean has the nice property of preserving the sample mean of the completed data:

df.fillna(df.mean()).mean()
0    0.636865
1   -1.463056
2   -0.307178
dtype: float64

You can use location statistics other than the mean. The following statement imputes with the column-specific sample median:

df.fillna(df.median())
0 1 2
0 0.496714 -1.463056 0.647689
1 1.523030 -1.463056 -0.075741
2 1.579213 -1.463056 -0.075741
3 0.542560 -1.463056 -0.465730
4 0.241962 -1.913280 -1.724918
5 -0.562288 -1.012831 0.314247

To demonstrate imputation by interpolation, consider this simple series:

s = pd.Series([0, 2, np.nan, 8])
s
0    0.0
1    2.0
2    NaN
3    8.0
dtype: float64

The default interpolation method is linear interpolation. You can choose other methods, for example spline or polynomial interpolation, with the method= option. The output shown below is from the second statement, the polynomial interpolation, which fills the gap with 4.67; linear interpolation would fill it with 5.0.

s.interpolate()
s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

To see interpolation operate on a DataFrame, consider this example:

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                    (np.nan, 2.0, np.nan, np.nan),
                    (2.0, 3.0, np.nan, 9.0),
                    (np.nan, 4.0, -4.0, 16.0)],
                   columns=list('abcd'))
df
a b c d
0 0.0 NaN -1.0 1.0
1 NaN 2.0 NaN NaN
2 2.0 3.0 NaN 9.0
3 NaN 4.0 -4.0 16.0

The default interpolation is linear, applied down the rows of each column in the forward direction.

df.interpolate()
a b c d
0 0.0 NaN -1.0 1.0
1 1.0 2.0 -2.0 5.0
2 2.0 3.0 -3.0 9.0
3 2.0 4.0 -4.0 16.0

Because there was no observed value preceding the missing value in column b, the NaN cannot be interpolated. Because there was no value past the last missing value in column a, the NaN is replaced with the last value carried forward.

To interpolate in the forward and backward direction, use limit_direction=’both’:

df.interpolate(method='linear', limit_direction='both')
a b c d
0 0.0 2.0 -1.0 1.0
1 1.0 2.0 -2.0 5.0
2 2.0 3.0 -3.0 9.0
3 2.0 4.0 -4.0 16.0

To interpolate across the columns, set the axis= option to axis=1:

df.interpolate(method='linear', limit_direction='forward', axis=1)
a b c d
0 0.0 -0.5 -1.0 1.0
1 NaN 2.0 2.0 2.0
2 2.0 3.0 6.0 9.0
3 NaN 4.0 -4.0 16.0

Imputation seems convenient and relatively simple, and there are many methods to choose from. As you can see from this discussion, there are also many issues and pitfalls. Besides introducing potential bias by imputing with bad values, an important issue is the level of uncertainty associated with imputed values. Unless the data were measured with error or entered incorrectly, you have confidence in the observed values. You cannot have the same level of confidence in the imputed values. The imputed values are estimates based on processes that involve randomness. One source of randomness is that our data represents a random sample—a different set of data gives a different estimate for the missing value. Another source of randomness is the imputation method itself. For example, hot-deck imputation chooses the imputed value based on randomly selected observations with complete data.

Accounting for these multiple levels of uncertainty is non-trivial. If you pass imputed data to a statistical algorithm, it will assume that all values are known with the same level of confidence. There are ways to take imputation uncertainty into account. You can assign weights to the observations, where the weight is proportional to your level of confidence; assigning smaller weights to imputed values reduces their impact on the analysis. A better approach is to use bootstrap techniques to capture the true uncertainty and bias in the quantities derived from imputed data. This is more computationally intensive than a weighted analysis but very doable with today’s computing power.

Definition: Bootstrap

To bootstrap a data set is to repeatedly sample from the data with replacement. Suppose you have a data set with \(n\) rows. Then \(B\) = 1,000 bootstrap samples are 1,000 data sets of size \(n\), drawn independently of each other, with the observations within each bootstrap sample drawn with replacement.

Bootstrapping is a statistical technique to derive the sampling distribution of a random quantity by simulation: you apply the analysis to each bootstrap sample and summarize the results across the samples.
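Here is a minimal sketch (with made-up values) of how the bootstrap can propagate imputation uncertainty: the imputation is redone inside every bootstrap sample, so its variability shows up in the spread of the bootstrap estimates.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
y = pd.Series([1.2, 3.4, np.nan, 2.8, np.nan, 4.1, 0.7, 2.2])

B = 1000
boot_means = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(y), size=len(y))   # draw row positions with replacement
    sample = y.iloc[idx]
    completed = sample.fillna(sample.mean())     # re-impute within the bootstrap sample
    boot_means[b] = completed.mean()

boot_means.std()   # bootstrap estimate of the uncertainty in the imputed-data mean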

I hope you take away from this discussion that imputation is possible, imputation is not always necessary, and imputation must be done with care.

12.3 Outliers and Anomalies

12.4 Unbalanced Data