Impute Missing Values

June 01, 2019

Real-world data is full of missing values. You will often need to deal with them before you can train a model or do meaningful analysis. What follows are a few ways to impute (fill) missing values in Python with pandas, for both numeric and categorical data.

Imports

import pandas as pd
import numpy as np

Imputation for Numeric Features

Create a Toy Dataset

# create two columns of randomly generated values, replace a few examples with NaNs
data = {"X1": [np.nan, 0.7636183 , 0.61735332, 0.73848657, np.nan,
        0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
        "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951 ,
        0.6854597 , np.nan, 0.20935994, 0.54764129, np.nan ]}
df = pd.DataFrame(data)
print(df)

idx	X1	X2
0	NaN	0.875058
1	0.763618	0.772110
2	0.617353	0.643694
3	0.738487	0.542382
4	NaN	0.071095
5	0.716237	0.685460
6	0.730759	NaN
7	NaN	0.209360
8	0.710738	0.547641
9	0.546935	NaN
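
Before choosing an imputation method, it helps to see how much is actually missing. The snippet below is a minimal sketch of that check; the name `toy` and the rounded values are my own stand-ins for the dataset above, not part of the original example.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the toy dataset above (values rounded for brevity)
toy = pd.DataFrame({
    "X1": [np.nan, 0.76, 0.62, 0.74, np.nan, 0.72, 0.73, np.nan, 0.71, 0.55],
    "X2": [0.88, 0.77, 0.64, 0.54, 0.07, 0.69, np.nan, 0.21, 0.55, np.nan],
})

# number of missing values in each column
missing_per_column = toy.isna().sum()

# fraction of missing values in each column
missing_share = toy.isna().mean()

# rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```

If a column is mostly missing, imputing it may add more noise than signal, so this check is worth running first.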

Imputation Method 1: Mean or Median

A common method of imputation with numeric features is to replace missing values with the mean of the feature’s non-missing values. If the data have outliers, you may want to use the median instead. Either method is easy in Pandas:

# replace missing values with the column mean
df_mean_imputed = df.fillna(df.mean())
# or replace missing values with the column median
df_median_imputed = df.fillna(df.median())
df_mean_imputed

idx	X1	X2
0	0.689161	0.875058
1	0.763618	0.772110
2	0.617353	0.643694
3	0.738487	0.542382
4	0.689161	0.071095
5	0.716237	0.685460
6	0.730759	0.543350
7	0.689161	0.209360
8	0.710738	0.547641
9	0.546935	0.543350
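
To see why outliers matter for this choice, here is a small sketch with a made-up column (the Series `s` and its values are hypothetical, chosen so the arithmetic is easy to follow). One extreme value drags the mean far from the typical values, while the median stays put.

```python
import numpy as np
import pandas as pd

# hypothetical feature with one extreme value (100.0) and one missing value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

# NaNs are skipped when computing both statistics
mean_fill = s.mean()      # (1 + 2 + 3 + 4 + 100) / 5 = 22.0
median_fill = s.median()  # middle of [1, 2, 3, 4, 100] = 3.0

mean_imputed = s.fillna(mean_fill)
median_imputed = s.fillna(median_fill)

print(mean_imputed.iloc[-1])    # value used to fill the gap with the mean
print(median_imputed.iloc[-1])  # value used to fill the gap with the median
```

Here the mean-imputed value (22.0) looks nothing like most of the observed values, which is the situation where the median is usually the safer fill.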

Imputation Method 2: Zero

Depending on where your data come from, a missing value may be better represented by the number zero (for example, when a missing entry means that nothing was recorded). Replacing missing values with zeros works much like the method above; just pass 0 to fillna instead of the column means.

# replace missing values with zero
df_zero_imputed = df.fillna(0)
df_zero_imputed

idx	X1	X2
0	0.000000	0.875058
1	0.763618	0.772110
2	0.617353	0.643694
3	0.738487	0.542382
4	0.000000	0.071095
5	0.716237	0.685460
6	0.730759	0.000000
7	0.000000	0.209360
8	0.710738	0.547641
9	0.546935	0.000000
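
One thing to keep in mind is that zero-imputation changes a column's summary statistics, while mean-imputation leaves the mean where it was. The sketch below illustrates this with a hypothetical `purchases` column, where I am assuming a missing value means "no purchases recorded".

```python
import numpy as np
import pandas as pd

# hypothetical column where NaN is assumed to mean "no purchases recorded"
purchases = pd.Series([2.0, 4.0, np.nan, 6.0])

zero_filled = purchases.fillna(0)
mean_filled = purchases.fillna(purchases.mean())

original_mean = purchases.mean()  # NaN is skipped: (2 + 4 + 6) / 3
zero_mean = zero_filled.mean()    # zero counts as a value: (2 + 4 + 0 + 6) / 4
mean_mean = mean_filled.mean()    # same as the original mean

print(original_mean, zero_mean, mean_mean)
```

If zero really is the right value for the missing entries, the shifted mean is the correct one; if not, zero-imputation quietly biases the column downward.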

Imputation for Categorical Data

For categorical features, using mean, median, or zero-imputation doesn’t make much sense. Here I’ll create an example dataset with categorical features and show two imputation methods specific to this type of data.

Create a Toy Dataset with Categorical Features

data = {"X1": [np.nan, "Red" , "Blue", "Red", np.nan,
        "Red", "Green", np.nan, "Blue", "Red"],
        "X2": ["Green", "Green", "Red", "Blue", "Green" ,
        "Blue" , np.nan, "Red", "Green", np.nan ]}
colors = pd.DataFrame(data)
print(colors)

idx	X1	X2
0	NaN	Green
1	Red	Green
2	Blue	Red
3	Red	Blue
4	NaN	Green
5	Red	Blue
6	Green	NaN
7	NaN	Red
8	Blue	Green
9	Red	NaN

Imputation Method 1: Most Common Class

One approach to imputing categorical features is to replace missing values with the most common class. You can do this by taking the first index entry (the most common value) from the result of Pandas’ value_counts function, which sorts values by frequency in decreasing order.

# for each column, get value counts in decreasing order and take the index (value) of most common class
df_most_common_imputed = colors.apply(lambda x: x.fillna(x.value_counts().index[0]))
df_most_common_imputed

idx	X1	X2
0	Red	Green
1	Red	Green
2	Blue	Red
3	Red	Blue
4	Red	Green
5	Red	Blue
6	Green	Green
7	Red	Red
8	Blue	Green
9	Red	Green
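
An alternative way to get the same result is pandas' `mode` method. This is a sketch rather than a replacement for the approach above; the names `most_common` and `df_mode_imputed` are my own. It rebuilds the `colors` dataframe so that it runs on its own.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    "X1": [np.nan, "Red", "Blue", "Red", np.nan,
           "Red", "Green", np.nan, "Blue", "Red"],
    "X2": ["Green", "Green", "Red", "Blue", "Green",
           "Blue", np.nan, "Red", "Green", np.nan],
})

# mode() returns a dataframe holding the most frequent value(s) of each column;
# row 0 holds the first mode per column (tied modes come back in sorted order)
most_common = colors.mode().iloc[0]

# passing a Series to fillna fills each column with the value at its label
df_mode_imputed = colors.fillna(most_common)

print(most_common)
print(df_mode_imputed)
```

When two classes are tied for most common, it is worth checking which one you end up with, since either approach will just pick one of them.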

Imputation Method 2: “Unknown” Class

Similar to how it’s sometimes most appropriate to impute a missing numeric feature with zeros, sometimes a categorical feature’s missing-ness itself is valuable information that should be explicitly encoded. If this is the case, most-common-class imputing would cause this information to be lost. Instead, just replace those values with a value like “Unknown” or “Missing.”

df_unknown_imputed = colors.fillna("Unknown")
df_unknown_imputed

idx	X1	X2
0	Unknown	Green
1	Red	Green
2	Blue	Red
3	Red	Blue
4	Unknown	Green
5	Red	Blue
6	Green	Unknown
7	Unknown	Red
8	Blue	Green
9	Red	Unknown
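
One caveat, in case your columns use pandas' category dtype rather than plain strings (the example above uses plain strings, so this is an assumption about your data): a categorical column only accepts values from its declared categories, so "Unknown" has to be added as a category before filling. A minimal sketch, using a hypothetical Series `s`:

```python
import numpy as np
import pandas as pd

# hypothetical column stored with the category dtype
s = pd.Series(["Red", np.nan, "Blue", "Red"], dtype="category")

# "Unknown" is not one of the existing categories, so register it first;
# filling with a value outside the categories is rejected by pandas
s_unknown = s.cat.add_categories(["Unknown"]).fillna("Unknown")

print(s_unknown)
print(s_unknown.cat.categories)
```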

One Final Tip: Column-Specific Imputation Rules

You can combine any of the above methods by imputing specific columns rather than the entire dataframe. Returning to the numeric example, we can mean-impute X1 and median-impute X2 by specifying the column(s) to be imputed.

# mean-impute X1 and median-impute X2
df['X1'] = df['X1'].fillna(df['X1'].mean())
df['X2'] = df['X2'].fillna(df['X2'].median())
df

idx	X1	X2
0	0.689161	0.875058
1	0.763618	0.772110
2	0.617353	0.643694
3	0.738487	0.542382
4	0.689161	0.071095
5	0.716237	0.685460
6	0.730759	0.595668
7	0.689161	0.209360
8	0.710738	0.547641
9	0.546935	0.595668
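
If you would rather not overwrite the original dataframe, fillna also accepts a dictionary that maps column names to fill values. The sketch below applies the same two rules in a single call; the names `fill_values` and `df_imputed` are my own, and the data is the numeric toy dataset from earlier.

```python
import numpy as np
import pandas as pd

data = {"X1": [np.nan, 0.7636183, 0.61735332, 0.73848657, np.nan,
               0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
        "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951,
               0.6854597, np.nan, 0.20935994, 0.54764129, np.nan]}
df = pd.DataFrame(data)

# one fill value per column: mean for X1, median for X2
fill_values = {"X1": df["X1"].mean(), "X2": df["X2"].median()}

# returns a new dataframe; df itself still contains its NaNs
df_imputed = df.fillna(fill_values)

print(df_imputed)
```

Keeping the rules in one dictionary also makes it easy to see at a glance which rule applies to which column.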