# Impute Missing Values

June 01, 2019

Real-world data is full of missing values. You will often need to deal with them before you can train a model or do meaningful analysis. What follows are a few ways to impute (fill) missing values in Python, for both numeric and categorical data.

# Imports

```
import pandas as pd
import numpy as np
```

# Imputation for Numeric Features

## Create a Toy Dataset

```
# create two columns of randomly generated values, replace a few examples with NaNs
data = {"X1": [np.nan, 0.7636183, 0.61735332, 0.73848657, np.nan,
               0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
        "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951,
               0.6854597, np.nan, 0.20935994, 0.54764129, np.nan]}
df = pd.DataFrame(data)
print(df)
```

| idx | X1 | X2 |
|---|---|---|
| 0 | NaN | 0.875058 |
| 1 | 0.763618 | 0.772110 |
| 2 | 0.617353 | 0.643694 |
| 3 | 0.738487 | 0.542382 |
| 4 | NaN | 0.071095 |
| 5 | 0.716237 | 0.685460 |
| 6 | 0.730759 | NaN |
| 7 | NaN | 0.209360 |
| 8 | 0.710738 | 0.547641 |
| 9 | 0.546935 | NaN |
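Before choosing an imputation method, it can help to see how much is actually missing in each column. The following is a minimal sketch using the toy data above; the names `missing_counts` and `missing_share` are my own, not part of any API.

```python
import numpy as np
import pandas as pd

# same toy data as above
df = pd.DataFrame({
    "X1": [np.nan, 0.7636183, 0.61735332, 0.73848657, np.nan,
           0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
    "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951,
           0.6854597, np.nan, 0.20935994, 0.54764129, np.nan],
})

# isna() marks each missing cell as True; summing counts them per column
missing_counts = df.isna().sum()
# the mean of a boolean column is the fraction of True values
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Here X1 has three missing values and X2 has two.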

## Imputation Method 1: Mean or Median

A common method of imputation with numeric features is to replace missing values with the mean of the feature’s non-missing values. If the data have outliers, you may want to use the median instead. Either method is easy in Pandas:

```
# replace missing values with the column mean
df_mean_imputed = df.fillna(df.mean())
# or, if outliers are a concern, with the column median
df_median_imputed = df.fillna(df.median())
df_mean_imputed
```

| idx | X1 | X2 |
|---|---|---|
| 0 | 0.689161 | 0.875058 |
| 1 | 0.763618 | 0.772110 |
| 2 | 0.617353 | 0.643694 |
| 3 | 0.738487 | 0.542382 |
| 4 | 0.689161 | 0.071095 |
| 5 | 0.716237 | 0.685460 |
| 6 | 0.730759 | 0.543350 |
| 7 | 0.689161 | 0.209360 |
| 8 | 0.710738 | 0.547641 |
| 9 | 0.546935 | 0.543350 |
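To see why the median can be the safer choice when outliers are present, here is a small sketch. The series `s` is made up for illustration and is not part of the toy dataset above.

```python
import numpy as np
import pandas as pd

# hypothetical feature: four ordinary values, one outlier, one missing value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

mean_fill = s.mean()      # pulled upward by the outlier
median_fill = s.median()  # stays near the typical values

print("mean fill:  ", mean_fill)
print("median fill:", median_fill)

s_median_imputed = s.fillna(median_fill)
```

The mean of the observed values is 22.0, which is larger than four of the five observations, while the median is 3.0.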

## Imputation Method 2: Zero

Depending on where your data come from, a missing value may be better represented by the number zero. Replacing missing values with zeros works much like the method above; just pass zero to `fillna` instead of the column means.

```
# replace missing values with zero
df_zero_imputed = df.fillna(0)
df_zero_imputed
```

| idx | X1 | X2 |
|---|---|---|
| 0 | 0.000000 | 0.875058 |
| 1 | 0.763618 | 0.772110 |
| 2 | 0.617353 | 0.643694 |
| 3 | 0.738487 | 0.542382 |
| 4 | 0.000000 | 0.071095 |
| 5 | 0.716237 | 0.685460 |
| 6 | 0.730759 | 0.000000 |
| 7 | 0.000000 | 0.209360 |
| 8 | 0.710738 | 0.547641 |
| 9 | 0.546935 | 0.000000 |
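Keep in mind that zero-filling changes a column's summary statistics, so it is worth comparing them before and after. The sketch below does this for the toy data; the `comparison` frame and its column labels are names I chose for illustration.

```python
import numpy as np
import pandas as pd

# same toy data as above
df = pd.DataFrame({
    "X1": [np.nan, 0.7636183, 0.61735332, 0.73848657, np.nan,
           0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
    "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951,
           0.6854597, np.nan, 0.20935994, 0.54764129, np.nan],
})

df_zero_imputed = df.fillna(0)

# df.mean() skips NaNs by default, so the first column is the mean
# of the observed values only
comparison = pd.DataFrame({
    "mean_ignoring_nan": df.mean(),
    "mean_after_zero_fill": df_zero_imputed.mean(),
})
print(comparison)
```

For X1, the mean drops from about 0.689 to about 0.482 once the three missing values are counted as zeros. That is fine if zero really is the right value, and misleading if it is not.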

# Imputation for Categorical Data

For categorical features, using mean, median, or zero-imputation doesn’t make much sense. Here I’ll create an example dataset with categorical features and show two imputation methods specific to this type of data.

## Create a Toy Dataset with Categorical Features

```
data = {"X1": [np.nan, "Red", "Blue", "Red", np.nan,
               "Red", "Green", np.nan, "Blue", "Red"],
        "X2": ["Green", "Green", "Red", "Blue", "Green",
               "Blue", np.nan, "Red", "Green", np.nan]}
colors = pd.DataFrame(data)
print(colors)
```

| idx | X1 | X2 |
|---|---|---|
| 0 | NaN | Green |
| 1 | Red | Green |
| 2 | Blue | Red |
| 3 | Red | Blue |
| 4 | NaN | Green |
| 5 | Red | Blue |
| 6 | Green | NaN |
| 7 | NaN | Red |
| 8 | Blue | Green |
| 9 | Red | NaN |

## Imputation Method 1: Most Common Class

One approach to imputing categorical features is to replace missing values with the most common class. You can do this by taking the first index entry returned by Pandas’ `value_counts` function, which sorts classes by decreasing frequency and ignores missing values by default.

```
# for each column, get value counts in decreasing order and take the index (value) of most common class
df_most_common_imputed = colors.apply(lambda x: x.fillna(x.value_counts().index[0]))
df_most_common_imputed
```

| idx | X1 | X2 |
|---|---|---|
| 0 | Red | Green |
| 1 | Red | Green |
| 2 | Blue | Red |
| 3 | Red | Blue |
| 4 | Red | Green |
| 5 | Red | Blue |
| 6 | Green | Green |
| 7 | Red | Red |
| 8 | Blue | Green |
| 9 | Red | Green |
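An alternative to the `value_counts` approach is `DataFrame.mode`, which computes the most common value of every column in one call. This is a sketch of that variant rather than a replacement for the code above; `most_common` and `df_mode_imputed` are names I made up.

```python
import numpy as np
import pandas as pd

# same categorical toy data as above
colors = pd.DataFrame({
    "X1": [np.nan, "Red", "Blue", "Red", np.nan,
           "Red", "Green", np.nan, "Blue", "Red"],
    "X2": ["Green", "Green", "Red", "Blue", "Green",
           "Blue", np.nan, "Red", "Green", np.nan],
})

# mode() returns one row per tied mode; the first row holds
# one most-common value for each column
most_common = colors.mode().iloc[0]

# passing a Series to fillna fills each column with its matching entry
df_mode_imputed = colors.fillna(most_common)
print(df_mode_imputed)
```

On this data, both approaches fill X1 with Red and X2 with Green. If two classes are tied for most common, the two approaches may pick different ones, so check for ties if that matters for your data.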

## Imputation Method 2: “Unknown” Class

Just as zero is sometimes the most appropriate fill for a missing numeric feature, a categorical feature’s missingness can itself be valuable information that should be explicitly encoded. In that case, most-common-class imputation would discard this information. Instead, replace the missing values with a placeholder class such as “Unknown” or “Missing.”

```
df_unknown_imputed = colors.fillna("Unknown")
df_unknown_imputed
```

| idx | X1 | X2 |
|---|---|---|
| 0 | Unknown | Green |
| 1 | Red | Green |
| 2 | Blue | Red |
| 3 | Red | Blue |
| 4 | Unknown | Green |
| 5 | Red | Blue |
| 6 | Green | Unknown |
| 7 | Unknown | Red |
| 8 | Blue | Green |
| 9 | Red | Unknown |
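One way to check that the missingness really survives as its own class is to one-hot encode an imputed column. The sketch below uses `pd.get_dummies` on X1; one-hot encoding is my choice for illustration, and any encoder that treats “Unknown” as a regular category would show the same thing.

```python
import numpy as np
import pandas as pd

# same categorical toy data as above
colors = pd.DataFrame({
    "X1": [np.nan, "Red", "Blue", "Red", np.nan,
           "Red", "Green", np.nan, "Blue", "Red"],
    "X2": ["Green", "Green", "Red", "Blue", "Green",
           "Blue", np.nan, "Red", "Green", np.nan],
})

df_unknown_imputed = colors.fillna("Unknown")

# each distinct class, including "Unknown", becomes its own indicator column
x1_dummies = pd.get_dummies(df_unknown_imputed["X1"])
print(x1_dummies.sum())
```

The `Unknown` column flags exactly the three rows where X1 was originally missing, so a downstream model can use that signal.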

# One Final Tip: Column-Specific Imputation Rules

You can combine any of the above methods by imputing specific columns rather than the entire dataframe. Returning to the numeric example, we can mean-impute X1 and median-impute X2 by specifying the column(s) to be imputed.

```
# mean-impute X1 and median-impute X2
df['X1'] = df['X1'].fillna(df['X1'].mean())
df['X2'] = df['X2'].fillna(df['X2'].median())
df
```

| idx | X1 | X2 |
|---|---|---|
| 0 | 0.689161 | 0.875058 |
| 1 | 0.763618 | 0.772110 |
| 2 | 0.617353 | 0.643694 |
| 3 | 0.738487 | 0.542382 |
| 4 | 0.689161 | 0.071095 |
| 5 | 0.716237 | 0.685460 |
| 6 | 0.730759 | 0.595668 |
| 7 | 0.689161 | 0.209360 |
| 8 | 0.710738 | 0.547641 |
| 9 | 0.546935 | 0.595668 |
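The same column-specific rules can also be written as a single `fillna` call, since `fillna` accepts a dict that maps column names to fill values. This is a minimal sketch of that variant; `fill_values` and `df_imputed` are names I chose.

```python
import numpy as np
import pandas as pd

# same numeric toy data as above
df = pd.DataFrame({
    "X1": [np.nan, 0.7636183, 0.61735332, 0.73848657, np.nan,
           0.71623709, 0.73075927, np.nan, 0.71073827, 0.54693503],
    "X2": [0.87505771, 0.77210971, 0.64369448, 0.54238232, 0.0710951,
           0.6854597, np.nan, 0.20935994, 0.54764129, np.nan],
})

# one fill value per column: mean for X1, median for X2
fill_values = {"X1": df["X1"].mean(), "X2": df["X2"].median()}

# returns a new dataframe and leaves the original df untouched
df_imputed = df.fillna(fill_values)
print(df_imputed)
```

Unlike the column-by-column assignment above, this version does not overwrite `df`, which is handy if you want to compare several imputation strategies on the same raw data.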