Handle Missing Data with Scikit-Learn’s Iterative Imputer

Gustavo R Santos
5 min read · Nov 22, 2024

This method predicts missing values based on relationships with other features.

Image created by OpenAI DALL·E, 2024. https://openai.com. Missing Data.

Introduction

Scikit-Learn is an outstanding package. I have been a huge fan of it ever since I migrated to Data Science in 2019.

The module is easy to use and tremendously coherent: all its estimators work essentially the same way, following a fit_transform or fit + predict style.

Recently, I have been exploring some experimental methods that add even more value to scikit-learn.

One of those is the IterativeImputer. According to the documentation, it is a multivariate imputer that estimates each feature from all the others.

Iterative Imputer estimates the missing values by modeling variables with missing data using the other features as predictors.
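Conceptually, it works like the MICE family of algorithms: start from a naive fill, then cycle over the columns, re-predicting the missing entries of each one from all the others until the values stabilize. Here is a minimal sketch of that round-robin idea (my own illustration using a plain LinearRegression; scikit-learn’s actual implementation uses BayesianRidge by default, plus convergence checks and more options):

import numpy as np
from sklearn.linear_model import LinearRegression

def round_robin_impute(X, n_iter=10):
    """Illustrative round-robin imputation on a 2D float array with NaNs."""
    X = X.copy()
    missing = np.isnan(X)

    # Initial guess: fill every hole with its column mean
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]

    # Cycle over columns, re-predicting each column's holes from the others
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            model = LinearRegression()
            model.fit(others[~missing[:, j]], X[~missing[:, j], j])
            X[missing[:, j], j] = model.predict(others[missing[:, j]])
    return X

Each pass refines the estimates, because the predictors themselves improve as their own holes are re-imputed.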

The method looks very clever. Let’s test it.

Iterative Imputer

We start by importing some modules.

# IterativeImputer is still experimental, so it must be explicitly enabled
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np
import seaborn as sns

Next, let’s create a very simple use case, just so we understand how the algorithm works. We will build a small dataset with three perfectly correlated features: feature1 runs from 1 to 4, feature2 is feature1 × 10, and feature3 is feature1 × 100. Then we knock out the values 3 and 20.

# Sample dataset with missing values
data = pd.DataFrame({
    'feature1': [1, 2, None, 4],
    'feature2': [10, None, 30, 40],
    'feature3': [100, 200, 300, 400]
})
Example data created. Image by the author.
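If you cannot see the image, printing the frame shows exactly where the holes are (pandas converts None to NaN and casts the affected columns to float):

print(data)

[OUT]
   feature1  feature2  feature3
0       1.0      10.0       100
1       2.0       NaN       200
2       NaN      30.0       300
3       4.0      40.0       400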

Next, we fit and transform the data to estimate the missing values, and then we print the result.

# Apply Iterative Imputer  
imputer = IterativeImputer(random_state=42)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

print(imputed_data)

[OUT]
   feature1  feature2  feature3
0       1.0      10.0     100.0
1       2.0      20.0     200.0
2       3.0      30.0     300.0
3       4.0      40.0     400.0

As we can see, it worked perfectly. Since each feature here is an exact linear function of the others, the default estimator (BayesianRidge) recovers the missing values essentially exactly.

Ok, so let’s move up a notch now and use a dataset that is a little more complex (not that complex, really, but enough for our educational purposes). We will load the famous mpg dataset (open BSD 3 license), create a copy of the mpg variable, and add 5 NAs to it so we can compare the results later.

# Load data
df = sns.load_dataset("mpg")

# Create a copy of mpg where we will add some NAs
df['mpg_na'] = df['mpg']

# Shuffle the copy, then add 5 NAs at random positions
# (note that the shuffle also breaks mpg_na's row-wise link to the other features)
df['mpg_na'] = df['mpg_na'].sample(frac=1).reset_index(drop=True)
df.loc[np.random.choice(df.index, 5, replace=False), 'mpg_na'] = np.nan

Here are the resulting NAs isolated.

# NaN is the only value not equal to itself, so this query isolates the NAs
df.query('mpg_na != mpg_na')
NAs intentionally added to the dataset. Image by the author.

Finally, we drop the categorical columns, since this method requires numeric input. Now let’s fit the imputer to the data and collect the results.

data = df.drop(['origin', 'name'], axis=1)

# Apply Iterative Imputer (default settings) over the DF
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Slice the imputed data at the rows where the NAs landed in this run
imputed_data.iloc[[78, 131, 284, 298, 319]]
Imputed data. Image by the author.

The imputed values do not look good compared to the originals: they all float around the mpg mean (23.51). That is partly expected, though: shuffling mpg_na broke its row-wise relationship with the other features, leaving little signal to model. Either way, out of the box, IterativeImputer did not do a much better job here than a simple mean imputer would.
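To make that comparison concrete, here is a quick baseline (my addition, not in the original post) that fills the same holes with SimpleImputer’s column mean:

from sklearn.impute import SimpleImputer

# Mean-impute the same DataFrame as a baseline
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(data),
    columns=data.columns
)

# Compare both imputers at the rows that originally held NAs
na_idx = df.index[df['mpg_na'].isna()]
print(pd.DataFrame({
    'iterative': imputed_data.loc[na_idx, 'mpg_na'],
    'mean': mean_imputed.loc[na_idx, 'mpg_na'],
}))

If the two columns come out close to each other, the iterative model is adding little beyond the mean.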

We can try with fewer variables too, but the result does not seem to improve.

data = df[['cylinders', 'horsepower', 'weight', 'mpg_na']]

# Apply Iterative Imputer over the reduced DF
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Slice the imputed data at the same rows
imputed_data.iloc[[78, 131, 284, 298, 319]]
Imputed data with fewer variables. Image by the author.

Let’s try it one last time, now using the titanic dataset. This dataset already has 177 NAs in the variable age.

# Load data
df = sns.load_dataset("titanic")

# Add a flag column: 1 where age is NA, 0 otherwise
df['na'] = np.where(df['age'].isna(), 1, 0)

# Keep only the numeric columns
data = df.select_dtypes(include='number')

# Apply Iterative Imputer over the DF, this time with tuned arguments
imputer = IterativeImputer(n_nearest_features=10,
                           sample_posterior=True,
                           max_iter=100,
                           min_value=0,
                           random_state=42)

imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)

# Slice the imputed data at the rows flagged as originally missing
imputed_data.query('na==1')

This is the result of the IterativeImputer.

Imputer results on the Titanic data. Image by the author.
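As a quick sanity check (my addition, not in the original post), we can confirm the NA count before imputation and verify that none remain afterwards:

print(df['age'].isna().sum())            # -> 177 missing ages before imputing
print(imputed_data['age'].isna().sum())  # -> 0 after imputing
print(len(imputed_data.query('na==1')))  # -> 177 rows were imputed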

These results look much better. The arguments we passed improved the iterations and enhanced the estimates; the values are no longer clustered around the mean.

  • n_nearest_features=10: number of other features used to estimate the missing values of each feature column (when this exceeds the available features, all of them are used).
  • sample_posterior=True: whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation (see the sketch after this list).
  • max_iter=100: increases the number of imputation rounds for better results.
  • min_value=0: the minimum value that can be imputed.
  • random_state=42: sets a seed for reproducibility.
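Since sample_posterior=True draws each imputed value from the predictive posterior rather than taking the point estimate, different seeds produce different imputations; repeating the run is the building block of multiple imputation. A small sketch of that idea (my addition, reusing the titanic data prepared above):

# Run the same imputation with three seeds; with sample_posterior=True,
# the imputed ages differ from run to run, reflecting model uncertainty
runs = {}
for seed in (0, 1, 2):
    imp = IterativeImputer(sample_posterior=True, max_iter=100,
                           min_value=0, random_state=seed)
    out = pd.DataFrame(imp.fit_transform(data), columns=data.columns)
    runs[f'seed_{seed}'] = out.query('na==1')['age']

print(pd.DataFrame(runs).head())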

Before You Go

Well, it is always great to learn new methods. Scikit-learn is a great package, and its maintainers keep adding value. This method is still experimental, so it may still change and improve.

We got weak predictions with the algorithm out of the box on the mpg dataset, but the results improved once we tuned the hyperparameters on the titanic data. So, it is all about finding the right hyperparameters.

That’s a task I leave to you!


If you like this quick content, clap, share, and follow me for more.

Find me on LinkedIn.

Code on GitHub
