Handle Missing Data with Scikit-Learn's Iterative Imputer
This method predicts missing values based on relationships with other features.
Introduction
Scikit-learn is an outstanding package. I have been a huge fan of it since I migrated to data science in 2019.
The module is easy to use and tremendously coherent. All of its methods work essentially the same way, with an estimator fit_transform or a fit + predict style.
Recently, I have been seeing some experimental methods that add even more value to scikit-learn.
One of those is the IterativeImputer. According to the documentation, it is a multivariate imputer that estimates each feature from all the others.
Iterative Imputer estimates the missing values by modeling variables with missing data using the other features as predictors.
The method looks very clever. Let’s test it.
Iterative Imputer
We start by importing some modules.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np
import seaborn as sns
Next, let’s create a very simple use case, just so we understand how the algorithm works. We will build a small dataset with three columns that count up in multiples of 1, 10, and 100 (1 to 4, 10 to 40, 100 to 400), and we will knock out the values 3 and 20.
# Sample dataset with missing values
data = pd.DataFrame({
    'feature1': [1, 2, None, 4],
    'feature2': [10, None, 30, 40],
    'feature3': [100, 200, 300, 400]
})
Next, we fit and transform the data to estimate the missing values, and finally print the result.
# Apply Iterative Imputer
imputer = IterativeImputer(random_state=42)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed_data)
----
[OUT]
   feature1  feature2  feature3
0       1.0      10.0     100.0
1       2.0      20.0     200.0
2       3.0      30.0     300.0
3       4.0      40.0     400.0
As we can see, it worked perfectly: the imputer recovered the missing 3 and 20, since each column is a linear function of the others.
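Under the hood, IterativeImputer fits a regressor (a BayesianRidge by default) for each column that has missing values, using the other columns as predictors, and cycles through the columns until the estimates stabilize. Here is a minimal sketch of that round-robin idea, reusing the data variable from above; it is a simplification for illustration, not the library's exact implementation.
from sklearn.linear_model import BayesianRidge
# Rough sketch of the round-robin idea behind IterativeImputer.
# Start from a simple initial fill (column means), then repeatedly
# re-predict each originally missing cell from the other columns.
filled = data.fillna(data.mean())
na_mask = data.isna()
for _ in range(10):  # a few round-robin passes
    for col in data.columns:
        if not na_mask[col].any():
            continue
        others = [c for c in data.columns if c != col]
        model = BayesianRidge()
        # Train on rows where `col` was actually observed
        model.fit(filled.loc[~na_mask[col], others], data.loc[~na_mask[col], col])
        # Re-predict only the missing cells
        filled.loc[na_mask[col], col] = model.predict(filled.loc[na_mask[col], others])
print(filled)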
Ok, so let’s move up a notch now and use a dataset that is a little more complex (not that complex, really, but something that fits our educational purposes). We will load the famous mpg dataset (BSD 3-Clause license), create a copy of the mpg variable, and add 5 NAs to it, so we can compare the results later.
# Load data
df = sns.load_dataset("mpg")
# Create a new column where we will add some NAs
df['mpg_na'] = df['mpg']
# Add 5 NAs at random positions, keeping the chosen indices so we can
# inspect the same rows later (no seed is set, so your rows will differ)
na_idx = np.sort(np.random.choice(df.index, 5, replace=False))
df.loc[na_idx, 'mpg_na'] = np.nan
Here are the resulting NAs isolated.
df.query('mpg_na != mpg_na')  # NaN never equals itself, so this keeps only the missing rows
Next, we drop the categorical columns, since this method requires numeric input, and also the original mpg column, so the imputer cannot simply copy it. Now let’s fit the imputer to the data and collect the results.
data = df.drop(['origin', 'name', 'mpg'], axis=1)
# Apply Iterative Imputer over the DataFrame
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the rows that originally had NAs
# (in the author's run these were rows 78, 131, 284, 298, 319)
imputed_data.loc[na_idx]
The imputed data does not look too good compared to what we had originally. In this run, the values were all floating around the mpg mean (23.51457). So, IterativeImputer didn’t do a much better job than a simple imputer would.
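To make that claim concrete, we can put the two side by side: fill the same rows with SimpleImputer's column mean and compare the absolute errors against the true mpg values. This is a quick sanity check that reuses the data, imputed_data, and na_idx variables from above, not a rigorous benchmark.
from sklearn.impute import SimpleImputer
# Baseline: fill NAs with the column mean
simple = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(data),
                      columns=data.columns)
comparison = pd.DataFrame({
    'true_mpg': df.loc[na_idx, 'mpg'],
    'iterative': imputed_data.loc[na_idx, 'mpg_na'],
    'simple_mean': simple.loc[na_idx, 'mpg_na'],
})
comparison['iter_abs_err'] = (comparison['iterative'] - comparison['true_mpg']).abs()
comparison['mean_abs_err'] = (comparison['simple_mean'] - comparison['true_mpg']).abs()
print(comparison)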
We can try with fewer variables too, but the result does not seem to improve.
data = df[['cylinders', 'horsepower', 'weight', 'mpg_na']]
# Apply Iterative Imputer over the DataFrame
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the same NA rows again
imputed_data.loc[na_idx]
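We can also quantify "does not seem to improve" with a single number, the mean absolute error on the imputed rows (again reusing na_idx from above):
# Average absolute distance between imputed and true mpg values
mae = (imputed_data.loc[na_idx, 'mpg_na'] - df.loc[na_idx, 'mpg']).abs().mean()
print(f"Mean absolute error on the imputed rows: {mae:.2f}")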
Let’s try it one last time, using the titanic dataset now. This dataset already has 177 NAs in the variable age.
# Load data
df = sns.load_dataset("titanic")
# Add a new column where NA = 1, other = 0
df['na'] = np.where(df['age'].isna(), 1, 0)
data = df.select_dtypes(include='number')
# Apply Iterative Imputer over DF
imputer = IterativeImputer(n_nearest_features=10,
                           sample_posterior=True,
                           max_iter=100,
                           min_value=0,
                           random_state=42)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the rows where age was originally missing
imputed_data.query('na==1')
This is the result of the IterativeImputer.
It looks much better: the arguments we used improved the iterations and enhanced the estimates, and this time the results are not all clustered around the mean. Here is what each argument does:
- n_nearest_features=10: the number of other features used to estimate the missing values of each feature column (our titanic subset has fewer than 10 numeric columns, so this effectively uses all of them).
- sample_posterior=True: whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation (see the sketch after this list).
- max_iter=100: increases the number of imputation rounds for better results.
- min_value=0: the minimum value that can be imputed.
- random_state=42: sets a seed for reproducibility.
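Because sample_posterior=True draws values from a predictive distribution rather than returning a single point estimate, running the imputer several times with different seeds produces different plausible fills. Comparing those draws is the core idea behind multiple imputation. Here is a small sketch of that, reusing the titanic data and df variables from above; the loop over seeds is my own illustration, not part of the original article.
# With sample_posterior=True, each run draws different plausible values;
# the spread across draws reveals the imputer's uncertainty per row
draws = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, max_iter=100,
                           min_value=0, random_state=seed)
    filled = pd.DataFrame(imp.fit_transform(data), columns=data.columns)
    draws.append(filled['age'])
draws_df = pd.concat(draws, axis=1)
# Standard deviation across draws, for the rows that were originally missing
print(draws_df[df['na'] == 1].std(axis=1).describe())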
Before You Go
Well, it is always great to learn new methods. sklearn is a great module, and they keep adding value. This method is still experimental, so it can be improved.
We had weak predictions with the algorithm “out of the box” for the mpg dataset, but then we tweaked it when working with the titanic. So, it is all about finding the right hyperparameters.
That’s a task I leave to you!
If you like this quick content, clap, share, and follow me for more.
Find me on LinkedIn.