Handle Missing Data with Scikit-Learn's Iterative Imputer
This method predicts missing values based on relationships with other features.
Introduction
Scikit-learn is an outstanding package. I have been a huge fan of it since I migrated to data science in 2019.
The module is easy to use and tremendously coherent. All of its methods work essentially the same way, with an estimator fit_transform or a fit + predict style.
Recently, I have been seeing some experimental methods that add even more value to scikit-learn.
One of those is the IterativeImputer. According to the documentation, it is a multivariate imputer that estimates each feature from all the others.
Iterative Imputer estimates the missing values by modeling variables with missing data using the other features as predictors.
The method looks very clever. Let’s test it.
Iterative Imputer
We start by importing some modules.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np
import seaborn as sns
Next, let’s create a very simple use case, just so we understand how the algorithm works. We will build a small dataset with three columns that count up in multiples of 1, 10, and 100 (1 to 4, 10 to 40, 100 to 400), and we will knock out the values 3 and 20.
# Sample dataset with missing values
data = pd.DataFrame({
    'feature1': [1, 2, None, 4],
    'feature2': [10, None, 30, 40],
    'feature3': [100, 200, 300, 400]
})
Next, we fit and transform the data to estimate the missing values, and finally print the result.
# Apply Iterative Imputer
imputer = IterativeImputer(random_state=42)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(imputed_data)
----
[OUT]
   feature1  feature2  feature3
0       1.0      10.0     100.0
1       2.0      20.0     200.0
2       3.0      30.0     300.0
3       4.0      40.0     400.0
As we can see, it worked perfectly: the imputer recovered the missing 3 and 20, since each column is a linear function of the others.
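Under the hood, IterativeImputer fits a regressor (a BayesianRidge by default) for each column that has missing values, using the other columns as predictors, and cycles through the columns until the estimates stabilize. Here is a minimal sketch of that round-robin idea, reusing the data variable from above; it is a simplification for illustration, not the library's exact implementation.
from sklearn.linear_model import BayesianRidge
# Rough sketch of the round-robin idea behind IterativeImputer.
# Start from a simple initial fill (column means), then repeatedly
# re-predict each originally missing cell from the other columns.
filled = data.fillna(data.mean())
na_mask = data.isna()
for _ in range(10):  # a few round-robin passes
    for col in data.columns:
        if not na_mask[col].any():
            continue
        others = [c for c in data.columns if c != col]
        model = BayesianRidge()
        # Train on rows where `col` was actually observed
        model.fit(filled.loc[~na_mask[col], others], data.loc[~na_mask[col], col])
        # Re-predict only the missing cells
        filled.loc[na_mask[col], col] = model.predict(filled.loc[na_mask[col], others])
print(filled)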
Ok, so let’s move up a notch now and use a dataset that is a little more complex (not that complex, really, but something that fits our educational purposes). We will load the famous mpg dataset (BSD 3-Clause license), create a copy of the mpg variable, and add 5 NAs to it, so we can compare the results later.
# Load data
df = sns.load_dataset("mpg")
# Create a new column where we will add some NAs
df['mpg_na'] = df['mpg']
# Add 5 NAs at random positions, keeping the chosen indices so we can
# inspect the same rows later (no seed is set, so your rows will differ)
na_idx = np.sort(np.random.choice(df.index, 5, replace=False))
df.loc[na_idx, 'mpg_na'] = np.nan
Here are the resulting NAs isolated.
df.query('mpg_na != mpg_na')  # NaN never equals itself, so this keeps only the missing rows
Next, we drop the categorical columns, since this method requires numeric input, and also the original mpg column, so the imputer cannot simply copy it. Now let’s fit the imputer to the data and collect the results.
data = df.drop(['origin', 'name', 'mpg'], axis=1)
# Apply Iterative Imputer over the DataFrame
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the rows that originally had NAs
# (in the author's run these were rows 78, 131, 284, 298, 319)
imputed_data.loc[na_idx]
The imputed data does not look too good compared to what we had originally. In this run, the values were all floating around the mpg mean (23.51457). So, IterativeImputer didn’t do a much better job than a simple imputer would.
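To make that claim concrete, we can put the two side by side: fill the same rows with SimpleImputer's column mean and compare the absolute errors against the true mpg values. This is a quick sanity check that reuses the data, imputed_data, and na_idx variables from above, not a rigorous benchmark.
from sklearn.impute import SimpleImputer
# Baseline: fill NAs with the column mean
simple = pd.DataFrame(SimpleImputer(strategy='mean').fit_transform(data),
                      columns=data.columns)
comparison = pd.DataFrame({
    'true_mpg': df.loc[na_idx, 'mpg'],
    'iterative': imputed_data.loc[na_idx, 'mpg_na'],
    'simple_mean': simple.loc[na_idx, 'mpg_na'],
})
comparison['iter_abs_err'] = (comparison['iterative'] - comparison['true_mpg']).abs()
comparison['mean_abs_err'] = (comparison['simple_mean'] - comparison['true_mpg']).abs()
print(comparison)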
We can try with fewer variables too, but the result does not seem to improve.
data = df[['cylinders', 'horsepower', 'weight', 'mpg_na']]
# Apply Iterative Imputer over the DataFrame
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the same NA rows again
imputed_data.loc[na_idx]
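We can also quantify "does not seem to improve" with a single number, the mean absolute error on the imputed rows (again reusing na_idx from above):
# Average absolute distance between imputed and true mpg values
mae = (imputed_data.loc[na_idx, 'mpg_na'] - df.loc[na_idx, 'mpg']).abs().mean()
print(f"Mean absolute error on the imputed rows: {mae:.2f}")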
Let’s try it one last time, using the titanic dataset now. This dataset already has 177 NAs in the variable age.
# Load data
df = sns.load_dataset("titanic")
# Add a new column where NA = 1, other = 0
df['na'] = np.where(df['age'].isna(), 1, 0)
data = df.select_dtypes(include='number')
# Apply Iterative Imputer over DF
imputer = IterativeImputer(n_nearest_features=10,
                           sample_posterior=True,
                           max_iter=100,
                           min_value=0,
                           random_state=42)
imputed_data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
# Inspect the rows where age was originally missing
imputed_data.query('na==1')
This is the result of the IterativeImputer.
It looks much better: the arguments we used improved the iterations and enhanced the estimates, and this time the results are not all clustered around the mean. Here is what each argument does:
- n_nearest_features=10: the number of other features used to estimate the missing values of each feature column (our titanic subset has fewer than 10 numeric columns, so this effectively uses all of them).
- sample_posterior=True: whether to sample from the (Gaussian) predictive posterior of the fitted estimator for each imputation (see the sketch after this list).
- max_iter=100: increases the number of imputation rounds for better results.
- min_value=0: the minimum value that can be imputed.
- random_state=42: sets a seed for reproducibility.
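Because sample_posterior=True draws values from a predictive distribution rather than returning a single point estimate, running the imputer several times with different seeds produces different plausible fills. Comparing those draws is the core idea behind multiple imputation. Here is a small sketch of that, reusing the titanic data and df variables from above; the loop over seeds is my own illustration, not part of the original article.
# With sample_posterior=True, each run draws different plausible values;
# the spread across draws reveals the imputer's uncertainty per row
draws = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, max_iter=100,
                           min_value=0, random_state=seed)
    filled = pd.DataFrame(imp.fit_transform(data), columns=data.columns)
    draws.append(filled['age'])
draws_df = pd.concat(draws, axis=1)
# Standard deviation across draws, for the rows that were originally missing
print(draws_df[df['na'] == 1].std(axis=1).describe())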
Before You Go
Well, it is always great to learn new methods. sklearn is a great module, and they keep adding value. This method is still experimental, so it can be improved.
We had weak predictions with the algorithm “out of the box” for the mpg dataset, but then we tweaked it when working with the titanic. So, it is all about finding the right hyperparameters.
That’s a task I leave to you!
If you like this quick content, clap, share, and follow me for more.
Find me on LinkedIn.