How to Use Mean Imputation to Replace Missing Values in Python?

 

Mean imputation is a statistical technique for filling in missing values in a data set: each missing value is replaced with the mean of the observed values of the same variable. It has traditionally been common with survey data, where it is often difficult to collect complete information from every respondent. You can still use mean imputation in a data science project today to impute missing values.

 

Advantages of mean imputation

Mean imputation has a few practical advantages. It is simple and fast, it replaces missing data with a plausible value, and it keeps every row in the data set, so the sample size is not reduced by dropping incomplete cases. It also leaves the observed mean of the variable unchanged. Mean imputation is suited to numeric (interval or ratio) variables; when a variable is strongly skewed or contains extreme outliers, the median is usually a safer replacement, and the mode is the usual choice for categorical variables. This approach should be employed with care, as it can sometimes result in significant bias.

 

For example, suppose 5 percent of the cases in a survey sample of 1000 people are missing a value for a given outcome variable. If those cases were removed completely at random, the observed values are still representative, and their mean is a reasonable stand-in for the missing ones. If, instead, most of the cases with missing data would have had low values on the outcome variable, the observed mean is too high, and imputing it leads to an overestimate of the variable's average.
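
A small simulation can make the role of the missingness mechanism concrete. The sketch below uses synthetic data (the normal distribution, its parameters, and the seed are all assumptions made for illustration) and compares the value that mean imputation would fill in when 5 percent of 1000 values go missing at random with the value it would fill in when the 50 lowest values are the ones missing.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility

# Synthetic outcome variable for 1000 hypothetical respondents
values = rng.normal(loc=50, scale=10, size=1000)
true_mean = values.mean()

# Case 1: 5 percent of values go missing completely at random
random_idx = rng.choice(values.size, size=50, replace=False)
observed_random = np.delete(values, random_idx)

# Case 2: the 50 lowest values are the ones that go missing
lowest_idx = np.argsort(values)[:50]
observed_low = np.delete(values, lowest_idx)

# The observed mean is the value mean imputation would fill in
mean_random = observed_random.mean()
mean_low = observed_low.mean()

print(f"true mean:                   {true_mean:.2f}")
print(f"imputed value, random gaps:  {mean_random:.2f}")
print(f"imputed value, low-end gaps: {mean_low:.2f}")
```

In the random case the imputed value should land close to the true mean, while in the second case it sits above it, because the low values that would pull the mean down are exactly the ones that were never observed.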

 

Disadvantages of mean imputation

Mean imputation is not always applicable, however. It is most reasonable when values are missing completely at random, that is, when whether a value is missing does not depend on the value itself or on other variables. It is a poor fit when the missingness is systematic, as can be the case with some psychological tests.

 

There are several disadvantages to using mean imputation.

  • First, it can introduce bias into the data when values are not missing completely at random.
  • Second, it shrinks the variance of the imputed variable, which leads to standard errors that are too small.
  • Third, it weakens the relationships between variables, so correlations and regression coefficients can be distorted.
  • Fourth, it can produce biased estimates of the population mean and standard deviation.
  • Finally, it can produce imputations that are not representative of the underlying data, since every missing case of a variable receives the same value.
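
The effect on variability and on relationships between variables can be checked directly. The following is a minimal sketch on synthetic data; the variables x and y, the 30 percent missing rate, and the seed are assumptions made for illustration, not results from a real study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

# Two correlated synthetic variables
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)
data = pd.DataFrame({"x": x, "y": y})

# Remove 30 percent of x at random
missing_rows = rng.choice(len(data), size=150, replace=False)
with_gaps = data.copy()
with_gaps.loc[missing_rows, "x"] = np.nan

# Fill the gaps with the mean of the observed x values
imputed = with_gaps.copy()
imputed["x"] = imputed["x"].fillna(imputed["x"].mean())

std_observed = with_gaps["x"].std()                  # NaN values are skipped
std_imputed = imputed["x"].std()
corr_observed = with_gaps["x"].corr(with_gaps["y"])  # uses complete pairs only
corr_imputed = imputed["x"].corr(imputed["y"])

print(f"std of x:   observed {std_observed:.3f}, after imputation {std_imputed:.3f}")
print(f"corr(x, y): observed {corr_observed:.3f}, after imputation {corr_imputed:.3f}")
```

The mean of x is unchanged by the imputation, but its standard deviation and its correlation with y both come out smaller, because the filled-in rows all sit exactly at the mean.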

 

How to implement mean imputation?

The following steps are used to implement the mean imputation procedure:

  1. Choose an imputation method. The choice depends on the data set, and there are many different methods for imputing missing values. Every method aims to assign each missing value a replacement derived from the data set. Mean imputation uses the mean of the observed values as that replacement.
  2. Compute the mean of each variable from its observed (non-missing) values. It is important that this mean is a reasonable estimate of the missing values. If it is not, mean imputation can lead to biased results.
  3. Impute the missing values by replacing each one with the mean of its variable. How to do this with Python is described in the next section.
  4. Return the imputed values to your dataset. You can either overwrite the values of your original dataset or write them to a copy.
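
The steps above can be sketched by hand with pandas alone. This is only one possible way to do it; the small two-column DataFrame and the names column_means and df_imputed are illustrative choices.

```python
import numpy as np
import pandas as pd

# A small data set with one missing value per column
df = pd.DataFrame({
    "f1": [1, 4, 2, 3, 4, 5, 6, np.nan, 5, 4],
    "f2": [np.nan, 2, 2, 5, 4, 8, 5, 3, 6, 5],
})

# Step 2: compute the mean of each column from its observed values
# (pandas skips NaN by default)
column_means = df.mean()

# Steps 3 and 4: fill each gap with the mean of its own column;
# fillna returns a copy, so the original df keeps its NaN values
df_imputed = df.fillna(column_means)

print(column_means)
print(df_imputed)
```

Assigning the result back to df instead of df_imputed would overwrite the original data, which corresponds to the other option in step 4.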

 

How to perform mean imputation with Python?

Let us first import the relevant libraries, initialize our data, and create the dataframe.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

df = pd.DataFrame()

# add two features with 1 missing value each
df['f1'] = [1,4,2,3,4,5,6,np.nan,5,4]
df['f2'] = [np.nan,2,2,5,4,8,5,3,6,5]

print(df)
    f1   f2
0  1.0  NaN
1  4.0  2.0
2  2.0  2.0
3  3.0  5.0
4  4.0  4.0
5  5.0  8.0
6  6.0  5.0
7  NaN  3.0
8  5.0  6.0
9  4.0  5.0

Next, we will use scikit-learn's SimpleImputer to apply the imputation. It is as simple as telling the SimpleImputer object to target the NaN values and use the mean as the replacement value. Keep in mind that the mean is not your only option here: you can use the median or a constant as well.

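# Initialize the imputer by setting which values to impute and the strategy to use
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the dataset (this computes the mean of each column)
mean_imputer = mean_imputer.fit(df)

# Apply the imputation; passing the same DataFrame used in fit
# avoids a feature-name mismatch warning
results = mean_imputer.transform(df)

# Round only for display: the imputed means are about 3.78 (f1) and 4.44 (f2)
results.round()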
array([[1., 4.],
       [4., 2.],
       [2., 2.],
       [3., 5.],
       [4., 4.],
       [5., 8.],
       [6., 5.],
       [4., 3.],
       [5., 6.],
       [4., 5.]])
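
Since transform returns a NumPy array, one way to carry out the last step (returning the imputed values to a dataset) is to wrap the array in a new DataFrame. The sketch below rebuilds the example so that it runs on its own; the name df_imputed is just illustrative. It also prints statistics_, the fitted attribute in which SimpleImputer stores the fill value for each column, which shows the unrounded means.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "f1": [1, 4, 2, 3, 4, 5, 6, np.nan, 5, 4],
    "f2": [np.nan, 2, 2, 5, 4, 8, 5, 3, 6, 5],
})

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
results = mean_imputer.fit_transform(df)

# The fill value learned for each column (the unrounded column means)
print(mean_imputer.statistics_)

# Wrap the array in a DataFrame that reuses the original labels;
# df itself is left untouched
df_imputed = pd.DataFrame(results, columns=df.columns, index=df.index)
print(df_imputed)
```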

SimpleImputer does not only work on numerical values; it works on categorical values as well. You just need to set the strategy to either most_frequent or constant. Similarly, you can use the imputer not only on dataframes, but on NumPy arrays and sparse matrices as well.

import pandas as pd
df = pd.DataFrame([["Alfred", "John"],
                   [np.nan, "Steven"],
                   ["Alfred", np.nan],
                   ["Henry", "Steven"]], dtype="category")

# Replace each missing value with the most frequent value of its column
most_frequent_imputer = SimpleImputer(strategy="most_frequent")
print(most_frequent_imputer.fit_transform(df))
[['Alfred' 'John']
 ['Alfred' 'Steven']
 ['Alfred' 'Steven']
 ['Henry' 'Steven']]
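
The other numeric strategies mentioned earlier follow the same pattern. Here is a short sketch using the median and a constant on the numeric example data; the name df_num and the fill value of 0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_num = pd.DataFrame({
    "f1": [1, 4, 2, 3, 4, 5, 6, np.nan, 5, 4],
    "f2": [np.nan, 2, 2, 5, 4, 8, 5, 3, 6, 5],
})

# Median: less sensitive to outliers and skew than the mean
median_imputer = SimpleImputer(strategy="median")
median_results = median_imputer.fit_transform(df_num)

# Constant: every missing value becomes fill_value (0 here, chosen arbitrarily)
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
constant_results = constant_imputer.fit_transform(df_num)

# Show only the two rows that originally had a gap
print(median_results[[0, 7]])
print(constant_results[[0, 7]])
```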

 

From these two examples, you can see that using sklearn is fairly intuitive. You just need to choose your imputation strategy, fit the imputer on your dataset, and transform said dataset. You can find a full list of the parameters you can use for the SimpleImputer in the scikit-learn documentation.

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

 

Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com  or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam. 

 
