How to Standardize Your Data? [Data Standardization with Python]

 

Data standardization is a preprocessing step that you perform before training your model. In many projects, standardizing some of your features can improve model performance. In this article, I will show you how to standardize your data using Python.

 

What is data standardization?

The idea behind standardization is to rescale a feature (column or variable) so that it has a mean of 0 and unit variance. It is usually represented by the formula Z = \frac{x_{i} - \mu}{s}, where:

– Z is the standardized value

– μ is the mean of the training samples (treated as 0 if you set with_mean=False)

– s is the standard deviation of the training samples (treated as 1 if you set with_std=False)
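To make the formula concrete, here is a minimal NumPy sketch that applies it by hand (the sample values are arbitrary placeholders); StandardScaler performs exactly this computation for you:

import numpy as np

# Arbitrary sample values for illustration
x = np.array([-1000.5, -82.1, 0.0, 100.0, 900.9])

mu = x.mean()      # mean of the sample
s = x.std()        # standard deviation (population form, as StandardScaler uses)

z = (x - mu) / s   # Z = (x_i - mu) / s
print(z)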

 

How to standardize your data with Python

Standardization is a simple task to perform in Python; you can do it in a couple of lines of code.

 

First, let’s import the required libraries. For this task, you will need “sklearn.preprocessing”, the scikit-learn module that contains most of the preprocessing functions you may need in your projects. Additionally, let’s set up a simple example.

from sklearn.preprocessing import StandardScaler
import numpy as np

 

Next, let’s create some sample data, initialize the scaler, and apply it to the feature using the fit_transform method.

# Set up sample data
data = np.array([[-1000.5],
                 [-82.1],
                 [0],
                 [100],
                 [900.9]])

# Initialize the scaler
scaler = StandardScaler()

# Apply the transformation
standardized = scaler.fit_transform(data)

 

And that is it! You can print out your standardized values:

standardized
array([[-1.62764418],
       [-0.10875659],
       [ 0.02702376],
       [ 0.19240786],
       [ 1.51696914]])
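As a quick sanity check (a small sketch reusing the scaler and standardized objects from above), the rescaled feature should now have a mean of roughly 0 and a standard deviation of 1, and the scaler stores the original statistics in its mean_ and scale_ attributes:

print(standardized.mean())   # ~0.0 (up to floating-point error)
print(standardized.std())    # 1.0
print(scaler.mean_)          # mean of the original feature
print(scaler.scale_)         # standard deviation of the original feature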

 

Remember, standardization works best if the underlying feature (variable/column) follows a normal (Gaussian) distribution. It is quite useful if you are using linear models such as logistic regression, linear regression, or discriminant analysis.
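One practical note: in a real project you would fit the scaler on the training samples only and reuse those statistics on the test set. Here is a minimal sketch, using synthetic data from make_classification (the dataset and the choice of logistic regression are just illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for your own feature matrix and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only,
# then applies the same mean/scale when scoring the test split
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))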

 

Why do you need to standardize your dataset?

Standardization is a very common task that you will perform as a data analyst. Many models require your data to be rescaled to perform at their best. Data standardization transforms your features onto a uniform scale, which makes them easier to analyze and may improve your model’s performance.

 

The idea behind it is simple. Let’s say we use age and salary as variables for our model. Age might range from 18 to 60, while salary could range from $20k to $250k. These large differences in magnitude can hurt the performance of scale-sensitive models (PCA, KNN, SVM), as the sketch below shows. Another perk of standardizing is that it may make your model train slightly faster, since the values are smaller than the originals.
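To see this in code, here is a minimal sketch with made-up age and salary rows; after fit_transform, both columns sit on a comparable scale:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows to illustrate the scale gap
people = np.array([[18, 20_000.0],
                   [35, 65_000.0],
                   [47, 120_000.0],
                   [60, 250_000.0]])

scaled = StandardScaler().fit_transform(people)
print(scaled)  # each column now has mean 0 and unit variance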

 

To accurately analyze the data, it is important to account for the variability among variables measured on different scales. If we don’t, we risk introducing bias into our analysis. Hence the need for data standardization.

 

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

 

Feel free to use any information from this page. I’d appreciate it if you could simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.

 
