How to Standardize Your Data? [Data Standardization with Python]

 

Data standardization is a preprocessing step that you perform before training your model. In many projects, standardizing some of your features can improve model performance. In this article, I will show you how to standardize your data using Python.

 

What is data standardization?

The idea behind standardization is to rescale a feature (column or variable) so that it has a mean of 0 and unit variance. It is usually represented by the formula Z = \frac{x_{i} - \mu}{s}, where:

– Z is the standardized value

– μ is the mean of the training samples (treated as 0 if you set with_mean=False)

– s is the standard deviation of the training samples (treated as 1 if you set with_std=False)
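To make the formula concrete, here is a minimal NumPy sketch that applies it by hand (the sample values are arbitrary placeholders); StandardScaler performs exactly this computation for you:

import numpy as np

# Arbitrary sample values for illustration
x = np.array([-1000.5, -82.1, 0.0, 100.0, 900.9])

mu = x.mean()      # mean of the sample
s = x.std()        # standard deviation (population form, as StandardScaler uses)

z = (x - mu) / s   # Z = (x_i - mu) / s
print(z)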

 

How to standardize your data with Python

Standardization is a simple task to perform in Python; you can do it in a couple of lines of code.

 

First, let’s import the required libraries. For this task, you will need “sklearn.preprocessing”, the scikit-learn module that contains most of the preprocessing functions you may need in your projects. Additionally, let’s set up a simple example.

from sklearn.preprocessing import StandardScaler
import numpy as np

 

Next, let’s create some sample data, initialize the scaler, and apply it to the feature using the fit_transform method.

# Set up sample data
data = np.array([[-1000.5],
                 [-82.1],
                 [0],
                 [100],
                 [900.9]])

# Initialize the scaler
scaler = StandardScaler()

# Apply the transformation
standardized = scaler.fit_transform(data)

 

And that is it! You can print out your standardized values:

standardized
array([[-1.62764418],
       [-0.10875659],
       [ 0.02702376],
       [ 0.19240786],
       [ 1.51696914]])
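As a quick sanity check (a small sketch reusing the scaler and standardized objects from above), the rescaled feature should now have a mean of roughly 0 and a standard deviation of 1, and the scaler stores the original statistics in its mean_ and scale_ attributes:

print(standardized.mean())   # ~0.0 (up to floating-point error)
print(standardized.std())    # 1.0
print(scaler.mean_)          # mean of the original feature
print(scaler.scale_)         # standard deviation of the original feature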

 

Remember, standardization works best if the underlying feature (variable/column) follows a normal (Gaussian) distribution. It is quite useful if you are using linear models such as logistic regression, linear regression, or discriminant analysis.
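One practical note: in a real project you would fit the scaler on the training samples only and reuse those statistics on the test set. Here is a minimal sketch, using synthetic data from make_classification (the dataset and the choice of logistic regression are just illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for your own feature matrix and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training split only,
# then applies the same mean/scale when scoring the test split
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))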

 

Why do you need to standardize your dataset?

Standardization is a very common task that you will perform as a data analyst. Many models require your data to be rescaled to perform at their best. Data standardization transforms your features onto a uniform scale, which makes them easier to analyze and may improve your model’s performance.

 

The idea behind it is simple. Let’s say we use age and salary as variables for our model. Age might range from 18 to 60, while salary could range from $20k to $250k. These large differences in magnitude can hurt the performance of scale-sensitive models (PCA, KNN, SVM), as the sketch below shows. Another perk of standardizing is that it may make your model train slightly faster, since the values are smaller than the originals.
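To see this in code, here is a minimal sketch with made-up age and salary rows; after fit_transform, both columns sit on a comparable scale:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows to illustrate the scale gap
people = np.array([[18, 20_000.0],
                   [35, 65_000.0],
                   [47, 120_000.0],
                   [60, 250_000.0]])

scaled = StandardScaler().fit_transform(people)
print(scaled)  # each column now has mean 0 and unit variance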

 

To accurately analyze the data, it is important to account for the variability among variables measured on different scales. If we don’t, we risk introducing bias into our analysis. Hence the need for data standardization.

 

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

 

Feel free to use any information from this page. I’d appreciate it if you could simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.

 
