How to Use Scikit-learn in a Machine Learning Project for Beginners [Sklearn Tutorial]

Sklearn, or scikit-learn, is one of the most popular and widely used libraries for machine learning. This article is a sklearn tutorial that walks through a quick, basic machine learning project. 

 

Sklearn is an excellent Python library containing machine learning, data preprocessing, and data transformation algorithms and tools. Using a basic machine learning project, I will give an overview of how to use some of the tools in this package.

 

Importing data using Sklearn (Scikit Learn tutorial)

The first step of any machine learning problem is importing a dataset. In that regard, sklearn has a number of practice datasets available within the library. Those datasets are processed in a way that lets you practice different machine learning techniques, ranging from regression and classification to clustering, image classification, and natural language processing. 

 

For our Scikit learn tutorial, let’s import the Boston dataset, a famous dataset used for regression. This dataset has 13 attributes (columns) that should help predict the prices of houses in the city of Boston. This dataset is a good start for you if you plan to apply data science/machine learning techniques in Real Estate. To import the Boston dataset, we will use the function load_boston. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the snippets below require an older release.

 

import numpy as np
from sklearn.datasets import load_boston
# Import the boston dataset from sklearn
boston = load_boston()
# X holds the features, y holds the target (house prices)
X,y = boston.data, boston.target
print('DATASET SIZE:\n',
      X.shape, 'features (samples, columns) \n', 
      y.shape, 'target values' )
DATASET SIZE:
 (506, 13) features (samples, columns) 
 (506,) target values

 

In the code snippet above, we import the dataset and save it as X and y, where X holds the attributes (or features or columns) and y is the target variable (house price). 

 

Note: Sklearn returns the data as a Bunch, a dictionary-like object holding NumPy arrays, not as a dataframe.
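To make the note above concrete, here is a minimal sketch of what a scikit-learn loader returns. It uses load_diabetes purely as a stand-in (an assumption for illustration, not the dataset used in this tutorial) because it ships with current versions of the library.

```python
from sklearn.datasets import load_diabetes

# load_diabetes is only a stand-in here, not the tutorial's dataset
data = load_diabetes()

# The returned Bunch behaves like a dictionary...
print(sorted(data.keys()))

# ...and also exposes its entries as attributes
X_demo, y_demo = data.data, data.target
print(type(X_demo).__name__, X_demo.shape)  # a NumPy array, not a DataFrame
print(data.feature_names)
```

Because the features arrive as a plain array, the column names have to be attached separately when building a DataFrame.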

import pandas as pd

# Create the DataFrame and 
# assign the feature names as columns names
df=pd.DataFrame(X, columns=boston.feature_names)
df.describe()

CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000

 

With the first snippet, we have an idea of how much data we are working with and its shape. We can understand the dataset a bit further by extracting the descriptive statistics of its features. Examples of descriptive stats are mean, std, min, max, etc. To extract the descriptive statistics from the current dataset, we first need to convert it to a dataframe. After the conversion, we run the pandas built-in describe() method to extract the general stats.

 

Looking at the stats, we can clearly see that the various features (aka columns, variables) have different scales, so we may need to normalize them.

 

Splitting data to Train/Test sets with Scikit learn tutorial

Splitting data into train and test sets is the next natural step in any machine learning project, including this Scikit learn tutorial. On the one hand, you create a “sub-dataset” that your machine learning model will learn from so that it can generalize and make predictions. This dataset is called the “train” set. On the other hand, you create a test dataset to evaluate the performance of your model. To split the dataset into train/test sets, we will use the sklearn function train_test_split. I have written a quick guide on splitting your dataset to train, test, and validate available here.

 

from sklearn.model_selection import train_test_split
# Fraction of the data held out for the test set (25%)
test_set_size=0.25
# Split the data; shuffling reorders the rows at random before splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_set_size,
                                                    shuffle=True, random_state=8)
print('Size of the training set:', X_train.shape[0])
print('Size of the test set:', X_test.shape[0])
Size of the training set: 379
Size of the test set: 127
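The tutorial mentions splitting into train, test, and validation sets. One common way to get three sets, sketched here on made-up toy data, is to call train_test_split twice. The variable names and the 60/20/20 proportions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features (made up for illustration)
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.arange(100)

# First split: hold out 20% of the data as the final test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.20, shuffle=True, random_state=8)

# Second split: carve a validation set out of the remaining 80%
# (0.25 of 80 samples = 20 samples)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=True, random_state=8)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 60 20 20
```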

 

Standardize your data with sklearn

It is a good habit to standardize (normalize or scale) your dataset before running a model. It matters most when the features vary across different scales, but even otherwise it is usually a sound choice. In our Scikit learn tutorial, the quick exploration showed a lot of variability in the scales of our features, so we will standardize them.

 

The foremost reason why standardization is important is that many machine learning algorithms are not scale-invariant, including the Multi-Layer Perceptron used later in this Scikit learn tutorial. Variables with different scales affect the update rule differently. For example, standardizing the data matters a great deal in gradient descent.

 

\theta_j := \theta_j - \alpha \frac{1}{m} \sum^{m}_{i=1}(h_{\theta}(x^{(i)}) - y^{(i)})x^{(i)}_{j}

 

Looking at the gradient descent formula, the feature value x multiplies the update, so it affects the step size. Consequently, features on different scales produce different step sizes for each parameter. Standardizing the data lets all the parameters move smoothly towards the minimum at a similar rate. Furthermore, any machine learning algorithm that is commonly trained with gradient descent (linear models, neural networks, CNNs, SVMs, etc.) could benefit from standardized data. To scale/standardize the data in our project, we can use the StandardScaler or MinMaxScaler.
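To see the effect of feature scale on the update rule, here is a small NumPy sketch. It evaluates the gradient term of the formula once, at theta = 0, for a linear hypothesis on synthetic data. The data, the helper name gradient, and the scales 1 and 1000 are all assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x1 = rng.normal(0, 1, m)       # feature on a small scale
x2 = rng.normal(0, 1000, m)    # feature on a much larger scale
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + 0.002 * x2 + rng.normal(0, 0.1, m)

def gradient(X, y, theta):
    """(1/m) * sum((h(x) - y) * x_j) for a linear hypothesis h(x) = X @ theta."""
    residual = X @ theta - y
    return X.T @ residual / len(y)

theta0 = np.zeros(2)

# Raw features: the large-scale feature dominates the gradient
g_raw = gradient(X_demo, y_demo, theta0)

# Standardized features: both components are of comparable size
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
g_std = gradient(X_std, y_demo, theta0)

print('Raw gradient:         ', g_raw)
print('Standardized gradient:', g_std)
```

With the raw features, a single learning rate is either too large for the second parameter or too small for the first, which is why the unscaled problem converges slowly.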

 

As a refresher, here is how to standardize and scale your dataset.

 

For every element x_i:

 

Standardization: \frac{x_i - \mu_i}{\sigma_i}

Scaling: \frac{x_i - m_i}{M_i - m_i}

 

Where

M_i = \max(x_i), \quad m_i = \min(x_i)
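As a quick check of these two transformations, the sketch below computes them by hand with NumPy on a tiny made-up array and compares the results with StandardScaler and MinMaxScaler. Min-max scaling is computed here as (x - min) / (max - min), which maps each column to the range [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny made-up array: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Standardization: subtract the column mean, divide by the column std
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
X_standardized = (X_demo - mu) / sigma

# Min-max scaling: subtract the column min, divide by the column range
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_scaled = (X_demo - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_standardized, StandardScaler().fit_transform(X_demo)))
print(np.allclose(X_scaled, MinMaxScaler().fit_transform(X_demo)))
```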

 

 

Note: Standardization does not guarantee a global minimum; it just speeds up the convergence of gradient descent. 

 

Additionally, when scaling or standardizing data, we fit the mean (mu) and standard deviation (sigma) on the train set only and then apply that same transformation to the test set. Here’s why:

  1. Fitting the transformation on the test set (or on the full dataset) would be cheating, since it leaks information from the test set into your training data.
  2. We assume that the test set is NEVER available during the training phase. Essentially, you should not touch the test set until the training is done.

(It seems a bit obvious, but I’ve seen many suspiciously perfect predictions caused by accidentally touching the test set.)

 

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the train set, then transform it
X_train = scaler.fit_transform(X_train) 

# Apply the already-fitted scaler to the test set
X_test = scaler.transform(X_test) 

# To standardize the target value, use a separate scaler (uncomment)
# y_scaler = StandardScaler()
# y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).flatten() 
# y_test = y_scaler.transform(y_test.reshape(-1, 1)).flatten() 

 

Note that there is a difference between fit_transform and transform:

  • fit_transform first computes the mean (mu) and the standard deviation (sigma) of the train set and then applies the transformation to the train set.
  • On the other hand, transform standardizes a dataset using the mean and the standard deviation already computed on the training set by the fit_transform function.
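One way to make the fit-on-train, transform-on-test pattern harder to get wrong is to bundle the scaler and the model in a pipeline. The sketch below uses make_pipeline on synthetic data from make_regression; the data and variable names are assumptions for illustration, not part of the tutorial's project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=8)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=8)

# fit() fits the scaler on X_tr only, then fits the model on the scaled X_tr
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() reuses the training mean/std to transform X_te before predicting
print('Test R^2:', pipe.score(X_te, y_te))

# The scaler inside the pipeline only ever saw the training data
fitted_scaler = pipe.named_steps['standardscaler']
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))
```

The test set is never passed to fit, so its statistics cannot leak into the scaler.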

 

Applying linear regression to our dataset using Scikit learn (tutorial)

The data is now ready, so we can apply models to it. For a start, let’s fit the most straightforward model, which is a linear regression. You can think of regression as a “line of best fit.” In other words, it is the line that minimizes the mean squared error.

 

That is why it is sometimes called the “least squares line” or regression line. Anyway, I am sure you have heard of linear regression before. We can easily apply linear regression to our dataset using the LinearRegression class.

 

from sklearn.linear_model import LinearRegression

# Initialize the LR model
LR = LinearRegression()
# Fit the regression model
LR.fit(X_train, y_train)

 

We can check the performance of our model by using the Mean Squared Error (MSE) and the R2 score. As a reminder, here’s how you can calculate the mean squared error and the R2 score.

 

MSE = \frac{1}{P}\lVert y_{true} - \hat{y}_{pred}\rVert^2

R^2 = 1 - \frac{\lVert y_{true} - \hat{y}_{pred}\rVert^2}{\lVert y_{true} - \bar{y}\rVert^2}

Where

\bar{y} = \frac{1}{P}\sum^P_{i=1}y_i

is the mean of the true values, P is the number of samples, and

\hat{y}_{pred}

are the predictions of the model.

 

Sklearn already offers implementations of the MSE and the R2 score, so there is no need to code those formulas this time.
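Still, as a sanity check, it can help to compute both metrics by hand once and compare them with the library. The sketch below does that with NumPy on four made-up values, using mean_squared_error and r2_score from sklearn.metrics for the comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual)  # 0.375
print(np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```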

 

from sklearn.metrics import mean_squared_error as mse

#Training Error
LR_Train_Pred = LR.predict(X_train) 
LR_Train_Error= mse(y_train, LR_Train_Pred)
print('Training Error:',LR_Train_Error)

#Test Error
LR_Test_Pred =  LR.predict(X_test)
LR_Test_Error= mse(y_test, LR_Test_Pred)
print('Test Error:', LR_Test_Error)

# Evaluate the R^2 (note: computed on the train set here)
LR_R2_Score = LR.score(X_train, y_train) #returns the R^2 score
print('R^2 linear regression:',LR_R2_Score)
Training Error: 21.886636808836805
Test Error: 22.684244438022347
R^2  linear regression: 0.7444819832106442

 

If you look at the results, we have an R2 score of about 0.74 on the train set, well below 0.9, which is not looking great for linear regression. With some tuning (for example, feature engineering or a regularized variant), we might raise the R2 score a little, perhaps by about 0.05. But, generally, linear regression may not be the best model for this dataset. Let us try with a Neural Network.

 

Multi-Layer Perceptron for Regression using Scikit learn (tutorial)

Since the linear model seems too simple to capture the shape of the data, we can use a Multi-Layer Perceptron to solve this regression problem. To do so, we will use the MLPRegressor class from the sklearn library.

 

from sklearn.neural_network import MLPRegressor

# Initialize the MLP and their hyperparameters
MLP=MLPRegressor( hidden_layer_sizes=117, activation='tanh', solver='lbfgs', 
                 alpha=1e-3, random_state=8, max_iter=10000)

#fit the training data
MLP.fit(X_train,y_train)

 

Initializing and fitting your model is as simple as that! Now let’s compare the performance of the Multi-Layer Perceptron to that of linear regression.

 

# Evaluate the training error
MLP_Train_Pred =  MLP.predict(X_train)
MLP_Train_Error= mse(y_train, MLP_Train_Pred)
print('Training Error:', MLP_Train_Error)

# Evaluate the test error
MLP_Test_Pred = MLP.predict(X_test)
MLP_Test_Error =  mse(y_test, MLP_Test_Pred)
print('Test Error:', MLP_Test_Error)

# R^2 score (note: computed on the train set here)
MLP_R2_Score =MLP.score(X_train, y_train) #returns the R^2 score
print('R^2 MLP:',MLP_R2_Score)
Training Error: 7.150314984488708e-06
Test Error: 18.99646097809624
R^2 MLP: 0.999999916522839

 

Fantastic! The MLP seems to be performing almost flawlessly. As seen in the results, the MLP’s training error and test error are both smaller than those of the linear regression. 

 

The result seems to look great. You can send this to production, and you will be the office hero! Or so you may think. However, there is a pretty fundamental problem in the results: the difference between the train and test error is relatively high. In other words, the model learned the training data so well that it predicts it almost perfectly.

 

However, when we introduced the new points in the test set, its predictions were much less accurate. When a model has a great score on the train set but a much worse score on the test set, this is a sign of overfitting. It means that the model fit the training data points so closely that it was unable to generalize. Check the figure below for an illustration of overfitting a model.

 

Scikit learn tutorial - Overfitting

 

What may have caused the overfitting? Arbitrarily fixing the number of neurons to 117 could be one reason; setting the regularization strength alpha to 0.001 or not using early stopping could be others. Whatever the cause, one way to remedy this situation is hyperparameter tuning.
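The train/test gap that signals overfitting is easy to reproduce on a small example. The sketch below deliberately swaps in a different model, an unconstrained DecisionTreeRegressor, because it memorizes its training data deterministically and runs instantly. The synthetic sine data and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data, for illustration only
rng = np.random.default_rng(8)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=8)

# A tree with no depth limit can memorize every training point
tree = DecisionTreeRegressor(random_state=8).fit(X_tr, y_tr)

train_mse = mean_squared_error(y_tr, tree.predict(X_tr))
test_mse = mean_squared_error(y_te, tree.predict(X_te))
print('Train MSE:', train_mse)
print('Test MSE: ', test_mse)
print('Gap:      ', test_mse - train_mse)
```

A near-zero training error next to a clearly larger test error is the same pattern seen with the MLP above: the model has fit the noise in the training set rather than the underlying trend.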

 

Hyperparameter tuning with Gridsearch in Sklearn

Hyperparameters are very important to the performance of an MLP. If you don’t know what they are, they are the settings that you choose when initializing a model (in our case, for example, the hidden layer size). You can find the list of all possible hyperparameters for a Multi-Layer Perceptron here.  

 

To find the best possible hyperparameter configuration, we can use grid search, again from scikit-learn (sklearn). To perform a grid search with GridSearchCV, you need to provide the “grid,” which holds the parameter values you want to test, and the number of folds for the K-Fold cross-validation (how many train/validation splits each candidate is evaluated on). 

 

from sklearn.model_selection import GridSearchCV

# Create the model to be tested, leaving out the parameter to be tuned
grid_search_MLP=MLPRegressor( activation='tanh', solver='lbfgs', alpha=0.001,
                             random_state=8, max_iter=10000)

# Create the dictionary of model parameters to be tested
max_neurons=150 #maximum number of neurons
# Dictionary containing an array that goes from 70 to 150 in steps of 5
params={'hidden_layer_sizes':np.arange(70,max_neurons+1,5)} 
print(params)

# Choose the number of folds for the K-fold cross-validation
CV=3
print("This model will be running with a \n")
print(str(CV)+'-fold crossvalidation')
{'hidden_layer_sizes': array([ 70,  75,  80,  85,  90,  95, 100, 105, 110, 115, 120, 125, 130,
       135, 140, 145, 150])}
3-fold crossvalidation

 

Now we can add the grid and run it.

grid = GridSearchCV(grid_search_MLP, params, scoring='neg_mean_squared_error', cv=CV, 
                    n_jobs=-1, return_train_score=True, verbose =1)

grid.fit(X_train, y_train);
Fitting 3 folds for each of 17 candidates, totalling 51 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   15.7s
[Parallel(n_jobs=-1)]: Done  51 out of  51 | elapsed:   32.9s finished

 

One thing to remember: throughout this project, we have used the MSE as our scoring metric. You can use any metric that is appropriate for your project, as long as it helps you better understand the model you are running. You can find a list of metrics here.

 

Additionally, in some cases, GridSearch can be time-consuming. You can run GridSearch in parallel on multiple processors within your computer by setting the number of processors to use in n_jobs. Setting n_jobs = -1 will make use of all your processors.

 

Let’s check the results now. 

print('Best Network is:',grid.best_params_)
Best Network is: {'hidden_layer_sizes': 100}

 

Sometimes (usually when things do not go your way), you will want to check how your model performed in the different folds. To do so, print out the results using this command. 

 

grid.cv_results_
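The cv_results_ attribute is a dictionary of arrays with one entry per candidate, which can be hard to read when printed raw. The sketch below shows one way to walk through it. It uses a small Ridge grid on synthetic data only because that runs in a fraction of a second; the model, data, and alpha values are assumptions for illustration, while the dictionary keys are the same ones the tutorial's grid would expose.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data and a Ridge model, chosen only because they run quickly
X_demo, y_demo = make_regression(n_samples=150, n_features=5,
                                 noise=15.0, random_state=8)
search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                      scoring='neg_mean_squared_error', cv=3,
                      return_train_score=True)
search.fit(X_demo, y_demo)

results = search.cv_results_
# One entry per candidate: parameters, mean CV error, and rank (1 = best)
for p, score, rank in zip(results['params'],
                          results['mean_test_score'],
                          results['rank_test_score']):
    print(p, 'CV MSE: %.2f' % -score, 'rank:', rank)

# Per-fold scores live under split0_test_score, split1_test_score, ...
print(results['split0_test_score'])
```

Looking at the per-fold scores shows whether one candidate is consistently better or only wins on a single lucky fold.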

 

Let’s visualize the performance of the train and test error.

import matplotlib.pyplot as plt
x_axis=params['hidden_layer_sizes']
plt.plot(x_axis,-grid.cv_results_['mean_test_score'],label='Test Error',
         color = "red")
plt.plot(x_axis,-grid.cv_results_['mean_train_score'], label='Train Error')
plt.legend()
plt.ylabel('Objective function value')
plt.xlabel('Number of neurons')
plt.show()
best_index=grid.best_index_+1

Sklearn Tutorial- Machine Learning Project - Results

 

As you can see above, as the number of neurons increases, the test error keeps zig-zagging while the train error keeps decreasing (plotting the test error on its own makes this easier to see). That means the model fit the training data better and better as the number of neurons increased. However, the test error did not follow the same shape. The test error was lowest at 100 neurons (the lowest point in the red curve); we care most about how the model performs on unseen data, that is, the test error. The graph illustrates why GridSearch selected 100 neurons, the seventh candidate in the grid, as the best result.

 

Finally, let’s train our model again using the best hyperparameter configuration found during GridSearch.

 

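# best_estimator_ is the model with the best parameters; with the default
# refit=True it was already refit on the train set, so this fit repeats that step
MLP_BT=grid.best_estimator_ 
MLP_BT.fit(X_train,y_train)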

# Print Train Error
MLP_Best_Train_Pred = MLP_BT.predict(X_train)
MLB_Best_Train_Error= mse(y_train, MLP_Best_Train_Pred)
print('Training Error:',MLB_Best_Train_Error)

# Print Test Error
MLP_Best_Test_Pred = MLP_BT.predict(X_test)
MLB_Best_Test_Error= mse(y_test, MLP_Best_Test_Pred)
print('Test Error:', MLB_Best_Test_Error)

# Print R2 Score (note: computed on the train set here)
MLP_Best_R2_Score=MLP_BT.score(X_train, y_train) #returns the R^2 score
print('R^2 returned by the MLP:',MLP_Best_R2_Score)
Training Error: 1.1548515652622358e-05
Test Error: 15.018534198677754
R^2 returned by the MLP: 0.9999998651755478

 

Now that we have the results, let’s compare all the models we have trained so far.

 

print('Training Error Comparison:\n LR %2.3f MLP %2.2f MLP_BEST %2.3f' 
      % (LR_Train_Error,MLP_Train_Error,MLB_Best_Train_Error))
print('Test Error Comparison:\n LR %2.3f MLP %2.3f MLP_BEST %2.3f' 
      % (LR_Test_Error,MLP_Test_Error,MLB_Best_Test_Error))
Training Error Comparison:
 LR 21.887 MLP 0.00 MLP_BEST 0.2341
Test Error Comparison:
 LR 22.684 MLP 18.996 MLP_BEST 15.019

 

As you can see, the training error of the best MLP is higher than that of the first MLP. However, the best MLP generalizes better than the other models on never-seen data. The gap between the train and test error also decreased slightly. Now, this is the less clear-cut part of machine learning and data science. 

 

Do you believe that those results are good enough? Do you want to improve that model on test data? In real-life cases, you will usually have a minimum threshold to reach before your model is good enough to be released (e.g., an accuracy threshold of 80%). There is always something to do; you can try it for yourself. You may get even better results by testing other hyperparameters and checking how they affect the results. Furthermore, you could test out different models and see if they perform better than the MLP.
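As a starting point for trying other models, here is a minimal sketch that scores a few regressors with the same cross-validation setup via cross_val_score. The synthetic data from make_regression, the choice of candidates, and their settings are assumptions for illustration; on your own project you would pass your training data instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=8)

candidates = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'RandomForest': RandomForestRegressor(n_estimators=50, random_state=8),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn reports negated MSE so that higher is always better
    scores = cross_val_score(model, X_demo, y_demo, cv=3,
                             scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()
    print('%-16s CV MSE: %.2f' % (name, cv_mse[name]))
```

Comparing models on cross-validated error, rather than on the test set, keeps the test set untouched for the final evaluation.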

 

To wrap up.

This Scikit learn tutorial is about the most basic and somewhat mechanical machine learning beginner project you could do. Scikit Learn (Sklearn) provides a deep toolset for your various data science projects. The above is a straightforward starter project, and you can reuse the same sklearn Python code snippets to test out different models or to improve and optimize the current one. The modeling phase (hyperparameter tuning) is where your creativity, your understanding of the model, and your math skills come into play to make better models. 

 

Let me know if you have any questions, and if you are new to ML, use the principles in this article to practice your machine learning skills.

 

If you made it this far in the article, thank you very much.

 

I hope this information was of use to you. 

Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com  or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam. 

 
