Splitting data into train, test, and validation sets is a repetitive task. You will need to perform the split every time you run your machine learning models. In this article, I will give you a quick code snippet for doing the split and some insight into why we split data into train, test, and validation sets.
How to split a dataset into train, test, and validation sets with scikit-learn?
The first method is the most common one. It uses the train_test_split function from the scikit-learn (sklearn) package.
Import the libraries
For this split, we will be using pandas and sklearn.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Load a sample data set
We will be using the Iris Dataset.
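# Load the iris dataset and separate the features (X) from the labels (y).
# Note: despite their names, `train` holds the features and `test` the labels.
iris = load_iris()
train = pd.DataFrame(iris.data)
test = pd.DataFrame(iris.target)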
Split the dataset
We can use train_test_split to first split the original dataset into a train set and a test set. Then, to get the validation set, we apply the same function again to the train set.
In the calls below, test_size is the fraction of the input data to use as the test set. The shuffle parameter controls whether the rows are randomly reordered before splitting. Finally, random_state sets the seed of the random number generator used for the shuffle. Setting it makes the experiment easy to reproduce: the same data and the same parameters give the same split on every run.
# Set aside 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(train, test,
    test_size=0.2, shuffle=True, random_state=8)
# Use the same function on the remaining 80% to get the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
    test_size=0.25, random_state=8)  # 0.25 x 0.8 = 0.2
# Print the shape of each set
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("X_val shape: {}".format(X_val.shape))
print("y_train shape: {}".format(y_train.shape))
print("y_test shape: {}".format(y_test.shape))
print("y_val shape: {}".format(y_val.shape))
X_train shape: (90, 4)
X_test shape: (30, 4)
X_val shape: (30, 4)
y_train shape: (90, 1)
y_test shape: (30, 1)
y_val shape: (30, 1)
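The arithmetic in the code comment above (0.25 x 0.8 = 0.2) generalizes: the second call has to use the validation share divided by whatever remains after the test split. Here is a minimal sketch of a helper that does this rescaling. The name train_val_test_split and its parameters are hypothetical and are not part of scikit-learn; it simply chains two train_test_split calls as described above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=8):
    """Hypothetical helper: two chained train_test_split calls."""
    # First call: carve the test set out of the full data.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=True, random_state=random_state
    )
    # Second call: val_size is a share of the ORIGINAL data, so rescale it
    # to a share of what is left (0.2 / 0.8 = 0.25 for a 60/20/20 split).
    relative_val_size = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=relative_val_size, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)
print(len(X_train), len(X_val), len(X_test))
```

With the default 60/20/20 ratios this should give the same set sizes as the two explicit calls above. For other ratios, only val_size and test_size need to change.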
How to split a dataset into train, test, and validation sets with NumPy?
You do not always need to use sklearn to split your dataset. You can use NumPy as well. Here is how it works. Let us import the NumPy library and use the np.split function to create the split.
Data import & split
Similarly, here’s another way to import the data set.
import numpy as np
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/'
    'master/iris.csv')
# One-line split: shuffle the rows, then cut at 60% and 80% of the data
train, validation, test = np.split(iris.sample(frac=1), [int(.6*len(iris)),
    int(.8*len(iris))])
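To see what the two cut points do, here is a small sketch on a plain NumPy array: np.split cuts the sequence at each index in the list, so two indices give three pieces. The second half shows the same 60/20/20 idea written with positional slicing (iloc). That variant is an alternative I am adding for illustration, not the one-liner above; it can be handy if np.split on a DataFrame raises a deprecation warning, which may happen in recent pandas versions. The toy DataFrame and the seed value are made up.

```python
import numpy as np
import pandas as pd

# np.split cuts at the given indices: [0:6], [6:8], [8:]
parts = np.split(np.arange(10), [6, 8])
print([p.tolist() for p in parts])  # [[0, 1, 2, 3, 4, 5], [6, 7], [8, 9]]

# Same 60/20/20 idea on a toy DataFrame, using positional slicing.
df = pd.DataFrame({"feature": np.arange(10), "target": np.arange(10) % 2})
shuffled = df.sample(frac=1, random_state=8)  # fixed seed for reproducibility
first_cut, second_cut = int(0.6 * len(df)), int(0.8 * len(df))

train = shuffled.iloc[:first_cut]
validation = shuffled.iloc[first_cut:second_cut]
test = shuffled.iloc[second_cut:]
print(len(train), len(validation), len(test))  # 6 2 2
```

Passing random_state to sample makes the shuffle repeatable, which the one-line version does not do on its own.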
Data separation to X_train, y_train, X_test, y_test and X_val, y_val
The previous import already had the features and the target separated. In this one, however, we will do the separation ourselves. All you have to do next is build X_train, y_train, and so on from each split. For this dataset, the target variable is the last column, and the features are the first 4. Remember to use the code snippet below only if your dataset has the same layout. Otherwise, you will have to select the target column by name for each y and drop it from each X.
# Assign the train split: all columns but the last are features
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]
# Assign the test split
X_test = test.iloc[:, :-1]
y_test = test.iloc[:, -1]
# Assign the validation split
X_val = validation.iloc[:, :-1]
y_val = validation.iloc[:, -1]
# Print the sets data shapes
print("X_train shape: {}".format(X_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_train shape: {}".format(y_train.shape))
print("y_test shape: {}".format(y_test.shape))
print("X_val shape: {}".format(X_val.shape))
print("y_val shape: {}".format(y_val.shape))
X_train shape: (90, 4)
X_test shape: (30, 4)
y_train shape: (90,)
y_test shape: (30,)
X_val shape: (30, 4)
y_val shape: (30,)
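If the target is not the last column, selecting by position picks the wrong features. Here is a minimal sketch of selecting the target by name instead, on a small made-up DataFrame. The helper name split_features_target is hypothetical; species is the name of the target column in the iris CSV used above.

```python
import pandas as pd


def split_features_target(frame, target):
    """Hypothetical helper: return (X, y) with the target selected by name."""
    X = frame.drop(columns=[target])  # every column except the target
    y = frame[target]                 # the target as a Series
    return X, y


# Toy data with the target deliberately placed first.
toy = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa"],
    "sepal_length": [5.1, 6.3, 4.9],
    "sepal_width": [3.5, 3.3, 3.0],
})

X_toy, y_toy = split_features_target(toy, "species")
print(list(X_toy.columns))  # ['sepal_length', 'sepal_width']
print(y_toy.shape)          # (3,)
```

The same call can be applied to the train, validation, and test frames in turn. Note that a single column selected this way is a Series, so its shape has one dimension.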
If you made it this far in the article, thank you very much.
I hope this information was of use to you.
Feel free to use any information from this page. I’d appreciate it if you can simply link to this article as the source. If you have any additional questions, you can reach out to malick@malicksarr.com or message me on Twitter. If you want more content like this, join my email list to receive the latest articles. I promise I do not spam.
Here, I think it is better to use “label” instead of test:
train = pd.DataFrame(iris.data)
test = pd.DataFrame(iris.target)