I'm learning sklearn and I don't really understand the difference between the four outputs of the function train_test_split(), or why there are four of them.
In the documentation I found some examples, but they weren't enough to clear up my doubts.
Does the code use X_train to predict X_test, or does it use X_train to predict y_test?
What is the difference between train and test? Do I use the train data to predict the test data, or something similar?
I'm very confused about it. Below is the example provided in the documentation.
>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
Below is a dummy pandas.DataFrame to use as an example:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
Here we have three columns: X1, X2, and Y. Suppose X1 and X2 are your independent variables and the Y column is your dependent variable.
X = df[['X1','X2']]
y = df['Y']
With sklearn.model_selection.train_test_split you create four portions of data, which are used for fitting the model and for predicting values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
X_train, X_test, y_train, y_test
Now
1). X_train - This holds the independent variables for the rows used to train the model. Because we specified test_size = 0.4, 60% of the observations in your complete data are used to train/fit the model and the remaining 40% are used to test it.
2). X_test - This is the remaining 40% of the independent variables. These rows are not used in the training phase; the model makes predictions on them so that you can test its accuracy.
3). y_train - This is the dependent variable that the model learns to predict: the category labels for the rows in X_train. You need to pass it when training/fitting the model.
4). y_test - These are the category labels for your test rows. They are compared with the predicted categories to measure the accuracy of the model.
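As a quick sanity check on the four portions described above, the sketch below rebuilds the same dummy DataFrame and inspects what train_test_split returns. The helper names same_train_rows, same_test_rows and no_overlap are made up for this illustration and are not part of sklearn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy data as in the answer
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4: 6 rows go to train, 4 rows go to test
print(X_train.shape, y_train.shape)  # (6, 2) (6,)
print(X_test.shape, y_test.shape)    # (4, 2) (4,)

# Illustrative checks (hypothetical names):
# features and labels of each portion refer to the same original rows
same_train_rows = X_train.index.equals(y_train.index)
same_test_rows = X_test.index.equals(y_test.index)
# no row appears in both the train and the test portion
no_overlap = set(X_train.index).isdisjoint(X_test.index)
print(same_train_rows, same_test_rows, no_overlap)  # True True True
```

So the split is done by rows: each row's features and its label always land in the same portion, which is why the function returns X and y pieces in matching pairs.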
Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)  # This is where the training takes place
y_pred_logreg = logreg.predict(X_test)  # Predict labels for the test rows; y_test is not used here
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train))  # Train accuracy
# Logistic Regression Train accuracy 0.8333333333333334
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg))  # Test accuracy (y_true first, then y_pred)
# Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg))  # Confusion matrix
print(classification_report(y_test, y_pred_logreg))  # Classification report
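To come back to the original question: the model is fitted on X_train together with y_train, it predicts labels from X_test alone, and y_test is only used afterwards to check those predictions. Here is a minimal sketch of that flow, assuming the same dummy data. The names model, manual_accuracy and library_accuracy are chosen for illustration, and max_iter=1000 is my own choice to give the solver more iterations on these unscaled features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X_train, X_test, y_train, y_test = train_test_split(
    df[['X1', 'X2']], df['Y'], test_size=0.4, random_state=42)

# Step 1: learn the relationship between features and labels (train data only)
model = LogisticRegression(max_iter=1000)  # max_iter is an assumption, not from the answer
model.fit(X_train, y_train)

# Step 2: predict labels for rows the model has never seen (features only)
y_pred = model.predict(X_test)

# Step 3: compare the predictions with the true labels that were held back
manual_accuracy = (y_pred == y_test.to_numpy()).mean()
library_accuracy = accuracy_score(y_test, y_pred)
print(manual_accuracy, library_accuracy)  # the two values are the same
```

In other words, accuracy_score is just the fraction of test rows where the predicted label equals the label in y_test.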
You can read more about metrics here
Read more about data split here
Hope this helps:)