python machine-learning scikit-learn sklearn-pandas supervised-learning

What is the difference between X_test, X_train, y_test, y_train in sklearn?


I'm learning sklearn and I don't understand the difference between these very well, or why the function train_test_split() returns 4 outputs.

I found some examples in the documentation, but they weren't enough to clear up my doubts.

Does the code use the X_train to predict the X_test or use the X_train to predict the y_test?

What is the difference between train and test? Do I use train to predict the test or something similar?

I'm very confused about it. Below is the example provided in the documentation.

>>> import numpy as np  
>>> from sklearn.model_selection import train_test_split  
>>> X, y = np.arange(10).reshape((5, 2)), range(5)  
>>> X
array([[0, 1], 
       [2, 3],  
       [4, 5],  
       [6, 7],  
       [8, 9]])  
>>> list(y)  
[0, 1, 2, 3, 4] 
>>> X_train, X_test, y_train, y_test = train_test_split(  
...     X, y, test_size=0.33, random_state=42)  
...  
>>> X_train  
array([[4, 5], 
       [0, 1],  
       [6, 7]])  
>>> y_train  
[2, 0, 3]  
>>> X_test  
array([[2, 3], 
       [8, 9]])  
>>> y_test  
[1, 4]  
>>> train_test_split(y, shuffle=False)  
[[0, 1, 2], [3, 4]]

Solution

  • Below is a dummy pandas.DataFrame to use as an example:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
    
    df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                       'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                       'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
    

    Here we have 3 columns: X1, X2 and Y. Suppose X1 and X2 are your independent variables and the Y column is your dependent variable.

    X = df[['X1','X2']]
    y = df['Y']
    

    With sklearn.model_selection.train_test_split you create 4 portions of the data, which are used for fitting the model and for predicting values.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
    
    X_train, X_test, y_train, y_test
    

    Now

    1). X_train - This holds the independent variables for the training rows and is used to train the model. Because we specified test_size = 0.4, 60% of the observations in your complete data are used to train/fit the model and the remaining 40% are used to test it.

    2). X_test - This is the remaining 40% of the independent variables. It is not used in the training phase; the model makes predictions on it so that you can test the model's accuracy.

    3). y_train - This is the dependent variable that the model has to learn to predict: the category labels that go with the rows in X_train. It must be passed together with X_train when training/fitting the model.

    4). y_test - This holds the category labels for your test data. These labels are compared with the model's predictions to measure how well the predicted categories match the actual ones.
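
To see what the four portions contain, here is a minimal sketch that assumes the same dummy DataFrame as above. It only checks the sizes, confirms that X_train and y_train (and likewise X_test and y_test) refer to the same rows, and that no row ends up in both portions. The names same_train_rows, same_test_rows and overlap are made up for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy data as in the answer above
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4 -> 6 training rows and 4 test rows
print(X_train.shape, X_test.shape)   # (6, 2) (4, 2)
print(y_train.shape, y_test.shape)   # (6,) (4,)

# Features and labels are split on the same rows, so their indexes line up
same_train_rows = X_train.index.equals(y_train.index)
same_test_rows = X_test.index.equals(y_test.index)

# No row appears in both the training and the test portion
overlap = set(X_train.index) & set(X_test.index)
print(same_train_rows, same_test_rows, overlap)
```

In other words, X_* and y_* are the same rows seen as "inputs" and "answers", while *_train and *_test are different rows.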

    Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:

    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)  # This is where the training takes place
    y_pred_logreg = logreg.predict(X_test)  # Predict on the test data to evaluate the model
    print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train))  # Train accuracy
    # Logistic Regression Train accuracy 0.8333333333333334
    print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg))  # Test accuracy (true labels first, then predictions)
    # Logistic Regression Test accuracy 0.5
    print(confusion_matrix(y_test, y_pred_logreg)) #Confusion matrix
    print(classification_report(y_test, y_pred_logreg)) #Classification Report
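
To answer the original question directly: the model is fitted on X_train together with y_train, then it predicts labels for X_test, and those predictions are compared with y_test. The sketch below, again assuming the same dummy data, computes the test accuracy by hand to show that it is just the fraction of test rows where the prediction equals the true label. The names model, manual_accuracy and library_accuracy are illustrative, and no output values are shown because they can depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dummy data and split as in the answer above
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)       # the model learns only from the training rows
y_pred = model.predict(X_test)    # it then predicts labels for rows it never saw

# Accuracy = fraction of test rows where the prediction equals the true label
manual_accuracy = np.mean(y_pred == y_test.to_numpy())
library_accuracy = accuracy_score(y_test, y_pred)
print(manual_accuracy, library_accuracy)
```

Note that y_test is never passed to fit or predict; it is only used afterwards to score the predictions.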
    

    You can read more about metrics here

    Read more about data split here

    Hope this helps:)