python machine-learning scikit-learn sklearn-pandas

How to choose data columns and target columns in a dataframe for train_test_split?


I'm trying to set up a train_test_split with data I have read from a CSV into a pandas dataframe. The book I am reading says I should separate the data into X_train as the data and y_train as the target, but how can I define which column is the target and which columns are the data? So far I have the following:

import pandas as pd
from sklearn.model_selection import train_test_split
Data = pd.read_csv("Data.csv")

I have read that the split is done in the following way. However, that example used a Bunch object in which the data and target were already defined:

X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],
                                                    iris_dataset['target'], random_state=0)

Solution

  • You can select the columns by name, dropping the target column to get the data:

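    Data = pd.read_csv("Data.csv")
    # Features: every column except the target, as a NumPy array
    X = Data.drop(['name of the target column'], axis=1).values
    # Target: only the target column, as a NumPy array
    y = Data['name of the target column'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)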
    

    The target variable is often stored as the last column of the data set. If that is true for your file, you can also select the columns by position:

    Data = pd.read_csv("Data.csv")
    X = Data.iloc[:, :-1]  # all columns except the last
    y = Data.iloc[:, -1]   # the last column
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
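
For a version you can run without a CSV file, here is a minimal sketch of the select-by-name approach. The toy dataframe, its column names (age, nights, trip_type) and the target_column variable are made up for illustration and only stand in for whatever Data.csv really contains.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for Data.csv; the column names are invented.
Data = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 44, 60],
    "nights": [2, 5, 1, 7, 3, 4, 2, 6],
    "trip_type": ["business", "leisure", "business", "leisure",
                  "business", "leisure", "business", "leisure"],
})

target_column = "trip_type"  # assumed name of the target column

X = Data.drop(columns=[target_column])  # every column except the target
y = Data[target_column]                 # the target column only

# With the default test_size of 0.25, 8 rows are split into 6 train and 2 test.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Here X and y are kept as pandas objects rather than converted with .values, so the column names and row index survive the split. train_test_split accepts either form.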