Tags: python-3.x, pandas, dataframe, machine-learning, train-test-split

Train_Test_Split based on specific column values


My question is rather close to "How to create a train_test_split based on a conditional in python", but I am looking for a better solution.

I have a pandas dataframe on which I would typically use the train_test_split function:

X_train, X_test, y_train, y_test = train_test_split(data[xvars], data[yvar], train_size=0.98, random_state=42)

However, I would like to split based on my pandas column called week: rows where week < 51 should form the train set, and rows where week >= 51 should form the test set. How can I achieve this efficiently?

Thanks.


Solution

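  • First I sorted the dataframe by week, and then applied the solution stated in the docs, with shuffle set to False and stratify set to None.

    With shuffling disabled, train_test_split keeps the rows in their existing order, so the first rows become the train set and the last rows become the test set. Note that test_size has to match the share of rows with week >= 51 for the cut to fall exactly on the week boundary: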

    X_train, X_test, y_train, y_test = train_test_split(X,Y, shuffle=False, test_size=0.4, stratify=None)
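
For reference, here is a minimal sketch of two ways to get the week-based split. The toy dataframe and the column names x1 and y are made up for illustration; only week, xvars, yvar and data come from the question. The first option uses a plain boolean mask and needs no sorting. The second follows the sort-then-split idea above, but passes test_size as an integer row count (which train_test_split accepts) so that the cut lands on the week boundary instead of depending on a hand-picked fraction.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; "x1" and "y" are hypothetical column names, only "week" is from the question.
data = pd.DataFrame({
    "week": [52, 49, 50, 51, 48, 51],
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [0, 1, 0, 1, 0, 1],
})
xvars, yvar = ["x1"], "y"

# Option 1: boolean mask on the week column (no sorting, no sklearn needed).
is_test = data["week"] >= 51
X_train, y_train = data.loc[~is_test, xvars], data.loc[~is_test, yvar]
X_test, y_test = data.loc[is_test, xvars], data.loc[is_test, yvar]

# Option 2: sort by week, then let train_test_split cut off the tail.
# An integer test_size is an absolute number of rows, so counting the
# rows with week >= 51 puts the cut exactly on the week boundary.
data_sorted = data.sort_values("week", kind="stable")
n_test = int((data_sorted["week"] >= 51).sum())
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    data_sorted[xvars], data_sorted[yvar], test_size=n_test, shuffle=False
)
```

Both options select the same rows for the test set here. The mask version is probably the simpler choice when the threshold is known in advance, since it does not depend on row order at all.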