I have the following data:
   Group_ID  Item_id  Target
0         1        1       0
1         1        2       0
2         1        3       1
3         2        4       0
4         2        5       1
5         2        6       1
6         3        7       0
7         4        8       0
8         5        9       0
9         5       10       1
I need to split the dataset into a training and testing set based on the "Group_ID" so that 80% of the data goes into a training set and 20% into a test set.
That is, I need my training set to look something like:
   Group_ID  Item_id  Target
0         1        1       0
1         1        2       0
2         1        3       1
3         2        4       0
4         2        5       1
5         2        6       1
6         3        7       0
7         4        8       0
And test set:
   Group_ID  Item_id  Target
8         5        9       0
9         5       10       1
What would be the simplest way to do this? As far as I know, the standard train_test_split function in sklearn does not support splitting by groups while also letting me specify the size of the split (e.g. 80/20).
I figured out the answer. This seems to work:
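from sklearn.model_selection import GroupShuffleSplit

# test_size is the fraction of groups (not rows) held out for testing
splitter = GroupShuffleSplit(test_size=0.20, n_splits=1, random_state=7)
# the column is named 'Group_ID' in the data above
split = splitter.split(df, groups=df['Group_ID'])
train_inds, test_inds = next(split)  # positional indices of the first split
train = df.iloc[train_inds]
test = df.iloc[test_inds]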
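For anyone who wants to check this end to end, here is a minimal sketch that rebuilds the example data from the question and confirms that no Group_ID lands in both sets. It assumes pandas and scikit-learn are installed; the DataFrame construction is my own reconstruction of the table above, and the names train_groups and test_groups are just illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Reconstruction of the example table from the question (assumed layout).
df = pd.DataFrame({
    "Group_ID": [1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
    "Item_id": list(range(1, 11)),
    "Target": [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})

# test_size is applied to the number of distinct groups, not to the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=df["Group_ID"]))

train = df.iloc[train_inds]
test = df.iloc[test_inds]

# Sanity check: every group must be entirely in train or entirely in test.
train_groups = set(train["Group_ID"])
test_groups = set(test["Group_ID"])
assert train_groups.isdisjoint(test_groups)

print("train groups:", sorted(train_groups))
print("test groups:", sorted(test_groups))
```

Because the split is made over groups, the row-level ratio is only approximately 80/20. With five groups of unequal size, the single held-out group can contribute anywhere from one to three of the ten rows, depending on which group is drawn.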