i have a data frame as below
df = pd.DataFrame({"Col1": ['A','B','B','A','B','B','A','B','A', 'A'],
"Col2" : [-2.21,-9.59,0.16,1.29,-31.92,-24.48,15.23,34.58,24.33,-3.32],
"Col3" : [-0.27,-0.57,0.072,-0.15,-0.21,-2.54,-1.06,1.94,1.83,0.72],
"y" : [-1,1,-1,-1,-1,1,1,1,1,-1]})
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
5 B -24.48 -2.540 1
6 A 15.23 -1.060 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
Is there a way to split the data frame(60:40 split) such that the first 60% of values each group of Col1
will be train and last 40% test.
Train :
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test:
Col1 Col2 Col3 y
5 B -24.48 -2.540 1
7 B 34.58 1.940 1
8 A 24.33 1.830 1
9 A -3.32 0.720 -1
I feel like you need groupby
here
s=df.groupby('Col1').Col1.cumcount()#get the count for each group
s=s//(df.groupby('Col1').Col1.transform('count')*0.6).astype(int)# get the top 60% of each group
Train=df.loc[s==0].copy()
Test=df.drop(Train.index)
Train
Out[118]:
Col1 Col2 Col3 y
0 A -2.21 -0.270 -1
1 B -9.59 -0.570 1
2 B 0.16 0.072 -1
3 A 1.29 -0.150 -1
4 B -31.92 -0.210 -1
6 A 15.23 -1.060 1
Test
Out[119]:
Col1 Col2 Col3 y
5 B -24.48 -2.54 1
7 B 34.58 1.94 1
8 A 24.33 1.83 1
9 A -3.32 0.72 -1