I am fairly new to Python and I would like to sample sets of data in the following dataframe by their group, without selecting the same group twice. The code I have written does sample the sets of data correctly; however, it can select the same set twice.
Please note: the following data is test data. The actual data I am using the code on is much larger, so using indexes will not be possible.
DATA:
import pandas as pd

d={'group': ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'], 'number': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],'weather':['hot','hot','hot','cold','cold','cold','hot','hot','hot','cold','cold','cold','hot','hot','hot']}
df = pd.DataFrame(data=d)
df
group number weather
A 1 hot
A 2 hot
A 3 hot
B 1 cold
B 2 cold
B 3 cold
C 1 hot
C 2 hot
C 3 hot
D 1 cold
D 2 cold
D 3 cold
E 1 hot
E 2 hot
E 3 hot
MY CODE
df_s=[]
for typ in df.group.sample(3,replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s
OUTCOME
group number weather
E 1 hot
E 2 hot
E 3 hot
E 1 hot
E 2 hot
E 3 hot
D 1 cold
D 2 cold
D 3 cold
The outcome should contain the data of 3 different groups; however, as can be seen, there are only 2 (E and D), meaning the code can select the same group more than once.
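The sample method, used with the argument replace=False, ensures that the created sample contains no duplicate rows. However, your group column has several rows with the same group letter, so sampling 3 rows from that column can return the same letter more than once. You need to sample from the unique group labels instead.
As a quick fix for your code: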
import numpy as np

df_s=[]
# Sample from the unique group labels, so each group is picked at most once
for typ in np.random.choice(df["group"].unique(), 3, replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s
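If you would rather avoid the loop, here is a minimal sketch of an alternative that I believe does the same job: deduplicate the group labels, sample 3 of them, then keep the matching rows with a single isin mask. It rebuilds the test dataframe from the question so that it runs on its own; the variable name chosen is only for illustration.

```python
import pandas as pd

# Same test data as in the question, written more compactly
d = {
    "group": list("AAABBBCCCDDDEEE"),
    "number": [1, 2, 3] * 5,
    "weather": ["hot"] * 3 + ["cold"] * 3 + ["hot"] * 3 + ["cold"] * 3 + ["hot"] * 3,
}
df = pd.DataFrame(data=d)

# Deduplicate the labels first, then sample 3 of them without replacement,
# so the same group cannot be drawn twice
chosen = df["group"].drop_duplicates().sample(3, replace=False)

# Keep every row whose group is one of the chosen labels (one mask, no loop)
df_s = df[df["group"].isin(chosen)]
print(df_s)
```

Note that this keeps the rows in their original order rather than in the order the groups were drawn. If you need reproducible draws, sample also accepts a random_state argument.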