python pandas random sample sample-data

Can I sample sets of data within a dataframe without selecting the same set twice (without replacement)?


I am fairly new to Python and I would like to sample sets of data in the following dataframe by their group, without selecting the same group twice. The code I have written samples the sets of data correctly; however, it can select the same set twice.

Please note: the following is test data. The actual data I am running the code on is much larger, so using indexes will not be possible.

DATA:

import pandas as pd

d = {'group': ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'],
     'number': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3],
     'weather': ['hot','hot','hot','cold','cold','cold','hot','hot','hot','cold','cold','cold','hot','hot','hot']}
df = pd.DataFrame(data=d)
df
group   number  weather
A       1       hot
A       2       hot
A       3       hot
B       1       cold
B       2       cold
B       3       cold
C       1       hot
C       2       hot
C       3       hot
D       1       cold
D       2       cold
D       3       cold
E       1       hot
E       2       hot
E       3       hot

MY CODE

df_s=[]
for typ in df.group.sample(3,replace=False):
    df_s.append(df[df['group']==typ])
df_s=pd.concat(df_s)
df_s

OUTCOME

group   number  weather
E       1       hot
E       2       hot
E       3       hot
E       1       hot
E       2       hot
E       3       hot
D       1       cold
D       2       cold
D       3       cold

The outcome should contain data from 3 different groups. However, as can be seen, there are only 2 (E and D), meaning the code can select the same group more than once.


Solution

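  • Calling sample with replace=False only guarantees that no row is drawn twice. The group column holds one entry per row, so each group letter appears several times in it, and three distinct rows can still carry the same letter. To get distinct groups, sample from the unique group labels instead of from the rows.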

    A quick fix for your code:

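    import numpy as np

    df_s=[]
    # draw 3 distinct group labels from the unique values of the column
    for typ in np.random.choice(df["group"].unique(), 3, replace=False):
        # collect every row that belongs to the chosen group
        df_s.append(df[df['group']==typ])
    df_s=pd.concat(df_s)
    df_s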
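
The same idea can also be written without the loop. The following is only a sketch, using the test data from the question: it reduces the group column to one entry per group with drop_duplicates, samples 3 labels from that, and then filters with isin. The random_state argument and the name chosen are not from the original post; they are added here only to make the example reproducible and readable.

```python
import pandas as pd

d = {
    'group': ['A','A','A','B','B','B','C','C','C','D','D','D','E','E','E'],
    'number': [1, 2, 3] * 5,
    'weather': ['hot'] * 3 + ['cold'] * 3 + ['hot'] * 3 + ['cold'] * 3 + ['hot'] * 3,
}
df = pd.DataFrame(data=d)

# One entry per group, then sample 3 of those labels without replacement.
# random_state is an assumption added only so the sketch is reproducible.
chosen = df['group'].drop_duplicates().sample(3, replace=False, random_state=0)

# Keep every row whose group is among the sampled labels.
df_s = df[df['group'].isin(chosen)]
print(df_s)
```

One difference from the loop version: isin keeps the rows in their original dataframe order, whereas the loop concatenates the groups in the order they were drawn. If the drawn order matters, the loop with pd.concat is probably the simpler choice.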