python pandas dataframe group-by shuffle

Pandas: shuffle rows within groups in a DataFrame, leaving the relative order of the groups intact


Given the following df:

import pandas as pd

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave', 'Sara', 'Liz'],
        'sum': [3, 4, 4, 4, 5, 5]}
df = pd.DataFrame(data)

#      Org  sum
# 0    Tom    3
# 1  Kelly    4
# 2   Rick    4
# 3   Dave    4
# 4   Sara    5
# 5    Liz    5

I want to shuffle only the rows that share the same value in the sum column, while keeping that column in sorted order.

Output should look like this:

data = {'Org': ['Tom', 'Rick', 'Dave', 'Kelly','Liz','Sara'],
        'sum': [3, 4, 4, 4, 5, 5]}
df = pd.DataFrame(data)

#      Org  sum
# 0    Tom    3
# 1   Rick    4
# 2   Dave    4
# 3  Kelly    4
# 4    Liz    5
# 5   Sara    5

df.sample(frac=1) shuffles all rows, which is not what I want to achieve.


Solution

  • sorted groups

    If your groups are contiguous, and you want to keep the relative order, use groupby.sample:

    out = df.groupby('sum', sort=False).sample(frac=1)
    

    Example output:

         Org  sum
    0    Tom    3
    3   Dave    4
    1  Kelly    4
    2   Rick    4
    5    Liz    5
    4   Sara    5
    

    If you want the output sorted by sum, then:

    out = df.groupby('sum', sort=True).sample(frac=1)  # sort=True (the default) orders the groups by key
    # or
    out = df.sample(frac=1).sort_values(by='sum', kind='stable')
    

    which will ensure that the groups are sorted, even if they are not sorted in the input.
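
A minimal sketch to check both variants on the question's data. The random_state=0 seed is an arbitrary choice, added here only to make the shuffle reproducible; it is not part of the original snippets.

```python
import pandas as pd

df = pd.DataFrame({'Org': ['Tom', 'Kelly', 'Rick', 'Dave', 'Sara', 'Liz'],
                   'sum': [3, 4, 4, 4, 5, 5]})

# Variant 1: shuffle within each group; sort=True orders the groups by key
out1 = df.groupby('sum', sort=True).sample(frac=1, random_state=0)

# Variant 2: shuffle all rows, then stable-sort so ties keep their shuffled order
out2 = df.sample(frac=1, random_state=0).sort_values(by='sum', kind='stable')

# Group membership before shuffling, used as the reference
expected = {k: set(g['Org']) for k, g in df.groupby('sum')}

for out in (out1, out2):
    # the 'sum' column stays sorted and every name stays in its own group
    members = {k: set(g['Org']) for k, g in out.groupby('sum')}
    print(out['sum'].tolist(), members == expected)
```

Both variants keep the original index labels, so add .reset_index(drop=True) if you need a fresh 0..n-1 index.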

  • intact groups

    Conversely, if you want to leave the original order of the groups fully intact but want to still shuffle within a group, like in this example:

         Org  sum
    0    Tom    3
    1  Kelly    4
    2   Rick    4
    3   Sara    5
    4    Liz    5
    5   Dave    4 # this is part of group "4" but we want the row to stay there
    

    Use groupby.transform to shuffle the index labels within each group, then select the rows in that order with loc:

    out = df.loc[df.groupby('sum', sort=False)['sum']
                   .transform(lambda g: g.sample(frac=1).index)]
    

    Example output:

         Org  sum
    0    Tom    3
    2   Rick    4
    5   Dave    4
    4    Liz    5
    3   Sara    5
    1  Kelly    4 # rows were shuffled within the group; the group's positions are unchanged
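
A minimal sketch to check the "intact groups" approach on the non-contiguous example above. The shared np.random.RandomState(0) is an assumption added only for reproducibility, and the name shuffled_idx is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Non-contiguous groups: 'Dave' belongs to group 4 but sits in the last row
df = pd.DataFrame({'Org': ['Tom', 'Kelly', 'Rick', 'Sara', 'Liz', 'Dave'],
                   'sum': [3, 4, 4, 5, 5, 4]})

rng = np.random.RandomState(0)  # arbitrary seed, shared across groups

# For each group, replace the values with that group's shuffled index labels;
# transform aligns the result back to the original row positions
shuffled_idx = (df.groupby('sum', sort=False)['sum']
                  .transform(lambda g: g.sample(frac=1, random_state=rng).index))

# Select the rows in the shuffled order
out = df.loc[shuffled_idx]

# The 'sum' column is unchanged position by position
print(out['sum'].tolist() == df['sum'].tolist())
```

Each position keeps its group, and only which member of the group sits there changes.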