Given the following df:
import pandas as pd

data = {'Org': ['Tom', 'Kelly', 'Rick', 'Dave', 'Sara', 'Liz'],
        'sum': [3, 4, 4, 4, 5, 5]}
df = pd.DataFrame(data)
# Org sum
# 0 Tom 3
# 1 Kelly 4
# 2 Rick 4
# 3 Dave 4
# 4 Sara 5
# 5 Liz 5
I want to shuffle only the rows that share a duplicate value in the sum column, while keeping that column in sorted order.
The output should look like this (one possible result):
data = {'Org': ['Tom', 'Rick', 'Dave', 'Kelly','Liz','Sara'],
'sum': [3, 4, 4, 4, 5, 5]}
df = pd.DataFrame(data)
# Org sum
# 0 Tom 3
# 1 Rick 4
# 2 Dave 4
# 3 Kelly 4
# 4 Liz 5
# 5 Sara 5
df.sample(frac=1) shuffles all rows, which is not what I want to achieve.
If your groups are contiguous and you want to keep their relative order, use groupby.sample:
out = df.groupby('sum', sort=False).sample(frac=1)
Example output:
Org sum
0 Tom 3
3 Dave 4
1 Kelly 4
2 Rick 4
5 Liz 5
4 Sara 5
If you want the output sorted by sum, then:
out = df.groupby('sum', sort=True).sample(frac=1)
# or
out = df.sample(frac=1).sort_values(by='sum', kind='stable')
Either variant ensures that the groups are sorted, even if they are not sorted in the input.
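As a rough check of the sorted variants, here is a minimal sketch. The unsorted input frame and the random_state seed are assumptions added for reproducibility; they are not part of the question.

```python
import pandas as pd

# Hypothetical input whose 'sum' column is NOT sorted.
df = pd.DataFrame({
    'Org': ['Sara', 'Tom', 'Kelly', 'Liz', 'Rick', 'Dave'],
    'sum': [5, 3, 4, 5, 4, 4],
})

# Variant 1: sort=True makes groupby emit the groups in key order,
# and sample(frac=1) shuffles the rows inside each group.
out_groupby = df.groupby('sum', sort=True).sample(frac=1, random_state=0)

# Variant 2: shuffle every row, then stable-sort by 'sum' so the
# shuffled order is preserved among rows with equal 'sum'.
out_sorted = df.sample(frac=1, random_state=0).sort_values(by='sum', kind='stable')

print(out_groupby['sum'].tolist())
print(out_sorted['sum'].tolist())
```

Both printed lists should be in ascending order, while the order of Org within each equal-sum block depends on the seed.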
Conversely, if you want to leave the original positions of the groups fully intact but still shuffle within each group, as in this example:
Org sum
0 Tom 3
1 Kelly 4
2 Rick 4
3 Sara 5
4 Liz 5
5 Dave 4 # this is part of group "4" but we want the row to stay there
Then use groupby.transform to shuffle the indices in place, then reindex:
out = df.loc[df.groupby('sum', sort=False)['sum']
.transform(lambda g: g.sample(frac=1).index)]
Example output:
Org sum
0 Tom 3
2 Rick 4
5 Dave 4
4 Liz 5
3 Sara 5
1 Kelly 4 # the group was shuffled, not the absolute position
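To see that the positional layout of sum really is preserved, the approach above can be sketched end to end as follows. The random_state seed and the name shuffled_index are assumptions added for illustration.

```python
import pandas as pd

# Same non-contiguous input as in the example above.
df = pd.DataFrame({
    'Org': ['Tom', 'Kelly', 'Rick', 'Sara', 'Liz', 'Dave'],
    'sum': [3, 4, 4, 5, 5, 4],
})

# For each group, return its own index labels in shuffled order;
# transform writes them back at the positions that group occupies in df.
shuffled_index = (
    df.groupby('sum', sort=False)['sum']
      .transform(lambda g: g.sample(frac=1, random_state=0).index)
)

# Select the rows by the shuffled labels.
out = df.loc[shuffled_index]

# The 'sum' column should keep its original positional layout.
print(out['sum'].tolist() == df['sum'].tolist())
```

Only the rows inside each group trade places, so every Org still carries its original sum value.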