[SOLVED] Randomly Select Columns to Shuffle of a Two-Dimensional Dataframe

I would like to randomly select a few columns of a two-dimensional dataframe and shuffle the values within those columns. I can easily shuffle all values (column-wise) of the dataframe, but I want to do so only for a few randomly selected columns.

For instance, take the 6x6 dataframe below:


      0    1     2     3     4     5
0     5    3     7     1     2     9
1     1    7     5     3     0     8
2     0    2     7     1     6     5
3     8    4     2     1     9     7
4     2    9     5     6     3     4
5     7    5     8     2     1     0

After randomly selecting a few of the 6 columns and shuffling them, the output might look like this:

      0    1     2     3     4     5
0     2    9     7     1     2     4
1     5    7     5     3     0     0
2     7    2     7     1     6     5
3     8    3     2     1     9     7
4     1    5     5     6     3     9
5     0    4     8     2     1     8

The output above shows the 1st, 2nd, and last columns shuffled, while all others remain as they were.

The following code allows me to shuffle all columns:

import numpy as np
df = np.random.random((6,6))
np.random.shuffle(df)  # in place; reorders whole rows along the first axis

And yet, after many attempts, I have been unable to modify this to shuffle only a few randomly selected columns. Any advice would be greatly appreciated. Thank you.

Solution

Assuming this input example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4*5).reshape(4, 5, order='F'))

   0  1   2   3   4
0  0  4   8  12  16
1  1  5   9  13  17
2  2  6  10  14  18
3  3  7  11  15  19

I would use:

import numpy as np

# random number of columns (at least 1, at most all of them)
n = np.random.randint(1, df.shape[1] + 1)

# pick n random columns
cols = np.random.choice(df.columns, n, replace=False)

# shuffle them independently
df[cols] = df[cols].apply(lambda s: np.random.choice(s, len(s), replace=False))

You can even vectorize the last step with Generator.permuted (NumPy 1.20+) if efficiency is important:

rng = np.random.default_rng()

# n = rng.integers(0, df.shape[1])
# cols = rng.choice(df.columns, n, replace=False)

df[cols] = rng.permuted(df[cols], axis=0)

Example output (here columns 0, 2 and 3 happened to be selected):

   0  1   2   3   4
0  1  4  11  14  16
1  0  5   8  15  17
2  3  6  10  13  18
3  2  7   9  12  19
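
The steps above can be collected into one reusable function. The following is only a minimal sketch, not part of the original answer: the name shuffle_random_columns and its parameters rng and n are hypothetical, and it assumes the frame holds a single dtype (as in the example), since it round-trips through a NumPy array. It picks columns by position rather than by label, so it should also work when the column labels are not integers.

```python
import numpy as np
import pandas as pd


def shuffle_random_columns(df, rng=None, n=None):
    """Return (shuffled copy of df, labels of the shuffled columns).

    Hypothetical helper for illustration; assumes a single-dtype frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    if n is None:
        # Draw how many columns to shuffle: between 1 and all of them.
        n = int(rng.integers(1, df.shape[1] + 1))
    # Choose n distinct column positions.
    pos = rng.choice(df.shape[1], size=n, replace=False)
    values = df.to_numpy(copy=True)
    # permuted shuffles each selected column independently along axis 0.
    values[:, pos] = rng.permuted(values[:, pos], axis=0)
    out = pd.DataFrame(values, index=df.index, columns=df.columns)
    return out, df.columns[np.sort(pos)]


df = pd.DataFrame(np.arange(4 * 5).reshape(4, 5, order='F'))
shuffled, chosen = shuffle_random_columns(df, rng=np.random.default_rng(0), n=3)
print(chosen.tolist())
print(shuffled)
```

Passing a seeded generator makes the selection reproducible; leaving rng and n unset gives a fresh random choice on every call. The original df is left untouched because the function works on a copy.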