
Randomly Select Columns to Shuffle of a Two-Dimensional Dataframe


I would like to randomly select a few columns of a two-dimensional dataframe and shuffle the values within those columns. I can easily shuffle all values (column-wise) of the dataframe, but I want to do so only for a few randomly selected columns.

For instance, take the 6x6 dataframe below:


      0    1     2     3     4     5
0     5    3     7     1     2     9
1     1    7     5     3     0     8
2     0    2     7     1     6     5
3     8    4     2     1     9     7
4     2    9     5     6     3     4
5     7    5     8     2     1     0

After randomly selecting a few of the 6 columns and shuffling them, the output could look like this:

      0    1     2     3     4     5
0     2    9     7     1     2     4
1     5    7     5     3     0     0
2     7    2     7     1     6     5
3     8    3     2     1     9     7
4     1    5     5     6     3     9
5     0    4     8     2     1     8

The above shows the 1st, 2nd and last columns shuffled, while all others remain as they were.

The following code allows me to shuffle all columns:

import numpy as np
df = np.random.random((6,6))
# shuffle in place along the first axis
np.random.shuffle(df)

And yet, after many attempts, I have been unable to modify this to shuffle only a few randomly selected columns. Any advice would be greatly appreciated. Thank you.


Solution

  • Assuming this input example:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.arange(4*5).reshape(4, 5, order='F'))
    
       0  1   2   3   4
    0  0  4   8  12  16
    1  1  5   9  13  17
    2  2  6  10  14  18
    3  3  7  11  15  19
    

    I would use:

    import numpy as np
    
    # random number of columns (at least 1, at most all of them)
    n = np.random.randint(1, df.shape[1] + 1)
    
    # pick n random columns
    cols = np.random.choice(df.columns, n, replace=False)
    
    # shuffle them independently
    df[cols] = df[cols].apply(lambda s: np.random.choice(s, len(s), replace=False))
    

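    If efficiency is important, you can vectorize the last step with Generator.permuted, which shuffles each column independently when axis=0:

    rng = np.random.default_rng()
    
    # the selection above can also be done with the same generator:
    # n = rng.integers(1, df.shape[1] + 1)
    # cols = rng.choice(df.columns, n, replace=False)
    
    df[cols] = rng.permuted(df[cols], axis=0)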
    

    Example output (here columns 0, 2 and 3 were selected):

       0  1   2   3   4
    0  1  4  11  14  16
    1  0  5   8  15  17
    2  3  6  10  13  18
    3  2  7   9  12  19
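
The steps above could be wrapped into a small reusable function. This is only a sketch: the name shuffle_random_columns and its rng parameter are hypothetical, not part of pandas or NumPy, and it assumes unique column labels. It returns a shuffled copy plus the labels of the columns that were picked, so the original dataframe stays untouched and the result is easy to check.

```python
import numpy as np
import pandas as pd


def shuffle_random_columns(df, rng=None):
    """Return (copy of df with some columns shuffled, labels of those columns).

    Hypothetical helper for illustration; `rng` may be None, an integer
    seed, or an existing numpy Generator.
    """
    rng = np.random.default_rng(rng)

    # choose how many columns to shuffle: at least 1, at most all of them
    n = rng.integers(1, df.shape[1] + 1)

    # choose n distinct column positions, then map them to labels
    positions = np.sort(rng.choice(df.shape[1], size=n, replace=False))
    cols = list(df.columns[positions])

    out = df.copy()
    # axis=0 permutes the values within each selected column independently
    out[cols] = rng.permuted(df[cols].to_numpy(), axis=0)
    return out, cols


df = pd.DataFrame(np.arange(4 * 5).reshape(4, 5, order='F'))
shuffled, picked = shuffle_random_columns(df, rng=0)
print("shuffled columns:", picked)
print(shuffled)
```

Passing an integer seed makes a run reproducible; omitting rng draws fresh randomness each call.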