
How to randomly split a DataFrame into several smaller DataFrames?


I'm having trouble randomly splitting a DataFrame df into several smaller DataFrames.

df
  movie_id  1   2   4   5   6   7   8   9   10  11  12  borda
0   1       5   4   0   4   4   0   0   0   4   0   0   21
1   2       3   0   0   3   0   0   0   0   0   0   0   6   
2   3       4   0   0   0   0   0   0   0   0   0   0   4   
3   4       3   0   0   0   0   5   0   0   4   0   5   17  
4   5       3   0   0   0   0   0   0   0   0   0   0   3   
5   6       5   0   0   0   0   0   0   5   0   0   0   10  
6   7       4   0   0   0   2   5   3   4   4   0   0   22  
7   8       1   0   0   0   4   5   0   0   0   4   0   14  
8   9       5   0   0   0   4   5   0   0   4   5   0   23  
9   10      3   2   0   0   0   4   0   0   0   0   0   9   
10  11      2   0   4   0   0   3   3   0   4   2   0   18  
11  12      5   0   0   0   4   5   0   0   5   2   0   21  
12  13      5   4   0   0   2   0   0   0   3   0   0   14  
13  14      5   4   0   0   5   0   0   0   0   0   0   14  
14  15      5   0   0   0   3   0   0   0   0   5   5   18  
15  16      5   0   0   0   0   0   0   0   4   0   0   9   
16  17      3   0   0   4   0   0   0   0   0   0   0   7   
17  18      4   0   0   0   0   0   0   0   0   0   0   4   
18  19      5   3   0   0   4   0   0   0   0   0   0   12  
19  20      4   0   0   0   0   0   0   0   0   0   0   4   
20  21      1   0   0   3   3   0   0   0   0   0   0   7   
21  22      4   0   0   0   3   5   5   0   5   4   0   26  
22  23      4   0   0   0   4   3   0   0   5   0   0   16  
23  24      3   0   0   4   0   0   0   0   0   3   0   10  

I've tried sample and arange, but with bad results.

ran1 = df.sample(frac=0.2, replace=False, random_state=1)
ran2 = df.sample(frac=0.2, replace=False, random_state=1)
ran3 = df.sample(frac=0.2, replace=False, random_state=1)
ran4 = df.sample(frac=0.2, replace=False, random_state=1)
ran5 = df.sample(frac=0.2, replace=False, random_state=1)

print(ran1, '\n')
print(ran2, '\n')
print(ran3, '\n')
print(ran4, '\n')
print(ran5, '\n')

This produced five identical DataFrames.

   movie_id  1  2  4  5  6  7  8  9  10  11  12  borda  
13    14     5  4  0  0  5  0  0  0   0   0   0     14  
18    19     5  3  0  0  4  0  0  0   0   0   0     12  
3     4      3  0  0  0  0  5  0  0   4   0   5     17  
14    15     5  0  0  0  3  0  0  0   0   5   5     18  
20    21     1  0  0  3  3  0  0  0   0   0   0      7  

I've also tried:

g = df.groupby(['movie_id'])
h = np.arange(g.ngroups)
np.random.shuffle(h)

df[g.ngroup().isin(h[:6])]

The output:

    movie_id    1   2   4   5   6   7   8   9   10  11  12  borda   
4      5        3   0   0   0   0   0   0   0   0   0   0   3   
6      7        4   0   0   0   2   5   3   4   4   0   0   22  
7      8        1   0   0   0   4   5   0   0   0   4   0   14  
16     17       3   0   0   4   0   0   0   0   0   0   0   7   
17     18       4   0   0   0   0   0   0   0   0   0   0   4   
18     19       5   3   0   0   4   0   0   0   0   0   0   12  

But this still yields only one smaller group; the remaining rows of df aren't assigned to any group.

I expect the smaller groups to be evenly sized, based on a percentage, and the whole df to be split into groups.


Solution

  • Use np.array_split

    import numpy as np
    shuffled = df.sample(frac=1)           # shuffle all rows
    result = np.array_split(shuffled, 5)   # list of 5 DataFrames
    

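    df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled DataFrame into parts of nearly equal size: with 24 rows and 5 parts, four parts get 5 rows and the last gets 4. Your own attempt returned identical DataFrames because every call used the same random_state=1, so each call drew the same sample.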

    It gives you:

    for part in result:
        print(part,'\n')
    
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    5          6  5  0  0  0  0  0  0  5   0   0   0     10
    4          5  3  0  0  0  0  0  0  0   0   0   0      3
    7          8  1  0  0  0  4  5  0  0   0   4   0     14
    16        17  3  0  0  4  0  0  0  0   0   0   0      7
    22        23  4  0  0  0  4  3  0  0   5   0   0     16 
    
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    13        14  5  4  0  0  5  0  0  0   0   0   0     14
    14        15  5  0  0  0  3  0  0  0   0   5   5     18
    21        22  4  0  0  0  3  5  5  0   5   4   0     26
    1          2  3  0  0  3  0  0  0  0   0   0   0      6
    20        21  1  0  0  3  3  0  0  0   0   0   0      7 
    
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    10        11  2  0  4  0  0  3  3  0   4   2   0     18
    9         10  3  2  0  0  0  4  0  0   0   0   0      9
    11        12  5  0  0  0  4  5  0  0   5   2   0     21
    8          9  5  0  0  0  4  5  0  0   4   5   0     23
    12        13  5  4  0  0  2  0  0  0   3   0   0     14 
    
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    18        19  5  3  0  0  4  0  0  0   0   0   0     12
    3          4  3  0  0  0  0  5  0  0   4   0   5     17
    0          1  5  4  0  4  4  0  0  0   4   0   0     21
    23        24  3  0  0  4  0  0  0  0   0   3   0     10
    6          7  4  0  0  0  2  5  3  4   4   0   0     22 
    
        movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
    17        18  4  0  0  0  0  0  0  0   0   0   0      4
    2          3  4  0  0  0  0  0  0  0   0   0   0      4
    15        16  5  0  0  0  0  0  0  0   4   0   0      9
    19        20  4  0  0  0  0  0  0  0   0   0   0      4
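
If you need the split to be reproducible, one possible variant is to shuffle the row positions with a seeded generator and slice with iloc. This is only a sketch: the helper name split_randomly, its seed parameter, and the demo DataFrame are made up for illustration and are not part of pandas or of the question's data. It also keeps the np.array_split call on a plain array of positions rather than on the DataFrame itself.

```python
import numpy as np
import pandas as pd


def split_randomly(df, n_parts, seed=None):
    """Hypothetical helper: shuffle row positions, then cut them into
    n_parts chunks of nearly equal size and return one DataFrame per chunk."""
    rng = np.random.default_rng(seed)
    positions = rng.permutation(len(df))         # random order of 0..len(df)-1
    chunks = np.array_split(positions, n_parts)  # arrays of row positions
    return [df.iloc[chunk] for chunk in chunks]


# Stand-in data with 24 rows, like df in the question (values are made up).
demo = pd.DataFrame({"movie_id": range(1, 25), "borda": range(24)})

parts = split_randomly(demo, 5, seed=1)
sizes = [len(part) for part in parts]
print(sizes)  # [5, 5, 5, 5, 4]
```

Every row lands in exactly one part, and calling the helper again with the same seed returns the same split. Leave seed as None to get a different split on each run.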