pythonpandasdataframesubset

(Python/Pandas) Subset DataFrame based on non-missing values from a column


I have a pd dataframe:

import pandas as pd
column1 = [None,None,None,4,8,9,None,None,None,2,3,5,None]
column2 = [None,None,None,None,5,1,None,None,6,3,3,None,None]
column3 = [None,None,None,3,None,7,None,None,7,None,None,1,None]
df = pd.DataFrame(np.column_stack([column1, column2,column3]),columns=['column1', 'column2', 'column3'])

print(df)
   column1 column2 column3
0     None    None    None
1     None    None    None
2     None    None    None
3        4    None       3
4        8       5    None
5        9       1       7
6     None    None    None
7     None    None    None
8     None       6       7
9        2       3    None
10       3       3    None
11       5    None       1
12    None    None    None

I want to subset the rows between the values in column 3, and get rid of all empty rows. My desired outcomes are:

print (df1)   
   column1 column2 column3
0        4    None       3
1        8       5    None
2        9       1       7

print(df2)
   column1 column2 column3
0     None       6       7
1        2       3    None
2        3       3    None
3        5    None       1

I don't care about the actual values column3. Column 3 values are used to indicate "start" and "stop".


Solution

  • You can find the non-na value, then perform a cumulative sum, then mod 2 to get the "groups" of start and one-less-than stop positions. Shifting this by 1, adding to the original, and clipping to (0, 1) gets clumps of the start and stop points.
    To label the groups, you can take a diff of 1, then clip to (0, 1) again, and cum sum, then multiply those two together.

    g_small = (~df.column3.isna()).cumsum().mod(2)
    g = (g_small  + g_small .shift(1, fill_value=0)).clip(0,1)
    
    groups = g.diff(1).fillna(0).clip(0,1).cumsum().astype(int) * g
    

    You can then do a groupby operation on the data frame:

    dfs = {i: g for i, g in df.groupby(groups) if i > 0}
    
    dfs
    # returns:
    {1:
       column1 column2 column3
     3       4    None       3
     4       8       5    None
     5       9       1       7,
    
     2:
        column1 column2 column3
     8     None       6       7
     9        2       3    None
     10       3       3    None
     11       5    None       1}