
Find adjacent regions in Pandas Series


I would like to select every value above 1 that belongs to a region connected to an element with a value above 5. Two values are not connected if they are separated by a 0.

For the following data set,

pd.Series(data = [0,2,0,2,3,6,3,0])

the output should be

pd.Series(data = [False,False,False,True,True,True,True,False])

Solution

  • This can be done in one line with the pandas groupby function. Step by step:

    import pandas as pd
    
    ts = pd.Series(data = [0,2,0,2,3,6,3,0])
    
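    # The flag column labels each sequence: the cumulative count of zeros
    # increases at every 0, so values between two zeros share a flag.
    # Each 0 ends up in the sequence it starts, which is harmless because
    # the "value above 1" test on the next step excludes zeros anyway.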
    df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag'])
    
    #   val  flag
    #0    0     1
    #1    2     1
    #2    0     2
    #3    2     2
    #4    3     2
    #5    6     2
    #6    3     2
    #7    0     3
    
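    # Within each group (rows sharing a flag), combine two conditions:
    # the group contains a value above 5 AND the value itself is above 1
    # (which excludes zeros). Since * binds tighter than >, the expression
    # is evaluated as ((x>5).any() * x) > 1.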
    df.groupby('flag').transform(lambda x: (x>5).any() * x > 1)
    
    #Out[32]: 
    #     val
    #0  False
    #1  False
    #2  False
    #3   True
    #4   True
    #5   True
    #6   True
    #7  False
    

    The steps above collapse into a single line:

    ts.groupby((ts==0).cumsum()).transform(lambda x: (x>5).any() * x > 1).astype(bool)
    

    For reference, here is my first approach:

    import itertools
    import pandas as pd
    
    def flatten(l):
        # Util function to flatten a list of lists
        # e.g. [[1], [2,3]] -> [1,2,3]
        return list(itertools.chain(*l))
    
    ts = pd.Series(data = [0,2,0,2,3,6,3,0])
    # Get the data as a plain Python list
    values = ts.values.tolist()
    
    # From what I understand, the 0s delimit subsequences (so numbers are not
    # connected if they are separated by a 0)
    
    # Get location of zeros
    gap_loc = [idx for (idx, el) in enumerate(values) if el==0]  
    # Zeros never satisfy the condition, so mark their positions as False
    gap_series = pd.Series(False, index = gap_loc)
    
    # Get values and locations of the subsequences (i.e. separated by zeros).
    # Note: only subsequences enclosed by two zeros are captured here.
    valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
    list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
    # list_seq = [[2], [2, 3, 6, 3]]
    
    # Check the condition: value above 1 and some value above 5 in the same subsequence
    check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist] 
                         for sublist in list_seq]
    # Put results back into a pandas Series
    valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc))
    
    # Put everything together:
    result = pd.concat([gap_series, valid_series], axis = 0).sort_index()
    
    #result
    #Out[101]: 
    #0    False
    #1    False
    #2    False
    #3     True
    #4     True
    #5     True
    #6     True
    #7    False
    #dtype: bool
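
The groupby idea above can also be written with an explicit boolean AND instead of the multiplication, which avoids relying on operator precedence. The following is only a sketch: the function name `connected_to_peak` and the parameters `low` and `high` are made up for illustration and are not part of the original answer. It also handles a series that does not start or end with a 0, a case where the list-based approach above would drop the values outside the first and last zero.

```python
import pandas as pd


def connected_to_peak(ts, low=1, high=5):
    """Flag values above `low` whose zero-delimited run contains a value above `high`.

    Hypothetical helper; `low` and `high` default to the thresholds in the question.
    """
    # Each zero starts a new run label, so values between two zeros share a label.
    run_id = (ts == 0).cumsum()
    # True for every element of a run that holds at least one value above `high`.
    has_peak = (ts > high).groupby(run_id).transform("any")
    # Explicit boolean AND; zeros fail the `ts > low` test when low >= 0.
    return (ts > low) & has_peak


ts = pd.Series([0, 2, 0, 2, 3, 6, 3, 0])
result = connected_to_peak(ts)
print(result.tolist())
# [False, False, False, True, True, True, True, False]

# A series that neither starts nor ends with a zero
unbounded = pd.Series([6, 2, 0, 2, 3])
print(connected_to_peak(unbounded).tolist())
# [True, True, False, False, False]
```

Splitting the expression into `run_id`, `has_peak` and the final AND keeps each step inspectable, at the cost of no longer being a one-liner.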