Tags: pandas, group-by, startswith

How to check whether rows start with or contain the same value when grouped by another column in pandas


I have a dataframe that I'd like to group by id and by the common 'top-level' string, with a count of the 'lower-level' strings each group includes. For example,

    id name
    1  AA-BB-CC-DD
    1  AA-BB-CC
    1  AA-BB-CC-DD-EE
    1  AA-BB-UU-VV
    1  AA-BB-UU
    1  FF-MM-NN
    1  FF-MM-NN-PP
    2  XX-YY
    2  XX-ZZ
    2  XX-ZZ-AA

For id 1, the first top level of column name is AA, the second is BB, the third is CC, and so on. The common 'highest' level of the first three rows of id 1, for example, is AA-BB-CC.

The desired output is

    id  name      count
    1   AA-BB-CC  3
    1   AA-BB-UU  2
    1   FF-MM-NN  2
    2   XX-YY     1
    2   XX-ZZ     2
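
For reference, the example frame can be rebuilt with the snippet below (a sketch; it only assumes pandas imported as pd and the two columns shown above).

    import pandas as pd

    # same 10 rows as in the table above
    df = pd.DataFrame({
        'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
        'name': ['AA-BB-CC-DD', 'AA-BB-CC', 'AA-BB-CC-DD-EE', 'AA-BB-UU-VV',
                 'AA-BB-UU', 'FF-MM-NN', 'FF-MM-NN-PP',
                 'XX-YY', 'XX-ZZ', 'XX-ZZ-AA'],
    })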

Thank you.


Solution

  • IIUC, you can form groups using the first level, then aggregate with groupby.agg and a custom function:

    # grouping key: everything before the first '-' (i.e. the first level)
    group = df['name'].str.extract('^([^-]+)', expand=False)
    
    def min_string(s):
        # shortest hierarchy string of the group (fewest '-' separators)
        return min(s, key=lambda x: x.count('-'))
    
    out = (df
     .groupby(['id', group], as_index=False)
     .agg(name=('name', min_string),
          count=('name', 'count')
         )
    )
    

    output:

       id      name  count
    0   1  AA-BB-CC      5
    1   1  FF-MM-NN      2
    2   2     XX-YY      3
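
    As a quick illustration of the two building blocks (a standalone sketch on a few values from the example, assuming pandas imported as pd):

    # the regex keeps everything before the first '-', i.e. the first level
    pd.Series(['AA-BB-CC-DD', 'FF-MM-NN']).str.extract('^([^-]+)', expand=False)
    # 0    AA
    # 1    FF
    
    # min_string returns the member with the fewest '-' separators,
    # i.e. the shortest hierarchy string of the group
    min(['AA-BB-CC-DD', 'AA-BB-CC', 'AA-BB-CC-DD-EE'], key=lambda x: x.count('-'))
    # 'AA-BB-CC'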
    

    update: handle the minimal common set of levels

    The first approach groups every row that shares the same first level, so for id 1 the AA-BB-CC… and AA-BB-UU… rows fall into a single group of 5. To report only the levels shared by all names in a group, change min_string to:

    def min_string(s):
        # keep only the levels on which all strings of the group agree
        # (zip truncates at the shortest string)
        return '-'.join(x[0] for x in zip(*s.str.split('-'))
                        if len(set(x)) == 1)
    

    example input:

       id            name
    0   1     AA-BB-CC-DD
    1   1        AA-BB-CC
    2   1  AA-BB-CC-DD-EE
    3   1     AA-BB-UU-VV
    4   1        AA-BB-UU
    5   1        FF-MM-NN
    6   1     FF-MM-NN-PP
    7   2           XX-YY
    8   2           XX-ZZ
    9   2        XX-ZZ-AA
    

    output:

       id      name  count
    0   1     AA-BB      5
    1   1  FF-MM-NN      2
    2   2        XX      3
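
    For a single group, the revised min_string keeps only the positions on which every row agrees; note that zip stops at the shortest string. A standalone sketch (with min_string defined as above):

    s = pd.Series(['AA-BB-CC-DD', 'AA-BB-CC', 'AA-BB-UU'])
    # zip(*s.str.split('-')) yields ('AA','AA','AA'), ('BB','BB','BB'), ('CC','CC','UU');
    # only the first two positions are unanimous
    min_string(s)
    # 'AA-BB'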
    

    update 2: maximal common subgroup

    To get exactly the desired output, build the grouping key itself from the maximal common prefix of names within each id, then aggregate as before:

    def make_groups(s, sep='-'):
        d = {}
        # split into levels and sort, so each prefix comes right before
        # the longer names that extend it
        s = s.str.split(sep).sort_values()
        prev = s.iloc[0]
        for idx, val in s.items():
            # start a new group when the running prefix no longer matches
            if val[:len(prev)] != prev:
                prev = val
            d[idx] = sep.join(prev)
        return pd.Series(d, index=s.index)
    
    # per-id grouping key: label each name with its run's common prefix
    group = df.groupby('id', group_keys=False)['name'].apply(make_groups)
    
    out = (df
     .groupby(['id', group], as_index=False)
     .agg(name=('name', min_string),
          count=('name', 'count')
         )
    )
    

    output:

       id      name  count
    0   1  AA-BB-CC      3
    1   1  AA-BB-UU      2
    2   1  FF-MM-NN      2
    3   2     XX-YY      1
    4   2     XX-ZZ      2
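
    To see what make_groups assigns per row, it can be called on a single id's names (a standalone sketch, with make_groups defined as above; the index order of the result follows the internal sort, which the later groupby realigns on the index):

    s = pd.Series(['AA-BB-CC-DD', 'AA-BB-CC', 'AA-BB-UU-VV', 'AA-BB-UU', 'FF-MM-NN'])
    make_groups(s)
    # 1    AA-BB-CC
    # 0    AA-BB-CC
    # 3    AA-BB-UU
    # 2    AA-BB-UU
    # 4    FF-MM-NN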
    

    used input: same as the example input shown above.