
Unique item per entity


I want to aggregate the information per Name.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Pair': ['b', 'c', 'd', 'e', 'b', 'f', 'd', 'g'],
                   'List 1': ['1,2', '2,3,4', np.nan, '5', np.nan, '1,2,3', '3,6', '5,4'],
                   'List 2': ['1,2,3', ',3,4', '3', '4,5', np.nan, '1,3', '1,3,6', '5,4']})


For the output, I want to split 'Pair', 'List 1', and 'List 2' into their individual values and join them back together per Name:

  1. 'Unique Pair': group by Name, collect all unique letters from 'Pair', and join them in sorted order.

  2. 'Unique Item': group by Name, split each string on ',', drop the NaN, and join the unique values in sorted order.

  3. 'All Item': same as above, except that all values are kept (not just the unique ones), sorted. For example, '2' appears twice in 'List 1' for name A, so it appears twice in the output.

out = pd.DataFrame({'Name': ['A', 'B'],
                    'Unique Pair': ['b,c,d,e', 'b,d,f,g'],
                    'Unique List 1 Item': ['1,2,3,4,5', '1,2,3,4,5,6'],
                    'All List 1 Item': ['1,2,2,3,4,5', '1,2,3,3,4,5,6'],
                    'Unique List 2 Item': ['1,2,3,4,5', '1,3,4,5,6'],
                    'All List 2 Item': ['1,2,3,3,3,4,4,5', '1,1,3,3,4,5,6']})

I have tried this, but I don't know how to join all the items together:

df.groupby('Name')['List 1'].fillna('NA').apply(lambda x: x.split(','))

My raw dataset has more than 500,000,000 rows, so I am looking for an efficient solution.


Solution

  • Here's one approach with df.groupby:

    out = df.groupby('Name', as_index=False).agg(
        **{
           'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x))),
           'Unique Item': ('List 1', lambda x: ','.join(
               np.unique(x.dropna().str.split(',').explode()))),
           'All Item': ('List 1', lambda x: ','.join(
               x.dropna().str.split(',').explode().sort_values()))
        })
    

    Output:

      Name Unique Pair  Unique Item       All Item
    0    A     b,c,d,e    1,2,3,4,5    1,2,2,3,4,5
    1    B     b,d,f,g  1,2,3,4,5,6  1,2,3,3,4,5,6
    

    Explanation

    Use groupby.agg with named aggregations; see the pandas documentation on named aggregation. Because the output column names contain spaces, the name/aggregation pairs are passed via ** dictionary unpacking rather than as plain keyword arguments.
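    As a minimal illustration of the named-aggregation syntax (the output names n_pairs and first_list here are just placeholders, not part of the question):

    demo = df.groupby('Name', as_index=False).agg(
        n_pairs=('Pair', 'nunique'),      # number of distinct 'Pair' values per Name
        first_list=('List 1', 'first'),   # first non-null 'List 1' value per Name
    )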


    To avoid repetition of x.dropna().str.split(',').explode() (which is expensive) for both 'Unique Item' and 'All Item', you can also adjust your df before starting the groupby aggregations. E.g., do something like this:

    df = (df.assign(**{'List 1': df['List 1'].str.split(',')})
          .explode('List 1', ignore_index=True)
          )
    
    out2 = df.groupby('Name', as_index=False).agg(
        **{
           'Unique Pair': ('Pair', lambda x: ','.join(np.unique(x))),
           'Unique Item': ('List 1', lambda x: ','.join(np.unique(x.dropna()))),
           'All Item': ('List 1', lambda x: ','.join(x.dropna().sort_values()))
        })
    

    Check that both approaches give the same result:

    out2.equals(out)
    # True
    

    If performance is important and you are working with a large dataset, this second approach will probably be faster, since the expensive split/explode step runs once up front instead of twice inside every group.
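
    If even that is too slow at 500,000,000 rows, here is a rough, fully vectorized sketch that avoids calling a Python lambda per group. It is an assumption-laden illustration only: it starts from the original, un-exploded df, handles only 'List 1' ('List 2' would be treated the same way), and relies on lexicographic string sorting, which matches the expected output because the values are single digits:

    tmp = (df[['Name', 'List 1']]
           .assign(**{'List 1': df['List 1'].str.split(',')})
           .explode('List 1')
           .dropna()
           .sort_values(['Name', 'List 1']))

    # all values per Name, already sorted
    all_items = tmp.groupby('Name')['List 1'].agg(','.join)
    # unique values per Name: drop duplicate (Name, value) rows first
    unique_items = tmp.drop_duplicates().groupby('Name')['List 1'].agg(','.join)

    Whether this actually beats the lambda-based version depends on your data, so it is worth benchmarking on a sample first.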