pythonpandasdataframeassign

pandas combine a criteria with multiple rows to a comma dilimeted column


I'm trying to combine all URL fields returned by an id in the same DF into a comma delimitted field. What I can't figure out is why my assign is not working and my for loop is..

The table looks like as follows:

Ref Option1_Ref URL
1 2,3,4 /path1
2 1,4,5 /path2
3 1,6 /path3
4 1,5 /path4
4 2,5 /path5
5 3,1 /path6
7 2,5 /path7

using assign :

df=df.assign(options_url1=df.loc[df['Ref'].isin(df["Option1_Ref"].str.split(',')),['URL','Ref']].drop_duplicates('Ref')['URL'].str.join(','))

This is just returning an empty column.

for loop:

for x in df.index: df.loc[x,'options_url1']=','.join(df.loc[df['Ref'].isin(df.loc[x,"Option1_Ref"].split(',')),['URL','Ref']].drop_duplicates('Ref')['URL'].array)

This is working as expected

I'm expecting:

Ref Option1_Ref URL options_url1
1 2,3,4 /path1 /path2,/path3,/path4
2 1,4,5 /path2 /path1,/path4,/path6
3 1,6 /path3 /path1
4 1,5 /path4 /path1,/path6
4 2,5 /path5 /path2,/path6
5 3,1 /path6 /path1,/path3
7 2,5 /path5 /path2,/path6

Solution

  • Create Series by remove duplicates by DataFrame.drop_duplicates, convert Ref to strings and map splitted values with DataFrame.explode by Series.map, last aggregate join:

    s = df.drop_duplicates('Ref').astype({'Ref':str}).set_index('Ref')['URL']
    
    df['options_url1'] = (df["Option1_Ref"].str.split(',').explode().map(s).dropna()
                                           .groupby(level=0).agg(','.join))
    print (df)
       Ref Option1_Ref     URL          options_url1
    0    1       2,3,4  /path1  /path2,/path3,/path4
    1    2       1,4,5  /path2         /path1,/path4
    2    3           1  /path3                /path1
    3    4         1,5  /path4                /path1
    4    6         2,5  /path5                /path2
    

    Or use list comprhension for mapping and join, if possible some values not exist add if statement:

    s = df.drop_duplicates('Ref').astype({'Ref':str}).set_index('Ref')['URL']
    
    df['options_url1'] = [','.join(s.get(y) for y in x.split(',') if y in s)
                          for x in df["Option1_Ref"]]
    print (df)
       Ref Option1_Ref     URL          options_url1
    0    1       2,3,4  /path1  /path2,/path3,/path4
    1    2       1,4,5  /path2         /path1,/path4
    2    3           1  /path3                /path1
    3    4         1,5  /path4                /path1
    4    6         2,5  /path5                /path2