pythonpandassplitmeltchain

Chain df.str.split() in pandas dataframe


Edit: 2022NOV21

How do we chain df.col.str.split() since this returns the split columns if expand = True I am trying to split a column after performing .melt(). If I use assign I end up using the original column and the melted column actually does not even exist.

df = pd.DataFrame().from_dict({
    'id' : [1,2,3,4],
    '2022_amt' : [10.1,20.2,30.3, 40.4],
    '2022_qty' : [10,20,30,40]
})

df = ( 
    df
    .melt(
         id_vars=['id'], 
         value_vars=['2022_amt', '2022_qty'], 
         var_name='fy', 
         value_name='num' 
    )
    # can i chain any pd.Series.str.[METHOD] here
    # .assign(
    #         year=df.fy.str.split('_', expand=True)[0], 
    #         t=df.fy.str.split('_', expand=True)[1]
    # )
    )

# i can add the two columns in this way but can we use chain to expand dataframe df
df[['year', 't']] = df.fy.str.split('_', expand=True)
df = df.drop(columns = ['fy'])

Solution

  • Using expand converts it into a DataFrame, which you do not really want here; secondly with chaining, use an anonymous function to refer to the previous dataframe:

    (df
    .melt(id_vars='id',var_name='fy',value_name='num')
    assign(year = lambda df: df.fy.str.split('_').str[0],
           t = lambda df: df.fy.str.split('_').str[1])
    )
    
       id        fy   num  year    t
    0   1  2022_amt  10.1  2022  amt
    1   2  2022_amt  20.2  2022  amt
    2   3  2022_amt  30.3  2022  amt
    3   4  2022_amt  40.4  2022  amt
    4   1  2022_qty  10.0  2022  qty
    5   2  2022_qty  20.0  2022  qty
    6   3  2022_qty  30.0  2022  qty
    7   4  2022_qty  40.0  2022  qty
    

    For your use case, there are simpler, more efficient ways to do this:

    df = df.set_index('id')
    df.columns = df.columns.str.split('_', expand = True)
    df.columns.names = ['year', 't']
    df.stack(['year', 't']).reset_index(name='num')
    
       id  year    t   num
    0   1  2022  amt  10.1
    1   1  2022  qty  10.0
    2   2  2022  amt  20.2
    3   2  2022  qty  20.0
    4   3  2022  amt  30.3
    5   3  2022  qty  30.0
    6   4  2022  amt  40.4
    7   4  2022  qty  40.0
    
    # pip install pyjanitor
    import pandas as pd
    import janitor as jn
    df.pivot_longer(index = 'id', names_to = ('year','t'), names_sep = '_')
    
       id  year    t  value
    0   1  2022  amt   10.1
    1   2  2022  amt   20.2
    2   3  2022  amt   30.3
    3   4  2022  amt   40.4
    4   1  2022  qty   10.0
    5   2  2022  qty   20.0
    6   3  2022  qty   30.0
    7   4  2022  qty   40.0