python pandas dataframe pandas-apply

Pandas: error when creating a new column using a function that takes one argument from another column


I have the following data frame df:

import pandas as pd

df = pd.DataFrame({'result' : ['s17h10e7', 's5e3h2S105h90e15', 
                               's17H10e7S5e3H2s105h90e15'],
                   'status' : [102, 117, 205]})

result                      status
s17h10e7                    102
s5e3h2S105h90e15            117
s17H10e7S5e3H2s105h90e15    205

I have a function named get_number_after_code that reads a string and returns the sum of the numbers that immediately follow each occurrence of a user-defined code (e.g. a letter). The match is case-sensitive:

def get_number_after_code(string_to_read, code):
    # Positions at which the code character occurs (case-sensitive).
    code_indices = [i for i, char in enumerate(string_to_read) if char == code]

    list_of_int_values = []

    for idx in code_indices:
        temp_number = []
        # Collect the run of digits that directly follows the code.
        for character in string_to_read[idx + 1:]:
            if not character.isdigit():
                break
            temp_number.append(character)

        # Skip a code that has no digits after it, so that int() is never
        # called on an empty or stale value.
        if temp_number:
            list_of_int_values.append(int(''.join(temp_number)))

    return sum(list_of_int_values)

Examples:

get_number_after_code('s5e3h2s105h90e15', 'h')
>> 92

get_number_after_code('s5e3h2s105h90e15', 's')
>> 110

I would like to add a column named col_NEW to the df dataframe. This col_NEW column would display the output of the get_number_after_code() function as it is applied to the row element in the result column. As an example, let's assume we use the code 'h' (but it could be either 's' or 'e'). The output would be:

result                      status     col_NEW
s17h10e7                    102        10
s5e3h2S105h90e15            117        92
s17H10e7S5e3H2s105h90e15    205        90

To do this, I'm using:

df['col_NEW'] = df.apply(get_number_after_code(df['result'], 'h'), axis=1)

I'm getting this not-so-helpful AssertionError:

AssertionError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_21060/915445793.py in <module>
----> 1 df['col_NEW'] = df.apply(count_tests_new(df['result'], 's'), axis=1)

~\anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8738             kwargs=kwargs,
   8739         )
-> 8740         return op.apply()
   8741 
   8742     def applymap(

~\anaconda3\lib\site-packages\pandas\core\apply.py in apply(self)
    686             return self.apply_raw()
    687 
--> 688         return self.apply_standard()
    689 
    690     def agg(self):

~\anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    810 
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813 
    814         # wrap results

~\anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    816 
    817     def apply_series_generator(self) -> tuple[ResType, Index]:
--> 818         assert callable(self.f)
    819 
    820         series_gen = self.series_generator

AssertionError:

Am I using .apply() syntactically correctly to add col_NEW? If yes, does anyone know what is causing this AssertionError?


Solution

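  • The expression get_number_after_code(df['result'], 'h') is evaluated immediately, with the whole "result" column (a Series) as its first argument, and its return value is what gets passed to df.apply. That value is a number, not a function, which is why pandas fails on assert callable(self.f). Since you only need the "result" column, call apply on that column and pass the function itself. The letter (for example "h") can be passed as a positional argument through args (see the Series.apply docs):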

    df['col_NEW'] = df['result'].apply(get_number_after_code, args=('h',))
    

    or by its keyword:

    df['col_NEW'] = df['result'].apply(get_number_after_code, code='h')
    

    Output:

                         result  status  col_NEW
    0                  s17h10e7     102       10
    1          s5e3h2S105h90e15     117       92
    2  s17H10e7S5e3H2s105h90e15     205       90
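
To make the cause of the AssertionError concrete, here is a minimal sketch that reuses the data frame from the question. The function body is a compact restatement of get_number_after_code so that the snippet runs on its own, and the names eager_result, row_wise and column_wise are only illustrative. It also shows a lambda as one way to keep the row-wise df.apply(..., axis=1) form, in case more than one column is needed later.

```python
import pandas as pd


def get_number_after_code(string_to_read, code):
    # Compact restatement of the question's function: sum the numbers
    # that directly follow each (case-sensitive) occurrence of `code`.
    total = 0
    for idx, char in enumerate(string_to_read):
        if char != code:
            continue
        digits = []
        for character in string_to_read[idx + 1:]:
            if not character.isdigit():
                break
            digits.append(character)
        if digits:
            total += int(''.join(digits))
    return total


df = pd.DataFrame({'result': ['s17h10e7', 's5e3h2S105h90e15',
                              's17H10e7S5e3H2s105h90e15'],
                   'status': [102, 117, 205]})

# The original call runs the function once on the whole column and hands
# its return value (a plain int) to apply, which requires a callable.
eager_result = get_number_after_code(df['result'], 'h')
print(callable(eager_result))  # False

# Row-wise form: the lambda is the callable and receives one row at a time.
row_wise = df.apply(lambda row: get_number_after_code(row['result'], 'h'), axis=1)

# Column-wise form from the answer: each string in 'result' becomes the
# first argument, and 'h' is supplied through args.
column_wise = df['result'].apply(get_number_after_code, args=('h',))

print(row_wise.tolist())     # [10, 92, 90]
print(column_wise.tolist())  # [10, 92, 90]
```

Both forms give the same values here. The column-wise call is usually the simpler choice when the function only needs a single column.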