pythonpandascalculated-columns

Calculating based on rows conditions (in pandas)


I encountered the following problem: I have a pandas dataframe that looks like this.

id_tranc sum bid
1 4000 2.3%
1 20000 3.5%
2 100000 if >=100 000 - 1.6%, if < 100 000 - 100$
3 30000 if >=100 000 - 1.6%, if < 100 000 - 100$
1 60000 500$

code_to_create_dataset:

dataframe = pd.DataFrame({
    'id_tranc': [1, 1, 2, 3, 1],
    'sum': [4000, 20000, 100000, 30000, 60000],
    'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$', 
            'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})

Necessary to calculated 'commission', depending columns 'sum' and 'bid'. Final dataframe should be look like:

id_tranc sum bid comission
1 4000 2.3% 92
1 20000 3.5% 700
2 100000 if >=100 000 - 1.6%, if < 100 000 - 100$ 1600
3 30000 if >=100 000 - 1.6%, if < 100 000 - 100$ 100
1 60000 500$ 500

If calculated with df['commission'] = df['sum'] * df['bid'] - getting result only for first 2 record. Please tell me how to do this correctly.


Solution

  • I would write a small parser based on a regex and operator:

    from operator import ge, lt, gt, le
    import re
    
    def logic(value, bid):
        # define operators, add other ones if needed
        ops = {'>=': ge, '>': gt, '<': lt, '<=': le}
        # remove spaces, split conditions on comma
        conditions = bid.replace(' ', '').split(',')
        # then loop over them, the first match will be used
        for cond in conditions:
            # extract operator, threshold, commission, unit
            m = re.search('(?:if(\W+)(\d+)-)?(\d+\.?\d*)([%$])', cond)
            if not m:     # if no match, ignore
                continue
            op, thresh, com, unit = m.groups()
            # if no condition or condition is valid
            if (not op) or (op and ops[op](value, float(thresh))):
                if unit == '%':                     # handle % case
                    return value * float(com)/100
                elif unit == '$':                   # handle fixed com
                    return float(com)
    
    df['comission'] = [logic(val, bid) for val, bid in zip(df['sum'], df['bid'])]
    
    # or with apply, which is less efficient
    # df['comission'] = df.apply(lambda row: logic(row['sum'], row['bid']), axis=1)
    

    Output:

       id_tranc     sum                                       bid  comission
    0         1    4000                                      2.3%       92.0
    1         1   20000                                      3.5%      700.0
    2         2  100000  if >=100 000 - 1.6%, if < 100 000 - 100$     1600.0
    3         3   30000  if >=100 000 - 1.6%, if < 100 000 - 100$      100.0
    4         1   60000                                      500$      500.0
    

    Regex:

    regex demo

    (?:if(\W+)(\d+)-)?  # optionally match a condition (operator and threshold)
    (\d+\.?\d*)         # match the value of the commission
    ([%$])              # match type of commission (% or $)
    

    Reproducible input:

    df = pd.DataFrame({'id_tranc': [1, 1, 2, 3, 1], 'sum': [4000, 20000, 100000, 30000, 60000],
                       'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$',
                               'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})