I encountered the following problem: I have a pandas dataframe that looks like this.
id_tranc | sum | bid |
---|---|---|
1 | 4000 | 2.3% |
1 | 20000 | 3.5% |
2 | 100000 | if >=100 000 - 1.6%, if < 100 000 - 100$ |
3 | 30000 | if >=100 000 - 1.6%, if < 100 000 - 100$ |
1 | 60000 | 500$ |
code_to_create_dataset:
dataframe = pd.DataFrame({
'id_tranc': [1, 1, 2, 3, 1],
'sum': [4000, 20000, 100000, 30000, 60000],
'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$',
'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})
Necessary to calculated 'commission', depending columns 'sum' and 'bid'. Final dataframe should be look like:
id_tranc | sum | bid | comission |
---|---|---|---|
1 | 4000 | 2.3% | 92 |
1 | 20000 | 3.5% | 700 |
2 | 100000 | if >=100 000 - 1.6%, if < 100 000 - 100$ | 1600 |
3 | 30000 | if >=100 000 - 1.6%, if < 100 000 - 100$ | 100 |
1 | 60000 | 500$ | 500 |
If calculated with df['commission'] = df['sum'] * df['bid']
- getting result only for first 2 record. Please tell me how to do this correctly.
I would write a small parser based on a regex and operator
:
from operator import ge, lt, gt, le
import re
def logic(value, bid):
# define operators, add other ones if needed
ops = {'>=': ge, '>': gt, '<': lt, '<=': le}
# remove spaces, split conditions on comma
conditions = bid.replace(' ', '').split(',')
# then loop over them, the first match will be used
for cond in conditions:
# extract operator, threshold, commission, unit
m = re.search('(?:if(\W+)(\d+)-)?(\d+\.?\d*)([%$])', cond)
if not m: # if no match, ignore
continue
op, thresh, com, unit = m.groups()
# if no condition or condition is valid
if (not op) or (op and ops[op](value, float(thresh))):
if unit == '%': # handle % case
return value * float(com)/100
elif unit == '$': # handle fixed com
return float(com)
df['comission'] = [logic(val, bid) for val, bid in zip(df['sum'], df['bid'])]
# or with apply, which is less efficient
# df['comission'] = df.apply(lambda row: logic(row['sum'], row['bid']), axis=1)
Output:
id_tranc sum bid comission
0 1 4000 2.3% 92.0
1 1 20000 3.5% 700.0
2 2 100000 if >=100 000 - 1.6%, if < 100 000 - 100$ 1600.0
3 3 30000 if >=100 000 - 1.6%, if < 100 000 - 100$ 100.0
4 1 60000 500$ 500.0
Regex:
(?:if(\W+)(\d+)-)? # optionally match a condition (operator and threshold)
(\d+\.?\d*) # match the value of the commission
([%$]) # match type of commission (% or $)
Reproducible input:
df = pd.DataFrame({'id_tranc': [1, 1, 2, 3, 1], 'sum': [4000, 20000, 100000, 30000, 60000],
'bid': ['2.3%', '3.5%', 'if >=100 000 - 1.6%, if < 100 000 - 100$',
'if >=100 000 - 1.6%, if < 100 000 - 100$', '500$']})