Perform a binary op on values in a pandas dataframe column by a value in that same column chosen based on a value in another column

Sorry for the mouthful title. I think this is best illustrated by an example. Let's say we have an item that has different rarity levels, all of which have dfferent prices in different shops. I want to know how much more expensive a given rarity is than the base "normal" rarity in each shop separately.

How could I add a new "premium" column that would give me the result of dividing the price of the item with a given rarity at a given shop, by the price of that item in "normal" quality at that shop in particular? The result can be seen in the table below.

item	quality	price	shop	premium (todo)	base price
bread	normal	2.0	Lumbridge	1.0	2.0
bread	rare	3.0	Lumbridge	1.5	2.0
bread	legendary	5.0	Lumbridge	2.5	2.0
bread	normal	1.5	Varrock	1.0	1.5
bread	rare	4.5	Varrock	3.0	1.5
bread	legendary	6.0	Varrock	4.0	1.5
bread	normal	3.0	Yanille	1.0	3.0
bread	rare	2.0	Yanille	0.66	3.0
bread	legendary	4.0	Yanille	1.33	3.0

I thought about repeating the rows with normal quality as a new column (given as the "base price" column), but I don't see any mechanism that could allow for that. If this is possible could it be done if instead of just one "item" we had many (i.e. extending the filtering to multiple columns)?

Solution

Here's one approach, assuming a single 'normal' price per item / shop.

Data used

# adding another item, and deleting 1 shop

import pandas as pd

data = {'item': {0: 'bread', 1: 'bread', 2: 'bread', 3: 'bread', 4: 'bread', 
                 5: 'bread'}, 
        'quality': {0: 'normal', 1: 'rare', 2: 'legendary', 3: 'normal', 
                    4: 'rare', 5: 'legendary'}, 
        'price': {0: 2.0, 1: 3.0, 2: 5.0, 3: 1.5, 4: 4.5, 5: 6.0}, 
        'shop': {0: 'Lumbridge', 1: 'Lumbridge', 2: 'Lumbridge', 3: 'Varrock', 
                 4: 'Varrock', 5: 'Varrock', }, 
        'premium (todo)': {0: 1.0, 1: 1.5, 2: 2.5, 3: 1.0, 4: 3.0, 5: 4.0}, 
        'base price': {0: 2.0, 1: 2.0, 2: 2.0, 3: 1.5, 4: 1.5, 5: 1.5}}
df = pd.DataFrame(data)

df = pd.concat([df, df.assign(item=df['item'].replace('bread', 'milk'))], 
               ignore_index=True)

Code

df = (
    df.merge(
        df.query('quality == "normal"')[['item', 'shop', 'price']],
        on=['item', 'shop'],
        suffixes=('', '_base'),
        how='left'
    )
    .assign(premium=lambda x: x['price'].div(x.pop('price_base')))
)

Output:

     item    quality  price       shop  premium (todo)  base price  premium
0   bread     normal    2.0  Lumbridge             1.0         2.0      1.0
1   bread       rare    3.0  Lumbridge             1.5         2.0      1.5
2   bread  legendary    5.0  Lumbridge             2.5         2.0      2.5
3   bread     normal    1.5    Varrock             1.0         1.5      1.0
4   bread       rare    4.5    Varrock             3.0         1.5      3.0
5   bread  legendary    6.0    Varrock             4.0         1.5      4.0
6    milk     normal    2.0  Lumbridge             1.0         2.0      1.0
7    milk       rare    3.0  Lumbridge             1.5         2.0      1.5
8    milk  legendary    5.0  Lumbridge             2.5         2.0      2.5
9    milk     normal    1.5    Varrock             1.0         1.5      1.0
10   milk       rare    4.5    Varrock             3.0         1.5      3.0
11   milk  legendary    6.0    Varrock             4.0         1.5      4.0

Explanation / intermediates

Use df.query to filter on 'normal' and select only columns ['item', 'shop', 'price'].

df.query('quality == "normal"')[['item', 'shop', 'price']]

    item       shop  price
0  bread  Lumbridge    2.0
3  bread    Varrock    1.5
6   milk  Lumbridge    2.0
9   milk    Varrock    1.5

Pass the filtered df to df.merge as right, left joining on ['item', 'shop'] and adding a suffix for column 'price' coming from right (_base). This generates your 'base price' column:

# df.merge(...)['price_base']

0     2.0
1     2.0
2     2.0
3     1.5
4     1.5
5     1.5
6     2.0
7     2.0
8     2.0
9     1.5
10    1.5
11    1.5
Name: price_base, dtype: float64

Chain df.assign and use Series.div with Series.pop for the divisor, assigning the result as new column 'premium'.

Easier alternative, if:

Your df is properly sorted on 'item', 'quality' (categorical, with 'normal' first), 'shop' (as in your example), and;
Each 'normal' price has a non-NaN value.

df['premium'] = df['price'].div(
    df['price'].where(df['quality'] == 'normal').ffill()
    )

Using Series.div + Series.where + Series.ffill.