pythonpandasdataframe

Perform a binary op on values in a pandas dataframe column by a value in that same column chosen based on a value in another column


Sorry for the mouthful title. I think this is best illustrated by an example. Let's say we have an item that has different rarity levels, all of which have dfferent prices in different shops. I want to know how much more expensive a given rarity is than the base "normal" rarity in each shop separately.

How could I add a new "premium" column that would give me the result of dividing the price of the item with a given rarity at a given shop, by the price of that item in "normal" quality at that shop in particular? The result can be seen in the table below.

item quality price shop premium (todo) base price
bread normal 2.0 Lumbridge 1.0 2.0
bread rare 3.0 Lumbridge 1.5 2.0
bread legendary 5.0 Lumbridge 2.5 2.0
bread normal 1.5 Varrock 1.0 1.5
bread rare 4.5 Varrock 3.0 1.5
bread legendary 6.0 Varrock 4.0 1.5
bread normal 3.0 Yanille 1.0 3.0
bread rare 2.0 Yanille 0.66 3.0
bread legendary 4.0 Yanille 1.33 3.0

I thought about repeating the rows with normal quality as a new column (given as the "base price" column), but I don't see any mechanism that could allow for that. If this is possible could it be done if instead of just one "item" we had many (i.e. extending the filtering to multiple columns)?


Solution

  • Here's one approach, assuming a single 'normal' price per item / shop.

    Data used

    # adding another item, and deleting 1 shop
    
    import pandas as pd
    
    data = {'item': {0: 'bread', 1: 'bread', 2: 'bread', 3: 'bread', 4: 'bread', 
                     5: 'bread'}, 
            'quality': {0: 'normal', 1: 'rare', 2: 'legendary', 3: 'normal', 
                        4: 'rare', 5: 'legendary'}, 
            'price': {0: 2.0, 1: 3.0, 2: 5.0, 3: 1.5, 4: 4.5, 5: 6.0}, 
            'shop': {0: 'Lumbridge', 1: 'Lumbridge', 2: 'Lumbridge', 3: 'Varrock', 
                     4: 'Varrock', 5: 'Varrock', }, 
            'premium (todo)': {0: 1.0, 1: 1.5, 2: 2.5, 3: 1.0, 4: 3.0, 5: 4.0}, 
            'base price': {0: 2.0, 1: 2.0, 2: 2.0, 3: 1.5, 4: 1.5, 5: 1.5}}
    df = pd.DataFrame(data)
    
    df = pd.concat([df, df.assign(item=df['item'].replace('bread', 'milk'))], 
                   ignore_index=True)
    

    Code

    df = (
        df.merge(
            df.query('quality == "normal"')[['item', 'shop', 'price']],
            on=['item', 'shop'],
            suffixes=('', '_base'),
            how='left'
        )
        .assign(premium=lambda x: x['price'].div(x.pop('price_base')))
    )
    

    Output:

         item    quality  price       shop  premium (todo)  base price  premium
    0   bread     normal    2.0  Lumbridge             1.0         2.0      1.0
    1   bread       rare    3.0  Lumbridge             1.5         2.0      1.5
    2   bread  legendary    5.0  Lumbridge             2.5         2.0      2.5
    3   bread     normal    1.5    Varrock             1.0         1.5      1.0
    4   bread       rare    4.5    Varrock             3.0         1.5      3.0
    5   bread  legendary    6.0    Varrock             4.0         1.5      4.0
    6    milk     normal    2.0  Lumbridge             1.0         2.0      1.0
    7    milk       rare    3.0  Lumbridge             1.5         2.0      1.5
    8    milk  legendary    5.0  Lumbridge             2.5         2.0      2.5
    9    milk     normal    1.5    Varrock             1.0         1.5      1.0
    10   milk       rare    4.5    Varrock             3.0         1.5      3.0
    11   milk  legendary    6.0    Varrock             4.0         1.5      4.0
    

    Explanation / intermediates

    df.query('quality == "normal"')[['item', 'shop', 'price']]
    
        item       shop  price
    0  bread  Lumbridge    2.0
    3  bread    Varrock    1.5
    6   milk  Lumbridge    2.0
    9   milk    Varrock    1.5
    
    # df.merge(...)['price_base']
    
    0     2.0
    1     2.0
    2     2.0
    3     1.5
    4     1.5
    5     1.5
    6     2.0
    7     2.0
    8     2.0
    9     1.5
    10    1.5
    11    1.5
    Name: price_base, dtype: float64
    

    Easier alternative, if:

    1. Your df is properly sorted on 'item', 'quality' (categorical, with 'normal' first), 'shop' (as in your example), and;
    2. Each 'normal' price has a non-NaN value.
    df['premium'] = df['price'].div(
        df['price'].where(df['quality'] == 'normal').ffill()
        )
    

    Using Series.div + Series.where + Series.ffill.