pythonpandasdataframeif-statementconditional-statements

Compare two columns using pandas


Using this as a starting point:

a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
df = pd.DataFrame(a, columns=['one', 'two', 'three'])

which looks like

  one  two three
0   10  1.2   4.2
1   15  70   0.03
2    8   5     0

I want to use something like an if statement within pandas.

if df['one'] >= df['two'] and df['one'] <= df['three']:
    df['que'] = df['one']

Basically, create a new column by checking each row via the if statement.

The docs say to use .all but there is no example...


Solution

  • You could use np.where. If cond is a boolean array, and A and B are arrays, then

    C = np.where(cond, A, B)
    

    defines C to be equal to A where cond is True, and B where cond is False.

    import numpy as np
    import pandas as pd
    
    a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
    df = pd.DataFrame(a, columns=['one', 'two', 'three'])
    
    df['que'] = np.where((df['one'] >= df['two']) & (df['one'] <= df['three'])
                         , df['one'], np.nan)
    

    yields

      one  two three  que
    0  10  1.2   4.2   10
    1  15   70  0.03  NaN
    2   8    5     0  NaN
    

    If you have more than one condition, then you could use np.select instead. For example, if you wish df['que'] to equal df['two'] when df['one'] < df['two'], then

    conditions = [
        (df['one'] >= df['two']) & (df['one'] <= df['three']), 
        df['one'] < df['two']]
    
    choices = [df['one'], df['two']]
    
    df['que'] = np.select(conditions, choices, default=np.nan)
    

    yields

      one  two three  que
    0  10  1.2   4.2   10
    1  15   70  0.03   70
    2   8    5     0  NaN
    

    If we can assume that df['one'] >= df['two'] when df['one'] < df['two'] is False, then the conditions and choices could be simplified to

    conditions = [
        df['one'] < df['two'],
        df['one'] <= df['three']]
    
    choices = [df['two'], df['one']]
    

    (The assumption may not be true if df['one'] or df['two'] contain NaNs.)


    Note that

    a = [['10', '1.2', '4.2'], ['15', '70', '0.03'], ['8', '5', '0']]
    df = pd.DataFrame(a, columns=['one', 'two', 'three'])
    

    defines a DataFrame with string values. Since they look numeric, you might be better off converting those strings to floats:

    df2 = df.astype(float)
    

    This changes the results, however, since strings compare character-by-character, while floats are compared numerically.

    In [61]: '10' <= '4.2'
    Out[61]: True
    
    In [62]: 10 <= 4.2
    Out[62]: False