
Selecting with complex criteria from pandas.DataFrame


For example, I have a simple DataFrame:

import pandas as pd
from random import randint

df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                   'B': [randint(1, 9)*10 for x in range(10)],
                   'C': [randint(1, 9)*100 for x in range(10)]})

Can I select the values from 'A' whose corresponding 'B' values are greater than 50 and whose 'C' values are not equal to 900, using pandas methods and idioms?


Solution

  • Sure! Setup:

    >>> import pandas as pd
    >>> from random import randint
    >>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
    ...                    'B': [randint(1, 9)*10 for x in range(10)],
    ...                    'C': [randint(1, 9)*100 for x in range(10)]})
    >>> df
       A   B    C
    0  9  40  300
    1  9  70  700
    2  5  70  900
    3  8  80  900
    4  7  50  200
    5  9  30  900
    6  2  80  700
    7  2  80  400
    8  5  80  300
    9  7  70  800
    

    We can apply column operations and get boolean Series objects:

    >>> df["B"] > 50
    0    False
    1     True
    2     True
    3     True
    4    False
    5    False
    6     True
    7     True
    8     True
    9     True
    Name: B, dtype: bool
    >>> (df["B"] > 50) & (df["C"] != 900)

    or, equivalently,

    >>> (df["B"] > 50) & ~(df["C"] == 900)
    0    False
    1     True
    2    False
    3    False
    4    False
    5    False
    6     True
    7     True
    8     True
    9     True
    dtype: bool
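
As a minimal sketch of how the two conditions combine, the snippet below rebuilds the frame printed above with fixed values (instead of `randint`) so the result is reproducible. The names `mask`, `selected_rows` and `and_failed` are illustrative only and are not part of the original answer.

```python
import pandas as pd

# Fixed copy of the frame shown above, so the result is reproducible.
df = pd.DataFrame({
    "A": [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
    "B": [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
    "C": [300, 700, 900, 900, 200, 900, 700, 400, 300, 800],
})

# Parentheses are required: & binds more tightly than > and !=.
mask = (df["B"] > 50) & (df["C"] != 900)

# Row labels where both conditions hold.
selected_rows = mask[mask].index.tolist()
print(selected_rows)  # [1, 6, 7, 8, 9]

# Python's `and` asks for the truth value of a whole Series,
# which pandas refuses with a ValueError; use & for element-wise logic.
try:
    (df["B"] > 50) and (df["C"] != 900)
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)  # True
```

The element-wise operators are `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses.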
    

    And then we can use these to index into the object. For read access, you can chain indices:

    >>> df["A"][(df["B"] > 50) & (df["C"] != 900)]
    1    9
    6    2
    7    2
    8    5
    9    7
    Name: A, dtype: int64
    

    but for write access this can get you into trouble, because chained indexing may operate on a copy rather than on a view of the original data, so the assignment can be silently lost. Use .loc instead:

    >>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"]
    1    9
    6    2
    7    2
    8    5
    9    7
    Name: A, dtype: int64
    >>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"].values
    array([9, 2, 2, 5, 7], dtype=int64)
    >>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"] *= 1000
    >>> df
          A   B    C
    0     9  40  300
    1  9000  70  700
    2     5  70  900
    3     8  80  900
    4     7  50  200
    5     9  30  900
    6  2000  80  700
    7  2000  80  400
    8  5000  80  300
    9  7000  70  800
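
For reference, here is a self-contained sketch of the read and write steps on a fixed copy of the same data. It also shows `DataFrame.query` as an alternative spelling of the same filter; that alternative is an addition here, not part of the original answer, and the variable names (`picked`, `picked_query`, `updated`) are illustrative.

```python
import pandas as pd

# Fixed copy of the frame used in the answer.
df = pd.DataFrame({
    "A": [9, 9, 5, 8, 7, 9, 2, 2, 5, 7],
    "B": [40, 70, 70, 80, 50, 30, 80, 80, 80, 70],
    "C": [300, 700, 900, 900, 200, 900, 700, 400, 300, 800],
})

mask = (df["B"] > 50) & (df["C"] != 900)

# Read: one .loc call with a boolean mask and a column label.
picked = df.loc[mask, "A"].tolist()
print(picked)  # [9, 2, 2, 5, 7]

# The same filter written as a query string.
picked_query = df.query("B > 50 and C != 900")["A"].tolist()
print(picked_query)  # [9, 2, 2, 5, 7]

# Write: a single .loc call assigns into df itself, not into a temporary copy.
df.loc[mask, "A"] *= 1000
updated = df["A"].tolist()
print(updated)  # [9, 9000, 5, 8, 7, 9, 2000, 2000, 5000, 7000]
```

`query` can read more naturally for simple conditions, while `.loc` with an explicit mask is the form that also supports assignment.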