python, pandas, dataframe, numpy, indexing

How to find last occurrence of value meeting condition in column in Python


I have the following dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A":['a','b','c','d','e','f','g','h','i','j','k'],
                   "B":[1,3,4,5,6,7,6,5,8,5,5]})
df

displayed as:

     A    B
0    a    1
1    b    3
2    c    4
3    d    5
4    e    6
5    f    7
6    g    6
7    h    5
8    i    8
9    j    5
10   k    5

I first want to find the letter in column "A" that corresponds to the first occurrence of a value in column "B" that is >= 6. Looking at this, we see that this would be row index 4, corresponding to a value of 6 and "e" in column "A".

I can retrieve that column "A" value with this code:

# Find first occurrence >= threshold
threshold = 6
array = df.values
array[np.where(array[:,1] >= threshold)][0,0]

This code returns 'e', which is what I want.

This code is adapted from this Stack Overflow question: Python find first occurrence in Pandas dataframe column 2 below threshold and return column 1 value same row using NumPy

What I am having trouble figuring out is how to modify this code to find the last occurrence of a value >= the threshold of 6. For the data frame above, I want to get 'i', because the row containing "i" in column "A" corresponds to a value of 8 in column "B", which is the last value >= 6. The rows must stay in their original order (alphabetical by column "A"). I am guessing this involves changing the indexing in my code, specifically the array[:,1] component or the [0,0] component, but I am not sure how to ask for the last match. How can I modify my code to find the value in column "A" corresponding to the last occurrence of a value >= 6 in column "B"?


Solution

  • To get the first occurrence, you can use idxmax. On a boolean Series, idxmax returns the index label of the first True:

    df.loc[df['B'].ge(6).idxmax()]
    

    Output:

    A    e
    B    6
    Name: 4, dtype: object
    

    For just the value in 'A':

    df.loc[df['B'].ge(6).idxmax(), 'A']
    

    Output: 'e'

    For the last, do the same on the reversed series:

    df.loc[df.loc[::-1,'B'].ge(6).idxmax()]
    

    Output:

    A    i
    B    8
    Name: 8, dtype: object

    For just the value in 'A':

    df.loc[df.loc[::-1, 'B'].ge(6).idxmax(), 'A']


    Output: 'i'
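
One caveat with the idxmax approach: if no value meets the condition, the boolean Series is all False and idxmax still returns the first label it sees, so the result looks valid but is not a match. The sketch below adds a guard for that case and also shows the smallest change to the question's NumPy code (indexing [-1, 0] instead of [0, 0]). The variable names (mask, positions, last_numpy, last_pandas, last_original_style) are illustrative only, and returning None when nothing matches is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": list("abcdefghijk"),
                   "B": [1, 3, 4, 5, 6, 7, 6, 5, 8, 5, 5]})
threshold = 6

# Boolean mask of the rows that meet the condition.
mask = df["B"].ge(threshold)

# NumPy route: positions of all matching rows, then take the last one.
positions = np.flatnonzero(mask.to_numpy())
last_numpy = df["A"].iloc[positions[-1]] if positions.size else None

# pandas route with a guard: idxmax on an all-False mask would return
# the first label of the reversed Series, so check mask.any() first.
last_pandas = df.loc[mask.iloc[::-1].idxmax(), "A"] if mask.any() else None

# Closest change to the question's own code: [-1, 0] instead of [0, 0].
# This line raises IndexError if nothing matches.
array = df.values
last_original_style = array[np.where(array[:, 1] >= threshold)][-1, 0]

print(last_numpy, last_pandas, last_original_style)
```

With a threshold that nothing reaches (for example 100), last_numpy and last_pandas both come back as None instead of a misleading row, while the [-1, 0] line fails on the empty selection.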