pythonpandasdataframegroup-bydelete-row

delete some rows from a dataframe according to conditions


I have a pandas dataframe that contains reviews. And for each review, I have the different words with a specific score as below:

import pandas as pd
df = pd.DataFrame({
    "review_num": [2,2,2,1,1,1,1,1,3,3],
    "review": ["The second review","The second review","The second review",
               "This is the first review","This is the first review",
               "This is the first review","This is the first review",
               "This is the first review",'No Noo', 'No Noo'],
    "token_num":[1,2,3,1,2,3,4,5,1,2],
    "token":["The","second","review","This","is","the","first","review","No","Noo"],
    "score":[0.3,-0.6,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
})
print(df)

I need to delete the lines that present the max score for each review. They are defined by the code below:

token_max_score = df.groupby('review_num',sort=False)['score'].idxmax()

Any idea, please ?


Solution

  • If need remove first maximum values per groups use:

    df1 = df.drop(df.groupby('review_num',sort=False)['score'].idxmax())
    print (df1)
       review_num                    review  token_num   token  score
    0           2         The second review          1     The    0.3
    1           2         The second review          2  second   -0.6
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.6
    6           1  This is the first review          4   first   -0.6
    7           1  This is the first review          5  review    0.4
    8           3                    No Noo          1      No    0.5
    

    If need remove all maximus per groups here is ouput same:

    df2 = df[df['score'].ne(df.groupby('review_num',sort=False)['score'].transform('max'))]
    print (df2)
       review_num                    review  token_num   token  score
    0           2         The second review          1     The    0.3
    1           2         The second review          2  second   -0.6
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.6
    6           1  This is the first review          4   first   -0.6
    7           1  This is the first review          5  review    0.4
    8           3                    No Noo          1      No    0.5
    

    Difference is if duplicated values:

    df = pd.DataFrame({
        "review_num": [2,2,2,1,1,1,1,1,3,3],
        "review": ["The second review","The second review","The second review",
                   "This is the first review","This is the first review",
                   "This is the first review","This is the first review",
                   "This is the first review",'No Noo', 'No Noo'],
        "token_num":[1,2,3,1,2,3,4,5,1,2],
        "token":["The","second","review","This","is","the","first","review","No","Noo"],
        "score":[0.4,0.4,0.4,0.5,0.6,0.7,-0.6,0.4,0.5,0.6]
    })
    print(df)
       review_num                    review  token_num   token  score
    0           2         The second review          1     The    0.4 <-multiple max per group 2
    1           2         The second review          2  second    0.4 <-multiple max per group 2
    2           2         The second review          3  review    0.4 <-multiple max per group 2
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.6
    5           1  This is the first review          3     the    0.7
    6           1  This is the first review          4   first   -0.6
    7           1  This is the first review          5  review    0.4
    8           3                    No Noo          1      No    0.5
    9           3                    No Noo          2     Noo    0.6
    

    df1 = df.drop(df.groupby('review_num',sort=False)['score'].idxmax())
    print (df1)
       review_num                    review  token_num   token  score
    1           2         The second review          2  second    0.4
    2           2         The second review          3  review    0.4
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.6
    6           1  This is the first review          4   first   -0.6
    7           1  This is the first review          5  review    0.4
    8           3                    No Noo          1      No    0.5
    
    df2 = df[df['score'].ne(df.groupby('review_num',sort=False)['score'].transform('max'))]
    print (df2)
       review_num                    review  token_num   token  score
    3           1  This is the first review          1    This    0.5
    4           1  This is the first review          2      is    0.6
    6           1  This is the first review          4   first   -0.6
    7           1  This is the first review          5  review    0.4
    8           3                    No Noo          1      No    0.5