Tags: pandas, dataframe, weighting

Iteratively remove unsorted rows on a condition until a defined dataframe size is reached


How can I reduce the size of a dataframe one row at a time, based on a condition (partnersCount == 1, removing rows in selectionWeighting order from lowest to highest), until the dataframe reaches a specified size?

Pseudo:

Ideally, I'd like to continue the above, so that:

My current approach is functional, but I'm struggling to believe it is the best way to achieve this. Data/df sizes are always < 2k rows. Perhaps someone can recommend an alternative approach. Note: for reasons I won't go into, applying a sort and removing from the bottom/top up is not an option.

import pandas as pd

# some pretend data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                   'selectionWeighting': [1, 1, .45, .45, .3, .3, .2, .2, .2, .2, .1, .1, 0, 0, 0, 1, 1, .45, .3, .3],
                   'partnerID': [1, 1, 4, 4, 3, 9, 2, 2, 11, 2, 1, 1, 0, 7, 0, 1, 1, 4, 3, 1],
                   'partnersCount': [6, 6, 3, 3, 2, 1, 3, 3, 1, 3, 6, 6, 2, 1, 2, 6, 6, 3, 2, 1]})
df.reset_index(drop=True, inplace=True)
df = df.sort_values(['selectionWeighting'], ascending=[False])

# init
# in reality, these are variable and coming from elsewhere.
targetDfSize = 18  # target size changes every time.
currentDfSize = df.shape[0]
difference = max(0, currentDfSize - targetDfSize)

if difference:
    for i in range(difference):
        # rows prioritised for removal: single partners that are not top-weighted
        canRemove = df[(df['partnersCount'] == 1) & (df['selectionWeighting'] != 1)]
        # clean up before we put the removable rows back
        # (NB: this also drops partnersCount==1 rows whose selectionWeighting is 1)
        df = df[df['partnersCount'] > 1]
        if canRemove.shape[0] > 0:
            # so there are some single partners we can remove
            canRemove = canRemove.iloc[:-1]  # drop 1: the last (lowest-weighted) row
            # append the remainder back, then go check if we're still too big
            # (df.append was removed in pandas 2.0, so use pd.concat)
            df = pd.concat([df, canRemove], ignore_index=True)
            print(df.shape[0])
df

Solution

  • I think there are a lot of expensive, and possibly unnecessary, operations in your example.

    If I understand your example correctly, you first remove every partnersCount == 1 row with df = df[df['partnersCount'] > 1] and then append all but one of the canRemove rows back with df = pd.concat([df, canRemove], ignore_index=True). Why not remove only one row at a time, as in the sketch below?
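
    A minimal sketch of that one-row-at-a-time idea (assuming, as in the question, that df is kept sorted by selectionWeighting in descending order, so the last eligible row is always the lowest-weighted one):

    for _ in range(difference):
        # rows eligible for removal: single partners that are not top-weighted
        canRemove = df[(df['partnersCount'] == 1) & (df['selectionWeighting'] != 1)]
        if canRemove.empty:
            break
        # drop just that one (lowest-weighted) row; everything else is untouched
        df = df.drop(canRemove.index[-1])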

    Secondly, if you already know how many rows you need to remove, why loop at all?

    I suggest this implementation:

    ## select all rows such that partnersCount is 1 and selectionWeighting is not 1
    can_remove = df.query("partnersCount == 1 and selectionWeighting != 1")
    
    ## keep only the last `difference` removable rows (the lowest-weighted ones)
    ## (assumes difference > 0: iloc[-0:] would select every row)
    to_remove = can_remove.iloc[-difference:, :]
    
    ## construct a mask that is False for every cell of a row present in to_remove
    df_mask = df.ne(to_remove)
    
    ## new dataframe using the mask: fully-masked rows become NaN and are dropped
    df = df[df_mask].dropna()
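
    Since to_remove keeps its original index labels, an equivalent variant (assuming the index labels are unique and df contains no NaNs that dropna would also remove) is to drop by index instead of masking:

    df = df.drop(to_remove.index)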
    

    For large datasets, I strongly advise you to use query to speed up your code, and to avoid Python for loops with pandas. There are a lot of vectorised pandas operations you can exploit.

    I hope this answers your question.

    Edit: revised to better answer the OP's question and to remove a useless query.