How can I reduce the size of a dataframe, one row at a time, based on a condition (partnersCount == 1, removing rows in order of selectionWeighting from lowest to highest), until the dataframe reaches a specified size?
Pseudo: while the dataframe is larger than the target size, remove the row with the lowest selectionWeighting from among those where partnersCount == 1, then re-check the size.
My current approach is functional, but I struggle to believe it is the best way to achieve this. Data/df sizes are always < 2k rows. Perhaps someone can recommend an alternative approach. Note: for reasons I won't go into, applying a sort and removing rows from the bottom/top is not an option.
import pandas as pd

# some pretend data
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
                   'selectionWeighting': [1, 1, .45, .45, .3, .3, .2, .2, .2, .2, .1, .1, 0, 0, 0, 1, 1, .45, .3, .3],
                   'partnerID': [1, 1, 4, 4, 3, 9, 2, 2, 11, 2, 1, 1, 0, 7, 0, 1, 1, 4, 3, 1],
                   'partnersCount': [6, 6, 3, 3, 2, 1, 3, 3, 1, 3, 6, 6, 2, 1, 2, 6, 6, 3, 2, 1]})
df.reset_index(drop=True, inplace=True)
df = df.sort_values(['selectionWeighting'], ascending=[False])
# init
# in reality, these are variable and coming from elsewhere.
targetDfSize = 18 # target size changes every time.
currentDfSize = df.shape[0]
difference = max(0,currentDfSize - targetDfSize)
if difference:
    for i in range(difference):
        # get those prioritised for removal
        canRemove = df[(df['partnersCount'] == 1) & (df['selectionWeighting'] != 1)]
        # display(canRemove)
        df = df[df['partnersCount'] > 1]  # clean up before we put the removable rows back
        # display(df)
        if canRemove.shape[0] > 0:
            # so there are some single-partner rows we can remove
            canRemove = canRemove.iloc[:-1]  # drop 1: remove the last (lowest-weighted) row
            # append the remainder back, then go check if we're still too big
            # (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
            df = pd.concat([df, canRemove], ignore_index=True)
print(df.shape[0])
df
I think there are a lot of expensive, and possibly unnecessary, operations in your example.
If I understand your example correctly, you first remove all the canRemove rows from df with df = df[df['partnersCount'] > 1], and then append all but one of those rows back with pd.concat. Why not remove only one row at a time?
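For illustration, here is a minimal sketch of that idea (assuming df is still sorted by selectionWeighting in descending order and difference is defined, as in the question): each iteration drops only the single lowest-weighted candidate row instead of rebuilding the whole frame.

# drop one candidate row per iteration instead of rebuilding df each time
for _ in range(difference):
    canRemove = df[(df['partnersCount'] == 1) & (df['selectionWeighting'] != 1)]
    if canRemove.empty:
        break  # nothing left that is allowed to be removed
    # the last row of canRemove is the lowest-weighted candidate,
    # because df is sorted by selectionWeighting in descending order
    df = df.drop(canRemove.index[-1])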
Secondly, if you already know how many rows you need to remove, why loop at all?
I suggest this implementation:
## select all rows such that partnersCount is 1 and selectionWeighting is not 1
can_remove = df.query("partnersCount == 1 and selectionWeighting != 1")
## select only the last `difference` rows you can remove
## (tail(0) is empty, so this also behaves correctly when difference is 0,
## unlike iloc[-difference:], which would select every row)
to_remove = can_remove.tail(difference)
## construct an element-wise mask that is False for every cell of a row
## that appears in to_remove (this assumes df contains no NaN values)
df_mask = df.ne(to_remove)
## new dataframe using the mask: masked cells become NaN, then dropna()
## removes those rows
df = df[df_mask].dropna()
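A note on the last two steps: because the sample dataframe has a unique index, the same result can be obtained by dropping the selected rows by index label, which also avoids the int-to-float upcasting that the NaN-masking step introduces. A minimal sketch, assuming df, difference and can_remove as defined above:

to_remove = can_remove.tail(difference)  # tail(0) is empty, so difference == 0 is safe
df = df.drop(to_remove.index)            # drop by index label instead of masking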
For heavy data, I strongly advise you to use query to speed up your code, and to avoid Python-level for loops with Pandas. Pandas offers a lot of vectorised optimisations that you can exploit.
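For reference, the query expression above is equivalent to the boolean-mask indexing used in the question; query can pay off on large frames (it may use numexpr under the hood), while on small frames like this one the difference is negligible.

# boolean-mask form equivalent to the query above
can_remove = df[(df['partnersCount'] == 1) & (df['selectionWeighting'] != 1)]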
I hope this answers your question.
Edit: reworked to better answer the OP's question and to remove a useless query.