python dataframe vaex

Multi-column filter on a VAEX dataframe: apply an expression and save the result


I want to use VAEX for lazy work with my dataframe. After a quick start (exporting a big CSV, applying some simple filters, and calling extract()), I have an initial df with 3 main columns: cid1, cid2, cval1. Each combination of cid1 and cid2 is a working set of rows whose cval1 values differ. My df contains only valid cid1 and cid2 values. I want to keep only the rows with the minimum cval1 in df and drop the others. cval1 is a float; cid1 and cid2 are ints.

I tried one filter:

df = df.filter(df.cid1 == 36 & df.cid2 == 182 & df.cval1 == df.min(df.cval1))

I expect the result to be a df with only one row, but it does not work properly.
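A probable cause is operator precedence: in Python, `&` binds more tightly than `==`, so without parentheses the condition is read as one chained comparison rather than three combined tests. The sketch below shows this with the standard library only. The bare names `cid1`, `cid2`, `cval1` and `min_cval1` are stand-ins chosen for illustration (the last one replaces `df.min(df.cval1)`); no VAEX call is made.

```python
import ast

# The filter condition from the question, reduced to bare names.
# min_cval1 is an illustrative stand-in for df.min(df.cval1).
unparenthesized = "cid1 == 36 & cid2 == 182 & cval1 == min_cval1"
parenthesized = "(cid1 == 36) & (cid2 == 182) & (cval1 == min_cval1)"


def top_level_node(source: str) -> str:
    """Return the class name of the outermost AST node of an expression."""
    return type(ast.parse(source, mode="eval").body).__name__


print(top_level_node(unparenthesized))  # Compare: one chained comparison
print(top_level_node(parenthesized))    # BinOp: three comparisons joined by &

# The first thing cid1 is compared against is "36 & cid2", not 36.
chain = ast.parse(unparenthesized, mode="eval").body
print(ast.unparse(chain.comparators[0]))  # 36 & cid2
```

Wrapping each comparison in its own parentheses, as in the second string, is the usual way to get three separate tests combined with `&`.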

That is the first problem. Next, I need to find the minimum cval1 for each valid combination of cid1 and cid2.

I have a list of tuples with all the cid1 and cid2 values:

cart_prod=[(2, 5), (3, 9), ...]

I tried:

df_finally = vaex.DataFrame()
for x in cart_prod:
   df2 = df.filter(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1))
   df_finally = vaex.concat([df_finally, df2])

But the filter is not valid, and VAEX cannot concat: it fails with an error saying that DataFrame has no attribute concat, even though I really am calling vaex.concat(list_of_dataframes).

I think maybe I can use:

df.select(df.cid1 == x[0] & df.cid2 == x[1] & df.cval1 == df.min(df.cval1), name = "selection")

But I can't make df take this selection and use it. I also tried:

df = df.filter((df.cid1, df.cid2) in cart_prod)

This code gives no result either.

Hmmm.. Help me please!

How do I select the minimum df.cval1 for each combination of df.cid1 and df.cid2 and save the result to a dataframe in VAEX?

Maybe groupby? But I don't understand how it works.


Solution

  • I've not used VAEX, but the documentation says its groupby syntax is virtually the same as pandas. So, here is what I would do in pandas:

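    import pandas as pd
    
    # df is assumed to be a pandas DataFrame with columns cid1, cid2 and cval1.
    # Broadcast each (cid1, cid2) group's minimum cval1 back onto its rows.
    df["min_cval1"] = df.groupby(["cid1", "cid2"])["cval1"].transform("min")
    

    Then filter your df wherever cval1 == min_cval1.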

    EDIT: As per OP's comment, the pandas solution above works but fails in VAEX. So, based on the VAEX docs, I believe this would work there:

    df.groupby(by=['cid1', 'cid2']).agg({'cval1': 'min'})
    

    PS: I haven't installed VAEX, so if this doesn't work and you figure out the change needed, feel free to suggest an edit.
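To make the intended result concrete without relying on either library, here is a minimal plain-Python sketch of the same two-step logic: compute the minimum cval1 per (cid1, cid2) pair, then keep only the rows that match it. The sample rows are made up for illustration, and this is not VAEX code; it only pins down what a correct groupby-based solution should return.

```python
# Hypothetical sample rows in the form (cid1, cid2, cval1).
rows = [
    (2, 5, 0.7),
    (2, 5, 0.3),
    (3, 9, 1.5),
    (3, 9, 1.2),
    (3, 9, 4.0),
]

# Pass 1: find the smallest cval1 for each (cid1, cid2) pair.
min_by_pair = {}
for cid1, cid2, cval1 in rows:
    key = (cid1, cid2)
    if key not in min_by_pair or cval1 < min_by_pair[key]:
        min_by_pair[key] = cval1

# Pass 2: keep only the rows whose cval1 equals their pair's minimum.
kept = [row for row in rows if row[2] == min_by_pair[(row[0], row[1])]]

print(kept)  # [(2, 5, 0.3), (3, 9, 1.2)]
```

If several rows in a pair share the minimum value, this keeps all of them, which is also what filtering on cval1 == min_cval1 does in the pandas version.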