python vaex

Unexpected Output in Vaex Function


I have the following Vaex function I am trying to make:

@vaex.register_function(on_expression=True)
def getSumStatsByGroup(df, group, x):
    data = (df.groupby(by=group, agg={'Min' : vaex.agg.min(df[x]), 'Mean' : vaex.agg.mean(df[x]), 'Max' : vaex.agg.max(df[x]),
                              'Variance' : vaex.agg.var(df[x])}))
    return data

However, every time I run it I get really messy output resembling this:

 File <unknown>:2
    0           AR              2020-12-06 00:00:00.000000000  AR              Argentina       AR                    ARG                   0                    2176.0           150.0           1489103.0               43125.0                3699476.0            nan                       nan                              nan                             nan                                    nan                               nan                                      44938712.0    19523766.0         20593330.0           3599141.0           41339571.0          16.515                0.825                    

However, when I fill in the parameters manually:

df.groupby(by='country_name', agg={'Min' : vaex.agg.min(df['new_confirmed']), 'Mean' : vaex.agg.mean(df['new_confirmed']), 'Max' : vaex.agg.max(df['new_confirmed']),
                              'Variance' : vaex.agg.var(df['new_confirmed'])})

the output is as expected. I have tried converting the return value to a pandas dataframe, calling print() on it, changing the decorator to on_expression=False, and removing the return keyword from the function, but each time I get exactly the same result. I am running this in a Jupyter notebook and am very confused about why it works when I fill in the parameters manually but not with the Vaex function. Any help or explanation is greatly appreciated!


Solution

  • I think you misunderstand how the @register_function decorator works and its intended use.

    The decorator registers a function that is applied across the rows of the dataframe. The expected arguments are one or more columns/expressions, or constants. The function is then evaluated on that data and yields a result for each row. For each row it should produce a single value (a number, a string, perhaps even a list or a numpy array; I think some of those structures are supported by vaex). Basically, the output should be a vaex expression, and groupby does not fit that because its output is a dataframe.

    This is useful because vaex will run this out-of-core, and in parallel, so you get some speedup. In some way it is similar to apply, but the idea is that for a particular project you can build your own in-house extensions to fit exactly your needs.

    I hope my explanation makes some sense. I think @register_function should be better documented in the vaex docs. In any case, here is a link to the docs.

    Perhaps for your use case (if I understood it correctly) you might want to take a look at the custom dataframe accessor feature.