python pandas vaex

Date distribution histogram in Vaex


I'm trying to convert the answer from here into Vaex so I can plot a bar graph/histogram of dates from a dataframe. I tried various operations after the groupby, such as .sum(), but can't get it working. Is there a better way to accomplish this in Vaex?


Solution

  • You can use vaex.agg.sum() (or vaex.agg.count() if you are counting rows). Here is an example with fictitious sales data. Note that I use pandas only to create a CSV file to be read with vaex:

    import pandas as pd
    import numpy as np
    import vaex
    
    np.random.seed(0)
    dates = pd.date_range('20230101', periods=60) 
    data = {
        'date': np.random.choice(dates, 500),
        'product_id': np.random.choice(['A', 'B', 'C'], 500),
        'quantity': np.random.randint(1, 10, 500),
        'price_per_unit': np.random.uniform(10, 50, 500)
    }
    pdf = pd.DataFrame(data)
    
    csv_file_path = 'sample_sales_data.csv'
    pdf.to_csv(csv_file_path, index=False)
    
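    # Read the CSV with vaex; extra keyword arguments such as parse_dates go to the pandas CSV reader
    df = vaex.from_csv(csv_file_path, parse_dates=['date'])
    df['total_sales'] = df['quantity'] * df['price_per_unit']
    # Bucket each date into a 'YYYY-MM' string so it can serve as a groupby key
    df['year_month'] = df.date.dt.strftime('%Y-%m')
    # Sum the sales per product and per month
    result_product = df.groupby('product_id', agg={'total_sales_sum': vaex.agg.sum(df['total_sales'])})
    result_month = df.groupby('year_month', agg={'total_sales_sum': vaex.agg.sum(df['total_sales'])})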
    
    result_product_df = result_product.to_pandas_df()
    result_month_df = result_month.to_pandas_df()
    
    result_product_df, result_month_df
    
    

    which gives:

    (  product_id  total_sales_sum
     0          B     23406.541203
     1          A     23120.765300
     2          C     24332.454628,
       year_month  total_sales_sum
     0    2023-02     33218.240290
     1    2023-01     36190.503868
     2    2023-03      1451.016974)