I'm trying to convert the answer from here into Vaex so I can plot a bar graph/histogram of dates from a dataframe. I tried different operations after the groupby, like .sum() etc., but can't manage to get it working. Is there a better way of accomplishing this in Vaex?
You can use agg.sum() (or agg.count() if you are counting). Here is an example with fictitious sales data. Note that I use pandas only to create a CSV file to be read with vaex:
import pandas as pd
import numpy as np
import vaex

# Build reproducible sample sales data (pandas is used only to write the CSV)
np.random.seed(0)
dates = pd.date_range('20230101', periods=60)
data = {
    'date': np.random.choice(dates, 500),
    'product_id': np.random.choice(['A', 'B', 'C'], 500),
    'quantity': np.random.randint(1, 10, 500),
    'price_per_unit': np.random.uniform(10, 50, 500)
}
pdf = pd.DataFrame(data)
csv_file_path = 'sample_sales_data.csv'
pdf.to_csv(csv_file_path, index=False)

# Read the CSV with vaex, parsing the date column as datetimes
df = vaex.from_csv(csv_file_path, parse_dates=['date'])

# Virtual columns: sales per row and a year-month key to group on
df['total_sales'] = df['quantity'] * df['price_per_unit']
df['year_month'] = df.date.dt.strftime('%Y-%m')

# Sum total sales per product and per month
result_product = df.groupby('product_id', agg={'total_sales_sum': vaex.agg.sum(df['total_sales'])})
result_month = df.groupby('year_month', agg={'total_sales_sum': vaex.agg.sum(df['total_sales'])})

# Convert the small aggregated results to pandas for display
result_product_df = result_product.to_pandas_df()
result_month_df = result_month.to_pandas_df()
result_product_df, result_month_df
which gives
(  product_id  total_sales_sum
 0          B     23406.541203
 1          A     23120.765300
 2          C     24332.454628,
   year_month  total_sales_sum
 0    2023-02     33218.240290
 1    2023-01     36190.503868
 2    2023-03      1451.016974)