python pandas dataframe group-by

Pandas time series DataFrame: take random samples by group (date), ignoring missing dates


I have a time series dataframe and would like to take x random samples from the "temperature" column for each day. I am able to do this with:

daily_groups = df.groupby([pd.Grouper(key='time', freq='D')])['temperature'].apply(lambda x: x.sample(10))

This works if there are at least x samples for each day. If there are not, I get the error "a must be greater than 0 unless no samples are taken". What I would like is this: say I want 10 samples. Get 10 if available; if not, get as many as possible; and if there are none, skip that day. I don't want to upsample the data.

Also, I don't know if it's possible to return the original dataframe filtered down to the sampled rows described above, instead of the groupby Series object. Thanks for any help.


Solution

  • If you want to randomly sample up to 10 values per day from the "temperature" column, while also handling days with fewer than 10 entries, here's my suggestion on how to do it.

    This code checks how many rows there are per day. If there are fewer than 10, it takes as many as are available. If there are zero, it skips that day entirely. It also gives you back rows of the original DataFrame (all columns), not the Series produced by the groupby.

    import pandas as pd
    
    def sample_temperature(group, n=10):
        if len(group) == 0:
            return pd.DataFrame()  # return empty if there is nothing in the group
        return group.sample(min(len(group), n))
    
    # Make sure the 'time' column is in datetime format
    df['time'] = pd.to_datetime(df['time'])
    
    # Group by day and sample
    sampled_df = (
        df.groupby(pd.Grouper(key='time', freq='D'))
        .apply(lambda g: sample_temperature(g, n=10))
        .reset_index(drop=True)
    )
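
For comparison, here is a minimal sketch of a different technique that reaches the same goal without `apply`: shuffle all rows once, then keep the first n rows of each day with `groupby(...).head(n)`. It is not the method from the answer above, only an alternative under stated assumptions. The function name `sample_per_day`, its parameters `time_col` and `random_state`, and the synthetic data are all hypothetical and introduced purely for illustration; the sketch assumes the "time" column is already a datetime column.

```python
import pandas as pd

def sample_per_day(df, n=10, time_col="time", random_state=None):
    """Return up to n randomly chosen rows of df for each calendar day.

    Hypothetical helper: shuffle once, then take the first n rows per day.
    """
    # Shuffling every row makes "first n per day" a random draw without replacement.
    shuffled = df.sample(frac=1, random_state=random_state)
    # Group on the calendar day of each row; a day with no rows never forms a group.
    day = shuffled[time_col].dt.floor("D")
    # head(n) keeps at most n rows per group, so short days come back in full.
    # sort_index() restores the original row order.
    return shuffled.groupby(day).head(n).sort_index()

def hourly(day, count):
    """Build `count` hourly timestamps starting at midnight of `day` (made-up data)."""
    return list(pd.Timestamp(day) + pd.to_timedelta(list(range(count)), unit="h"))

# Synthetic data: 15 readings on Jan 1, 3 on Jan 2, none on Jan 3, 12 on Jan 4.
times = hourly("2024-01-01", 15) + hourly("2024-01-02", 3) + hourly("2024-01-04", 12)
df = pd.DataFrame({"time": times, "temperature": range(len(times))})

sampled = sample_per_day(df, n=10, random_state=0)

# Rows kept per day: capped at 10, fewer where fewer exist, no entry for Jan 3.
counts = sampled.groupby(sampled["time"].dt.floor("D")).size()
print(counts)
```

Because the result is a subset of the original rows, `sampled` keeps the original index and every column, so it can be used directly as the filtered DataFrame asked for in the question.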