pythonpandasdataframe

How to get data hourly, daily and monthly


I have data from https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/sensor-readings-with-temperature-light-humidity-every-5-minutes-at-8-locations-t/exports/csv?lang=en&timezone=Europe%2FHelsinki&use_labels=true&delimiter=%2C

The dataframe looks like this: enter image description here

**I have converted the timestamp to datatime64 **

#  Inspect the data
print(df['timestamp'].head(5))
print(df['timestamp'].dtype)

# Attempt conversion with specific format + Handle potential timezone issues
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S%z', utc=True)
print(df['timestamp'].dtype)

# Check for missing or invalid data
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
print(df['timestamp'].dtype)

#  Convert to string representation
df['timestamp_str'] = df['timestamp'].astype(str)
df['timestamp'] = pd.to_datetime(df['timestamp_str'])
print(df['timestamp'].dtype)

Task

I need to calculate Median, SD and Dropout rate, without using Pandas function like median() or std() (it is stupid but its a homework). I think I need all data groupby according to hourly, daily and monthly.

How do I get the all data for hourly, daily and monthly?

I have tried the below:

every_freq = df.groupby(df['timestamp'].dt.floor(freq)).first 

Only the first row of each hour and day will be kept.


Solution

  • If you want to group your data by hours, days, or months using groupby, you can try groupby with pd.Grouper. This allows you to maintain all your data within each group, rather than just keeping the first row like you would with .first().

    # Grouping by hourly frequency
    hourly_groups = df.groupby(pd.Grouper(key='timestamp', freq='H'))
    
    # Grouping by daily frequency
    daily_groups = df.groupby(pd.Grouper(key='timestamp', freq='D'))
    
    # Grouping by monthly frequency
    monthly_groups = df.groupby(pd.Grouper(key='timestamp', freq='M'))
    

    An alternative would be to use resample which is purposely designed for working with time series data, instead of groupby which is more general-purpose.

    hourly_data = df.resample('h', on='timestamp')
    daily_data = df.resample('D', on='timestamp')
    monthly_data = df.resample('ME', on='timestamp')
    

    The code was tested on Pandas version 2.2.2.