The dataframe looks like this: enter image description here
**I have converted the timestamp to datatime64 **
# Inspect the data
print(df['timestamp'].head(5))
print(df['timestamp'].dtype)
# Attempt conversion with specific format + Handle potential timezone issues
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S%z', utc=True)
print(df['timestamp'].dtype)
# Check for missing or invalid data
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
print(df['timestamp'].dtype)
# Convert to string representation
df['timestamp_str'] = df['timestamp'].astype(str)
df['timestamp'] = pd.to_datetime(df['timestamp_str'])
print(df['timestamp'].dtype)
Task
I need to calculate Median, SD and Dropout rate, without using Pandas function like median()
or std()
(it is stupid but its a homework). I think I need all data groupby
according to hourly, daily and monthly.
How do I get the all data for hourly, daily and monthly?
I have tried the below:
every_freq = df.groupby(df['timestamp'].dt.floor(freq)).first
Only the first row of each hour and day will be kept.
If you want to group your data by hours, days, or months using groupby
, you can try groupby
with pd.Grouper
. This allows you to maintain all your data within each group, rather than just keeping the first row like you would with .first()
.
# Grouping by hourly frequency
hourly_groups = df.groupby(pd.Grouper(key='timestamp', freq='H'))
# Grouping by daily frequency
daily_groups = df.groupby(pd.Grouper(key='timestamp', freq='D'))
# Grouping by monthly frequency
monthly_groups = df.groupby(pd.Grouper(key='timestamp', freq='M'))
An alternative would be to use resample
which is purposely designed for working with time series data, instead of groupby
which is more general-purpose.
hourly_data = df.resample('h', on='timestamp')
daily_data = df.resample('D', on='timestamp')
monthly_data = df.resample('ME', on='timestamp')
The code was tested on Pandas version 2.2.2.