I’m working with a time-series dataset where each record is supposed to be logged at 1-minute intervals.
However, due to data quality issues, the dataset contains:
- duplicated timestamps
- missing timestamps
- irregular gaps (e.g., jumps of 5–10 minutes)
- out-of-order rows
These issues cause problems when I resample or build forecasting models.
Here’s the code I am using right now:
```python
import pandas as pd

df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")

# Check duplicates
duplicates = df[df["timestamp"].duplicated()]

# Check gaps
df["diff"] = df["timestamp"].diff()
print(df["diff"].value_counts())
```
This helps me identify some issues, but I want a more systematic and scalable solution.
My questions:

1. What’s the best way to detect missing timestamps and automatically fill or interpolate them?
2. How can I handle out-of-order rows or irregular intervals efficiently for large time-series datasets?
3. Are there Python libraries (e.g., tsfresh, river, statsmodels, or pandas built-ins) that help with automated time-series data quality validation?
Here’s a single pipeline that addresses all of these issues with pandas built-ins:

```python
import pandas as pd

df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])

# 1. Sort + set index
df = df.sort_values("timestamp").set_index("timestamp")

# 2. Fix duplicates (either drop or aggregate)
df = df[~df.index.duplicated(keep="first")]
# or: df = df.groupby(level=0).mean()

# 3. Put the data on a strict 1-minute grid
# ("1min" replaces the deprecated "1T" alias as of pandas 2.2)
df_full = df.asfreq("1min")  # inserts missing timestamps as NaN rows

# 4. See which timestamps were missing
missing_ts = df_full[df_full.isna().all(axis=1)].index

# 5. Flag imputed rows first, then fill or interpolate
was_imputed = df_full.isna().any(axis=1)
df_full = df_full.interpolate(method="time")  # or .ffill()
df_full["was_imputed"] = was_imputed
```
Now:

- no out-of-order rows (we sorted),
- no duplicates (we handled them),
- every minute exists (`asfreq`),
- you explicitly see and control how gaps are filled.
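To confirm those invariants hold, here is a quick sanity check, assuming `df` and `df_full` from the snippet above; the `date_range` comparison is an independent way to list the missing timestamps:

```python
# Verify the cleaned frame is sorted, unique, and on a strict 1-minute grid
assert df_full.index.is_monotonic_increasing
assert not df_full.index.duplicated().any()
assert (df_full.index.to_series().diff().dropna() == pd.Timedelta(minutes=1)).all()

# Independent check: compare the raw index against the full expected grid
expected = pd.date_range(df.index.min(), df.index.max(), freq="1min")
print(expected.difference(df.index))  # should match missing_ts
```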
For question 2 (out-of-order rows and irregular intervals at scale): always work with a sorted `DatetimeIndex` and stick to vectorized operations:

```python
df = df.sort_values("timestamp").set_index("timestamp")
gaps = df.index.to_series().diff()
# Exclude the first row, whose diff is NaT and would otherwise be flagged
bad_gaps = df[gaps.notna() & (gaps != pd.Timedelta(minutes=1))]
```
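As a follow-up, the same `gaps` series gives you the distribution of irregular jumps:

```python
# How large are the gaps, and how often does each size occur?
gap_summary = gaps[gaps > pd.Timedelta(minutes=1)].value_counts().sort_index()
print(gap_summary)
```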
If the file is huge: load in chunks, clean/sort per chunk, then concatenate and run the same `asfreq("1min")` logic once, as sketched below.
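A minimal sketch of that chunked flow; the chunk size is a placeholder to tune for your memory budget:

```python
import pandas as pd

chunks = []
for chunk in pd.read_csv("sensor.csv", parse_dates=["timestamp"], chunksize=1_000_000):
    # Sorting per chunk keeps the final global sort cheap
    chunks.append(chunk.sort_values("timestamp"))

df = pd.concat(chunks).sort_values("timestamp").set_index("timestamp")
df = df[~df.index.duplicated(keep="first")]
df_full = df.asfreq("1min")  # same strict-grid step as before, run once
```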
For question 3:

- `pandas`: the main workhorse (`asfreq`, `reindex`, `diff`, `resample`).
- `pandera`: define a schema (no duplicates, monotonic time, allowed missing rate) and call `schema.validate(df_full)`; see the sketch after this list.
- `great_expectations`: similar idea, more “data pipeline” oriented.
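A minimal `pandera` sketch, assuming a single numeric `value` column and a 20% allowed missing rate (both are placeholders for your actual data):

```python
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        # nullable=True because asfreq may have inserted NaN rows
        "value": pa.Column(float, nullable=True),
    },
    index=pa.Index("datetime64[ns]", unique=True),  # no duplicate timestamps
    checks=[
        pa.Check(lambda df: df.index.is_monotonic_increasing,
                 error="timestamps out of order"),
        pa.Check(lambda df: df["value"].isna().mean() <= 0.2,
                 error="too many missing values"),
    ],
)

schema.validate(df_full)  # raises SchemaError on violations
```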