I’m working with a time-series dataset where each record is supposed to be logged at 1-minute intervals.
However, due to data quality issues, the dataset contains:
- duplicated timestamps
- missing timestamps
- irregular gaps (e.g., jumps of 5–10 minutes)
- out-of-order rows
These issues cause problems when I resample or build forecasting models.
Here’s the code I am using right now:
```python
import pandas as pd

df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")

# Check duplicates
duplicates = df[df["timestamp"].duplicated()]

# Check gaps
df["diff"] = df["timestamp"].diff()
print(df["diff"].value_counts())
```
This helps me identify some issues, but I want a more systematic and scalable solution.
My questions:

1. What’s the best way to detect missing timestamps and automatically fill or interpolate them?
2. How can I handle out-of-order rows or irregular intervals efficiently for large time-series datasets?
3. Are there Python libraries (e.g., tsfresh, river, statsmodels, or pandas built-ins) that help with automated time-series data quality validation?
Here’s a single pipeline that addresses all of these issues with pandas built-ins:

```python
import pandas as pd

df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])

# 1. Sort + set index
df = df.sort_values("timestamp").set_index("timestamp")

# 2. Fix duplicates (either drop or aggregate)
df = df[~df.index.duplicated(keep="first")]
# or: df = df.groupby(level=0).mean()

# 3. Put the data on a strict 1-minute grid
# ("1min" replaces the deprecated "1T" alias as of pandas 2.2)
df_full = df.asfreq("1min")  # inserts missing timestamps as NaN rows

# 4. See which timestamps were missing
missing_ts = df_full[df_full.isna().all(axis=1)].index

# 5. Flag imputed rows first, then fill or interpolate
was_imputed = df_full.isna().any(axis=1)
df_full = df_full.interpolate(method="time")  # or .ffill()
df_full["was_imputed"] = was_imputed
```
Now:

- no out-of-order rows (we sorted),
- no duplicates (we handled them),
- every minute exists (`asfreq`),
- you explicitly see and control how gaps are filled.
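To confirm those invariants hold, here is a quick sanity check, assuming `df` and `df_full` from the snippet above; the `date_range` comparison is an independent way to list the missing timestamps:

```python
# Verify the cleaned frame is sorted, unique, and on a strict 1-minute grid
assert df_full.index.is_monotonic_increasing
assert not df_full.index.duplicated().any()
assert (df_full.index.to_series().diff().dropna() == pd.Timedelta(minutes=1)).all()

# Independent check: compare the raw index against the full expected grid
expected = pd.date_range(df.index.min(), df.index.max(), freq="1min")
print(expected.difference(df.index))  # should match missing_ts
```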
For question 2 (out-of-order rows and irregular intervals at scale): always work with a sorted `DatetimeIndex` and stick to vectorized operations:

```python
df = df.sort_values("timestamp").set_index("timestamp")
gaps = df.index.to_series().diff()
# Exclude the first row, whose diff is NaT and would otherwise be flagged
bad_gaps = df[gaps.notna() & (gaps != pd.Timedelta(minutes=1))]
```
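As a follow-up, the same `gaps` series gives you the distribution of irregular jumps:

```python
# How large are the gaps, and how often does each size occur?
gap_summary = gaps[gaps > pd.Timedelta(minutes=1)].value_counts().sort_index()
print(gap_summary)
```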
If the file is huge: load in chunks, clean/sort per chunk, then concatenate and run the same `asfreq("1min")` logic once, as sketched below.
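A minimal sketch of that chunked flow; the chunk size is a placeholder to tune for your memory budget:

```python
import pandas as pd

chunks = []
for chunk in pd.read_csv("sensor.csv", parse_dates=["timestamp"], chunksize=1_000_000):
    # Sorting per chunk keeps the final global sort cheap
    chunks.append(chunk.sort_values("timestamp"))

df = pd.concat(chunks).sort_values("timestamp").set_index("timestamp")
df = df[~df.index.duplicated(keep="first")]
df_full = df.asfreq("1min")  # same strict-grid step as before, run once
```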
For question 3:

- `pandas`: the main workhorse (`asfreq`, `reindex`, `diff`, `resample`).
- `pandera`: define a schema (no duplicates, monotonic time, allowed missing rate) and call `schema.validate(df_full)`; see the sketch after this list.
- `great_expectations`: similar idea, more “data pipeline” oriented.
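A minimal `pandera` sketch, assuming a single numeric `value` column and a 20% allowed missing rate (both are placeholders for your actual data):

```python
import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        # nullable=True because asfreq may have inserted NaN rows
        "value": pa.Column(float, nullable=True),
    },
    index=pa.Index("datetime64[ns]", unique=True),  # no duplicate timestamps
    checks=[
        pa.Check(lambda df: df.index.is_monotonic_increasing,
                 error="timestamps out of order"),
        pa.Check(lambda df: df["value"].isna().mean() <= 0.2,
                 error="too many missing values"),
    ],
)

schema.validate(df_full)  # raises SchemaError on violations
```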