time-series, data-science, data-cleaning, data-quality

How to detect and remove inconsistent timestamps in a time-series dataset?


I’m working with a time-series dataset where each record is supposed to be logged at 1-minute intervals.
However, due to data quality issues, the dataset contains:

  - duplicate timestamps
  - missing timestamps (gaps in the 1-minute grid)
  - out-of-order records with irregular intervals

These issues cause problems when I resample or build forecasting models.

Here’s the code I am using right now:
import pandas as pd

df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])
df = df.sort_values("timestamp")

# Check duplicates
duplicates = df[df["timestamp"].duplicated()]

# Check gaps
df["diff"] = df["timestamp"].diff()
print(df["diff"].value_counts())

This helps me identify some issues, but I want a more systematic and scalable solution.

My questions:

  1. What’s the best way to detect missing timestamps and automatically fill or interpolate them?

  2. How can I handle out-of-order or irregular intervals efficiently for large time-series datasets?

  3. Are there Python libraries (e.g., tsfresh, river, statsmodels, or pandas built-ins) that help with automated time-series data quality validation?


Solution

    1. Detect missing timestamps and fill them
    import pandas as pd
    
    df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])
    
    # 1. Sort + set index
    df = df.sort_values("timestamp").set_index("timestamp")
    
    # 2. Fix duplicates (either drop or aggregate)
    df = df[~df.index.duplicated(keep="first")]
    # or: df = df.groupby(level=0).mean()   # aggregates duplicates (numeric columns only)
    
    # 3. Put data on a strict 1-minute grid
    df_full = df.asfreq("1T")       # inserts missing timestamps as NaN rows
    
    # 4. See which timestamps were missing
    missing_ts = df_full[df_full.isna().all(axis=1)].index
    
    # 5. Fill or interpolate, flagging imputed rows first
    imputed_mask = df_full.isna().any(axis=1)
    df_full = df_full.interpolate(method="time")   # or .ffill()
    df_full["was_imputed"] = imputed_mask
    

    Now `missing_ts` lists every timestamp that was absent from the raw file, and `was_imputed` flags the rows whose values were filled in, so downstream models can discount them.
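    As a quick sanity check after the pipeline runs (names as defined above):

    print(f"{len(missing_ts)} missing minutes out of {len(df_full)} rows on the grid")
    print(df_full["was_imputed"].mean())   # fraction of rows that were imputed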

    2. Handling irregular intervals for large data
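    Sorting once fixes out-of-order records in a single O(n log n) pass, and a tolerance-based reindex snaps readings that are slightly off the grid (e.g. 12:00:02) onto the nearest minute without inventing data; anything farther than the tolerance becomes NaN and flows into step 5 above. A minimal sketch, assuming the same sensor.csv schema as before and a 30-second snapping tolerance (both are assumptions, not part of your setup):

    import pandas as pd

    df = pd.read_csv("sensor.csv", parse_dates=["timestamp"])
    df = (
        df.sort_values("timestamp")              # fix out-of-order records
          .drop_duplicates(subset="timestamp")   # keep the first reading per timestamp
          .set_index("timestamp")
    )

    # Build the strict 1-minute grid spanning the data
    full_index = pd.date_range(df.index.min().floor("1min"),
                               df.index.max().ceil("1min"),
                               freq="1min")

    # Snap near-grid readings onto the grid; readings more than 30s
    # from any grid point become NaN instead of being misplaced.
    df_grid = df.reindex(full_index, method="nearest", tolerance=pd.Timedelta("30s"))

    If the CSV does not fit in memory, pd.read_csv(..., chunksize=...) lets you run the same sort/deduplicate logic per chunk and concatenate the results, or you can convert to a columnar format like Parquet before resampling.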
    3. Libraries for automated checks
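    Of the libraries you list, none is primarily a data-quality validator: tsfresh does feature extraction, river does online learning, and statsmodels assumes you already have a regular index. For schema-style validation there are dedicated dataframe-validation libraries such as pandera and Great Expectations, but plain pandas covers the timestamp checks here. A minimal sketch (timestamp_quality_report is a hypothetical helper, not a library function):

    import pandas as pd

    def timestamp_quality_report(ts: pd.Series, freq: str = "1min") -> dict:
        """Summarise common timestamp problems in one pass (plain pandas)."""
        ts = pd.to_datetime(ts)
        diffs = ts.diff().dropna()
        expected = pd.Timedelta(freq)
        can_infer = ts.is_monotonic_increasing and ts.is_unique and len(ts) >= 3
        return {
            "n_rows": len(ts),
            "n_duplicates": int(ts.duplicated().sum()),
            "is_sorted": bool(ts.is_monotonic_increasing),
            "n_gaps": int((diffs > expected).sum()),        # intervals longer than expected
            "n_irregular": int((diffs != expected).sum()),  # any deviation from the grid
            "inferred_freq": pd.infer_freq(ts) if can_infer else None,
        }

    report = timestamp_quality_report(pd.read_csv("sensor.csv")["timestamp"])
    print(report)

    Running a report like this on every new file before it enters the pipeline, and alerting when n_duplicates or n_irregular is non-zero, gives you the systematic check you are after without pulling in a heavier dependency.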