pythonpandasnumpy

Obtain Missing Years of Vehicle's Timeline


I have a 'pandas' data-frame containing timeline year recorded observations for vehicles. The data is organised in a particular way which is explained more below: A sample of such a data-frame for one vehicle is shown here:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"Vehicle Type": ["truck", "truck", "truck", "truck", "truck"],
  "Vehicle ID": ["XYZ", "XYZ", "XYZ", "XYZ", "XYZ"],
  "Year": [1, 2, 3, 4, 5],
  "Earliest Fact": [pd.NaT, pd.NaT, "2018-04-18", "2019-01-02", "2020-01-02"],
  "Latest Fact": [pd.NaT, pd.NaT, "2019-01-01", "2020-01-01", "2020-12-31"],
  "Fact History": [np.nan, np.nan, 11.5, 11.7, 5],
  "Days Worked": [np.nan, np.nan, 234, 256, 43],
  "Days Available": [np.nan,  np.nan, 260, 272, 57]
})

df[["Earliest Fact", "Latest Fact"]] = 
df[["Earliest Fact", "Latest Fact"]].apply(pd.to_datetime, errors="coerce")

For this particular vehicle, it has history beginning in its 3rd year up till its 5th year - you'll notice this because the value for Fact History is less than 12 (less than 12 months worth of history). Its 5th year corresponds to its current year where data is still being recorded for it.

It has no history in years 1 and 2.

Other vehicles in my data-frame are in similar situations whereby they will have missing history in some years.

Another situation will be something like this:

df2 = pd.DataFrame({"Vehicle Type": ["van", "van", "van", "van", "van", "van", "van"],
  "Vehicle ID": ["ABC", "ABC", "ABC", "ABC", "ABC", "ABC", "ABC"],
  "Year": [1, 2, 3, 4, 5, 6, 7],
  "Earliest Fact": [pd.NaT, pd.NaT, "2018-04-18", "2019-01-02", "2020-01-02", "2021-01-01",    "2022-01-01"],
  "Latest Fact": [pd.NaT, pd.NaT, "2019-01-01", "2020-01-01", "2020-12-31", "2021-12-31", "2023-01-01"],
  "Fact History": [np.nan, np.nan, 5, 11.7, 12, 12, 5.7],
  "Days Worked": [np.nan, np.nan, 100, 256, 273, 300, 94],
  "Days Available": [np.nan,  np.nan, 130, 272, 290, 320, 141]
})

Here, this vehicle only has a partially recorded year in its first recorded year (its 3rd month, with 5 months worth of history). Its 7th year corresponds to its current year where data is still being recorded for it.

I need some systematic way to obtain the missing years for each vehicles so that I can fill them in with a function I am creating.

I had this solution but the code for it is far too verbose:

# concatenate dfs from above into one data-frame
df = pd.concat([df1, df], axis = "index")

full = df[["Vehicle ID", "Year", "Fact History"]].loc[df["Fact History"].notnull()]

reference = pd.merge(
    full.groupby("Vehicle ID")["Year"].min().reset_index().rename(columns={"Year": "Begins At"}),
    full.groupby("Vehicle ID")["Year"].max().reset_index().rename(columns={"Year": "Current Year"}),
    on="Vehicle ID"
)

reference = pd.merge(
    reference,
    full.rename(columns={"Year": "Begins At", "Fact History": "History Begins"}),
    on=["Vehicle ID", "Begins At"],
    how="left"
)

reference["Missing Years"] = reference.apply(
    lambda row: ", ".join(map(str, range(1, row["Begins At"] + 1))) 
    if row["History Begins"] < 10 
    else np.nan,
    axis=1
)

so the missing year values should correspond to where:

the missing year values CANNOT correspond to the vehicle's current year where there will obviously be only partially recorded history

If there is a smoother way to do this, I would appreciate the help


Solution

  • I was a little unclear on exactly what you want to achieve, but if I understand correctly then this might help:

    # Because I see 11.7 in the "Fact History" column, this cutoff is used to decide if a month was partial or not
    MINIMUM_MONTHS_WORKED_IN_FULL_YEAR = 11
    
    df_worked_years = df2.query("not `Fact History`.isna()")
    row_earliest_year = df_worked_years.query("`Year` == `Year`.min()").iloc[0]
    earliest_year_is_partial_year = bool(row_earliest_year["Fact History"] < MINIMUM_MONTHS_WORKED_IN_FULL_YEAR)
    
    earliest_year = row_earliest_year["Year"]
    year_cutoff = earliest_year - (0 if earliest_year_is_partial_year else 1)
    all_years = df2["Year"].tolist()
    incomplete_years = [year for year in all_years if year <= year_cutoff]
    
    print(incomplete_years)
    

    Basically: