python pandas pint

DataFrame zeros out quantities when a datetime is used for an index


Whenever I set my DataFrame index to a list of datetimes, the DataFrame zeros out all my quantities (they show up as NaN).

df = pd.DataFrame(
    {
        "code": pd.Series(mem_code, dtype="pint[byte]"),
        "data": pd.Series(mem_data, dtype="pint[byte]"),
        "heap": pd.Series(mem_heap, dtype="pint[byte]"),
        "stack": pd.Series(mem_stack, dtype="pint[byte]"),
        "total": pd.Series(mem_total, dtype="pint[byte]"),
    },
    index=pd.DatetimeIndex(dt)
)

print(f"df: {df.pint.dequantify()}")
df:                     code data heap stack total
unit                   B    B    B     B     B
2017-01-01 12:00:00  NaN  NaN  NaN   NaN   NaN
2017-01-02 12:00:00  NaN  NaN  NaN   NaN   NaN
2017-01-03 12:00:00  NaN  NaN  NaN   NaN   NaN

I'm new to pandas and pint. My goal here is to use quantities (for units) rather than multiplying manually. In other words, the units should be applied in the view, not baked into the model data.

I've been following examples such as "How can I manage units in pandas data?"

Nothing I've tried can get both the datetime indexes and the bytes quantities to work.

Am I doing something wrong or is pint_pandas still in its early days?


Solution

  • Since you are passing pd.Series objects to the DataFrame constructor, pandas aligns each series' index with the DataFrame's index, so the two have to match. Your series have the default integer index (i.e. [0, 1, 2, ...]), whereas you define a DatetimeIndex as the df's index. None of the labels match, so you get a df full of NaN.

    There are many ways to address this.

    You can, for instance, use a regular RangeIndex (the default behavior)

    df = pd.DataFrame(
        {
            "code": pd.Series(mem_code, dtype="pint[byte]"),
            "data": pd.Series(mem_data, dtype="pint[byte]"),
            "heap": pd.Series(mem_heap, dtype="pint[byte]"),
            "stack": pd.Series(mem_stack, dtype="pint[byte]"),
            "total": pd.Series(mem_total, dtype="pint[byte]"),
        }
    )
    

    and overwrite the index later

    df.index = pd.DatetimeIndex(dt)
    

    You can also use lists in the dict values to avoid index-matching:

    df = pd.DataFrame(
        {
            "code": pd.Series(mem_code, dtype="pint[byte]").tolist(),
            "data": pd.Series(mem_data, dtype="pint[byte]").tolist(),
            "heap": pd.Series(mem_heap, dtype="pint[byte]").tolist(),
            "stack": pd.Series(mem_stack, dtype="pint[byte]").tolist(),
            "total": pd.Series(mem_total, dtype="pint[byte]").tolist(),
        },
        index=pd.DatetimeIndex(dt)
    )
    

    Or you can create all your series with the same index:

    index = pd.DatetimeIndex(dt)
    df = pd.DataFrame(
        {
            "code": pd.Series(mem_code, dtype="pint[byte]", index=index),
            "data": pd.Series(mem_data, dtype="pint[byte]", index=index),
            "heap": pd.Series(mem_heap, dtype="pint[byte]", index=index),
            "stack": pd.Series(mem_stack, dtype="pint[byte]", index=index),
            "total": pd.Series(mem_total, dtype="pint[byte]", index=index),
        },
        index=index
    )
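
The alignment behavior described above does not depend on pint, so it can be reproduced with plain pandas. The following is a minimal sketch, assuming float columns in place of `pint[byte]` and made-up sample values (`mem_code` and `dt` here are hypothetical stand-ins, not the asker's data). It shows the mismatched case producing NaN and two of the fixes keeping the values.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's mem_code and dt.
mem_code = [100.0, 200.0, 300.0]
dt = ["2017-01-01 12:00:00", "2017-01-02 12:00:00", "2017-01-03 12:00:00"]
index = pd.DatetimeIndex(dt)

# The Series carries a default RangeIndex (0, 1, 2). None of those labels
# exist in the DatetimeIndex, so alignment fills every row with NaN.
misaligned = pd.DataFrame({"code": pd.Series(mem_code)}, index=index)

# Building the Series on the same index lets the labels match.
aligned = pd.DataFrame({"code": pd.Series(mem_code, index=index)}, index=index)

# Building with the default index and replacing it afterwards also works:
# assigning to df.index relabels the rows without any alignment.
relabelled = pd.DataFrame({"code": pd.Series(mem_code)})
relabelled.index = index

print(misaligned["code"].isna().all())
print(aligned["code"].tolist())
print(relabelled["code"].tolist())
```

A quick way to spot this problem before building the DataFrame is to compare a series' index against the intended one, for example with `pd.Series(mem_code).index.equals(index)`.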