I am trying to find a way to 'pivot' my pandas dataframe while keeping my index split by sliced date ranges. The end goal is to produce, for each index, a set of date ranges in which every attribute is matched with its value.
I reached the expected output using for loops and other non-vectorized approaches, but I am looking for a vectorized solution since my input dataframe might be quite big.
I am using Python 3.11 and pandas>=2.0.0.
Here is an input example:
index attribute start_date end_date value
0 index_1 attribute_1 2022-01-01 2022-02-01 1
1 index_1 attribute_1 2022-02-01 2023-01-01 2
2 index_1 attribute_2 2022-01-01 2023-01-01 3
3 index_2 attribute_3 2022-01-01 2023-01-01 4
4 index_3 attribute_4 2022-01-01 2023-01-01 5
What I am trying to obtain is this :
index start_date end_date attribute_1 attribute_2 attribute_3 attribute_4
0 index_1 2022-01-01 2022-02-01 1 3 None None
1 index_1 2022-02-01 2023-01-01 2 3 None None
2 index_2 2022-01-01 2023-01-01 None None 4 None
3 index_3 2022-01-01 2023-01-01 None None None 5
Here is a snippet to reproduce the input dataframe:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    "index": ["index_1", "index_1", "index_1", "index_2", "index_3"],
    "attribute": ["attribute_1", "attribute_1", "attribute_2", "attribute_3", "attribute_4"],
    "start_date": [datetime(2022, 1, 1), datetime(2022, 2, 1), datetime(2022, 1, 1), datetime(2022, 1, 1), datetime(2022, 1, 1)],
    "end_date": [datetime(2022, 2, 1), datetime(2023, 1, 1), datetime(2023, 1, 1), datetime(2023, 1, 1), datetime(2023, 1, 1)],
    "value": [1, 2, 3, 4, 5]
})
To begin with, here are two modifications I had to make to your input data for proper processing:
1. Refrain from using index as a variable name, or as a column name in your case: it can conflict with calls to df.index later. To avoid confusion, I have renamed your 'index' column in upper case: 'INDEX'.
2. According to your input, index_1's attribute_1 takes two values on day 2022-02-01: both 1 AND 2 at the same time. That is bound to cause unnecessary problems, so I have made end_date an end of month instead of the first day of the next month, as in subtract one day from a pandas dataframe date column:
# Removing one day from end_dates
df['end_date'] = df['end_date'] - pd.tseries.offsets.Day()
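For completeness, the renaming mentioned in point 1 might look like this (a minimal sketch; upper-casing the values themselves is purely cosmetic, to match the tables printed below):
# Rename the 'index' column so it no longer clashes with df.index
df = df.rename(columns={'index': 'INDEX'})
# Purely cosmetic: upper-case the values to match the printed tables
df['INDEX'] = df['INDEX'].str.replace('index_', 'INDEX_')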
With those two changes applied, the input becomes:
INDEX attribute start_date end_date value
0 INDEX_1 attribute_1 2022-01-01 2022-01-31 1
1 INDEX_1 attribute_1 2022-02-01 2022-12-31 2
2 INDEX_1 attribute_2 2022-01-01 2022-12-31 3
3 INDEX_2 attribute_3 2022-01-01 2022-12-31 4
4 INDEX_3 attribute_4 2022-01-01 2022-12-31 5
Please confirm that those changes are ok with you. Now moving on to the actual solution:
1. Pivot attributes from index to columns.
2. Reshape the dataframe into a time series, as in Convert dataframe with start and end date to daily data, in 2 steps:
   2a. Merge the tuples (start_date, end_date) into time spans,
   2b. Then explode the time spans into full daily time series.
3. Aggregate the time series into the requested form:
   3a. Merge the data corresponding to each day: group by INDEX + days,
   3b. Then recreate the relevant start and end dates: group by INDEX + attributes, aggregating time spans by min-max days, as in Max and Min date in pandas groupby.
df_pivot = df.pivot_table(values='value',
                          columns='attribute',
                          index=['INDEX', 'start_date', 'end_date']).reset_index()
attribute INDEX start_date end_date attribute_1 attribute_2 attribute_3 attribute_4
0 INDEX_1 2022-01-01 2022-01-31 1.0 NaN NaN NaN
1 INDEX_1 2022-01-01 2022-12-31 NaN 3.0 NaN NaN
2 INDEX_1 2022-02-01 2022-12-31 2.0 NaN NaN NaN
3 INDEX_2 2022-01-01 2022-12-31 NaN NaN 4.0 NaN
4 INDEX_3 2022-01-01 2022-12-31 NaN NaN NaN 5.0
df_pivot['timespan'] = df_pivot.apply(lambda row: pd.date_range(row['start_date'], row['end_date']), axis=1)
attribute INDEX start_date end_date attribute_1 attribute_2 attribute_3 attribute_4 timespan
0 INDEX_1 2022-01-01 2022-01-31 1.0 NaN NaN NaN DatetimeIndex(['2022-01-01', '2022-01-02', '20...
1 INDEX_1 2022-01-01 2022-12-31 NaN 3.0 NaN NaN DatetimeIndex(['2022-01-01', '2022-01-02', '20...
2 INDEX_1 2022-02-01 2022-12-31 2.0 NaN NaN NaN DatetimeIndex(['2022-02-01', '2022-02-02', '20...
3 INDEX_2 2022-01-01 2022-12-31 NaN NaN 4.0 NaN DatetimeIndex(['2022-01-01', '2022-01-02', '20...
4 INDEX_3 2022-01-01 2022-12-31 NaN NaN NaN 5.0 DatetimeIndex(['2022-01-01', '2022-01-02', '20...
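Since you asked about speed: if the row-wise apply above turns out to be slow on a big frame, the same timespan column can be built with a plain list comprehension (a sketch; pd.date_range is still called once per row, so this is usually somewhat faster but not fully vectorized):
df_pivot['timespan'] = [pd.date_range(s, e)
                        for s, e in zip(df_pivot['start_date'], df_pivot['end_date'])]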
TS = (df_pivot.explode('timespan', ignore_index=True)
              .drop(columns=['start_date', 'end_date']))
attribute INDEX attribute_1 attribute_2 attribute_3 attribute_4 timespan
0 INDEX_1 1.0 NaN NaN NaN 2022-01-01
1 INDEX_1 1.0 NaN NaN NaN 2022-01-02
2 INDEX_1 1.0 NaN NaN NaN 2022-01-03
3 INDEX_1 1.0 NaN NaN NaN 2022-01-04
4 INDEX_1 1.0 NaN NaN NaN 2022-01-05
... ... ... ... ... ... ...
1455 INDEX_3 NaN NaN NaN 5.0 2022-12-27
1456 INDEX_3 NaN NaN NaN 5.0 2022-12-28
1457 INDEX_3 NaN NaN NaN 5.0 2022-12-29
1458 INDEX_3 NaN NaN NaN 5.0 2022-12-30
1459 INDEX_3 NaN NaN NaN 5.0 2022-12-31
At this point, one day still appears several times within each INDEX. You will see it if you plot it with TS.plot(y='timespan').
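A quick, non-graphical check (a sketch; it simply counts the rows sharing each (INDEX, day) pair):
# Counts greater than 1 mean that a day still appears several times within an INDEX
print(TS.groupby(['INDEX', 'timespan']).size().value_counts())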
TS = (TS.groupby(by=['INDEX', 'timespan'])
        .max()
        .reset_index())
attribute INDEX timespan attribute_1 attribute_2 attribute_3 attribute_4
0 INDEX_1 2022-01-01 1.0 3.0 NaN NaN
1 INDEX_1 2022-01-02 1.0 3.0 NaN NaN
2 INDEX_1 2022-01-03 1.0 3.0 NaN NaN
3 INDEX_1 2022-01-04 1.0 3.0 NaN NaN
4 INDEX_1 2022-01-05 1.0 3.0 NaN NaN
... ... ... ... ... ... ...
1090 INDEX_3 2022-12-27 NaN NaN NaN 5.0
1091 INDEX_3 2022-12-28 NaN NaN NaN 5.0
1092 INDEX_3 2022-12-29 NaN NaN NaN 5.0
1093 INDEX_3 2022-12-30 NaN NaN NaN 5.0
1094 INDEX_3 2022-12-31 NaN NaN NaN 5.0
Now each day appears only once per INDEX.
Choice of agg function: we took care that there is only ever one possible value per day per attribute per INDEX, so any agg function could do. However, aggregating by either max() or min() will preserve the NaNs, unlike sum(), which converts them to zeroes.
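A tiny illustration of that difference (a standalone sketch, not part of the pipeline):
import numpy as np

s = pd.Series([np.nan, np.nan])  # mimics an all-NaN group for one attribute
print(s.max())  # nan -> the NaN is preserved
print(s.sum())  # 0.0 -> the missing values turn into zero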
df_final = (TS.groupby(by=['INDEX', 'attribute_1', 'attribute_2', 'attribute_3', 'attribute_4'], dropna=False)
              .agg({'timespan': ['min', 'max']})
              .reset_index())
attribute INDEX attribute_1 attribute_2 attribute_3 attribute_4 timespan
min max
0 INDEX_1 1.0 3.0 NaN NaN 2022-01-01 2022-01-31
1 INDEX_1 2.0 3.0 NaN NaN 2022-02-01 2022-12-31
2 INDEX_2 NaN NaN 4.0 NaN 2022-01-01 2022-12-31
3 INDEX_3 NaN NaN NaN 5.0 2022-01-01 2022-12-31
Note that dropna=False as a groupby option, as in pandas GroupBy columns with NaN (missing) values, allows NaNs in the grouper, i.e. the fact that "NaN == NaN" is false does not ruin the grouping.
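A minimal illustration of that option (a sketch with a throwaway frame named demo):
demo = pd.DataFrame({'k': [1.0, None], 'v': [10, 20]})  # the None becomes NaN
print(demo.groupby('k').sum())                # the NaN-keyed row is silently dropped
print(demo.groupby('k', dropna=False).sum())  # both groups are kept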
Basically we have your requested table here. Finally, we could discuss adding the final touches, such as flattening the column MultiIndex and renaming the timespan min/max columns back to start_date and end_date to match your requested output.
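For example, this final touch might look like this (a sketch, assuming the column layout shown in the table above):
# Flatten the MultiIndex columns, then restore the requested names and column order
df_final.columns = ['INDEX', 'attribute_1', 'attribute_2', 'attribute_3',
                    'attribute_4', 'start_date', 'end_date']
df_final = df_final[['INDEX', 'start_date', 'end_date', 'attribute_1',
                     'attribute_2', 'attribute_3', 'attribute_4']]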
Potential bug? Step 3b: I wonder whether, if an identical combination of attributes were to repeat itself over different time spans, the min/max approach would fail to reconstitute the proper start_date and end_date. I believe this might be something to take care of.
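One possible safeguard, not part of the solution above (a sketch that splits each group into contiguous runs of days before taking min/max, so that a repeated combination occurring in two separate periods yields two rows; run and new_run are hypothetical helper names):
group_cols = ['INDEX', 'attribute_1', 'attribute_2', 'attribute_3', 'attribute_4']
TS = TS.sort_values(group_cols + ['timespan'])
# A new run starts wherever consecutive days within a group are more than one day apart
new_run = TS.groupby(group_cols, dropna=False)['timespan'].diff().gt(pd.Timedelta(days=1))
TS['run'] = new_run.cumsum()
df_final = (TS.groupby(group_cols + ['run'], dropna=False)
              .agg({'timespan': ['min', 'max']})
              .reset_index())  # the helper 'run' column can be dropped afterwards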