I looked at similar answers but could not find one that suits my problem.
I have a daily covariance matrix that is constructed as a MultiIndex. It has "k" dates, and for each date "k" it has a matrix of size "n" by "n". The dimensions are technically (k, n, n) but as this is a MultiIndex pandas views this as shape (kxn, n).
As a minimum working example, I can provide the below:
dates = pd.date_range(start='20120101', end='20210101', freq='D')
X = pd.DataFrame( {'A' : np.random.rand(len(dates)), 'B' : np.random.rand(len(dates)), 'C' : np.random.rand(len(dates)), 'D' : np.random.rand(len(dates)), 'E' : np.random.rand(len(dates)) }, index=dates).ewm(halflife=30, min_periods=1).cov()
I would like to resample it from daily to minutely. Assuming my MultiIndex covariance matrix is called "X", I have managed to get the following to work:
X.unstack().resample("T").first().ffill().stack()
However, this takes an awfully long time to compute. Is there a faster and more efficient way to perform this operation?
This used to be quick with Panels that were deprecated after pandas 0.24. From my own profiling work the most memory intensive part seems to be the 'stack()'
IIUC
Using resample to add information ('D' -> 'T') is not a right choice especially if you want to fill forward values. You can use np.vsplit
to create a Panel-like then repeat your array according to your DatetimeIndex and finally reshape the data:
# Create new MultiIndex
dates2 = pd.date_range(X.index.levels[0].min(),
X.index.levels[0].max() + pd.DateOffset(days=1),
freq='T', closed='left')
mi = pd.MultiIndex.from_product([dates2, X.index.levels[1]])
# Manipulate your array
vals = np.array(np.repeat(np.vsplit(X.values, len(X.index.levels[0])), 24*60, axis=0))
vals = vals.reshape(vals.shape[0]*vals.shape[1], vals.shape[2])
# New dataframe
out = pd.DataFrame(vals, index=mi, columns=X.columns)
For a smaller sample:
>>> df
A B
2012-01-01 A 11 12
B 13 14
2012-01-02 A 21 22
B 23 24
2012-01-03 A 31 32
B 33 34
# Resample: 12H and 2 values per day
# dates2 = pd.date_range(df.index.levels[0].min(), df.index.levels[0].max() + pd.DateOffset(days=1), freq='12H', closed='left')
# mi = pd.MultiIndex.from_product([dates2, df.index.levels[1]])
# vals = np.array(np.repeat(np.vsplit(df.values, len(df.index.levels[0])), 2, axis=0))
# vals = vals.reshape(vals.shape[0]*vals.shape[1], vals.shape[2])
# out = pd.DataFrame(vals, index=mi, columns=df.columns)
>>> out
A B
2012-01-01 00:00:00 A 11 12
B 13 14
2012-01-01 12:00:00 A 11 12
B 13 14
2012-01-02 00:00:00 A 21 22
B 23 24
2012-01-02 12:00:00 A 21 22
B 23 24
2012-01-03 00:00:00 A 31 32
B 33 34
2012-01-03 12:00:00 A 31 32
B 33 34
With your code:
>>> df.unstack().resample("12H").first().ffill().stack()
A B
2012-01-01 00:00:00 A 11.0 12.0
B 13.0 14.0
2012-01-01 12:00:00 A 11.0 12.0
B 13.0 14.0
2012-01-02 00:00:00 A 21.0 22.0
B 23.0 24.0
2012-01-02 12:00:00 A 21.0 22.0
B 23.0 24.0
2012-01-03 00:00:00 A 31.0 32.0
B 33.0 34.0
# <- Lost 2012-01-03 12:00:00
Performance on X
>>> %timeit op_resample()
9.1 s ± 568 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit new_array()
1.86 s ± 23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)