I'm looking to create a variance/covariance matrix or a correlation matrix (I don't care which) on stock data.
The issue is that different symbols trade at different times, so my dataframe is not full.
How would I generate a covariance matrix on a dataset that has a timestamp column, and where some entries are null.
Here would be an example.
timestamp usdjpy AAPL NIKKEI
0 2021-01-01 00:00:00 4 NaN NaN
1 2021-01-01 00:05:00 5 5 NaN
2 2021-01-01 00:10:00 6 6 9
3 2021-01-01 00:15:00 4 2 4
4 2021-01-01 00:20:00 2 NaN 3
5 2021-01-01 00:25:00 7 NaN 7
What I would want here is the covariance between usdjpy and aapl to be computed on their overlapping timestamps.
Just use cov
, this will ignore the NaNs by default:
out = df.drop(columns=['timestamp']).cov()
Output:
usdjpy AAPL NIKKEI
usdjpy 3.066667 2.000000 5.250000
AAPL 2.000000 4.333333 10.000000
NIKKEI 5.250000 10.000000 7.583333
If you just want to compute the covariance between the tickers and usdjpy:
out = df[['AAPL', 'NIKKEI']].apply(lambda s: s.cov(df['usdjpy']))
# equivalent to
# df[['AAPL', 'NIKKEI']].apply(lambda s: s.dropna().cov(df['usdjpy']))
Output:
AAPL 2.00
NIKKEI 5.25
dtype: float64
For the correlation, use corr
:
df.drop(columns=['timestamp']).corr()
usdjpy AAPL NIKKEI
usdjpy 1.000000 0.960769 0.859793
AAPL 0.960769 1.000000 1.000000
NIKKEI 0.859793 1.000000 1.000000
df[['AAPL', 'NIKKEI']].corrwith(df['usdjpy'], axis=0)
AAPL 0.960769
NIKKEI 0.859793
dtype: float64
Excerpt of the cov
documentation on NaNs:
Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.
Notes
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.