rtime-seriesanomaly-detection

OHCL Time Series - Anomaly Detection with Multivariate Gaussian Distribution


I have a OHLC time series for some stock prices:

library(quantmod)
library(mnormt)
library(MASS)

download.file("http://dl.dropbox.com/u/25747565/941.RData", destfile="test.RData")
load("test.RData")
chartSeries(p)

941 OHLC time series

As you can see from the plot, there are two downward spikes, most likely due to some sort of data error. I want to use a multivariate Gaussian to detect the rows which contain these two offending data points.

> x[122,]
 941.Open  941.High   941.Low 941.Close 
    85.60     86.65      5.36     86.20 
> x[136,]
 941.Open  941.High   941.Low 941.Close 
    84.15     85.60     54.20     85.45 

Here is my code to fit the distribution and calculate the probabilities of each data point:

x <- coredata(p[,1:4])
mu <- apply(x, 2, mean)
sigma <- cov.rob(x)$cov
prob <- apply(x, 1, dmnorm, mean = mu, varcov = sigma, log = TRUE)

However, this code throws up the following error:

Error in pd.solve(varcov, log.det = TRUE) : x appears to be not symmetric

This error did not come up when I used the standard cov() function to calculate the covariance matrix, but only with the Robust covariance matrix function. The covariance matrix itself looks quite benign to me so I'm not sure what is going on. The reasons I want to use a robust estimation of the covariance matrix is because the standard covariance matrix was giving a few false positives as I was including anomalies in my training set.

Can someone tell me:


Solution

  • You are going right. I don't know the how to fix the error but I can suggest you how to find the indices of novel points.

    1. Your training data may have small number of anomalies otherwise cross-validation and test results will suffer
    2. Calculate mean and covariance in MATLAB as

         mu=mean(traindata);
         sigma=cov(traindata);
      
    3. set epsilon as per your requirement

      count=0;
        Actual=[];
        for j=1:size(cv,1)
          p=mvnpdf(cv(j,:),mu,sigma);
            if p<eplison
              count=count+1;
              Actual=[Actual;j];
                 fprintf('j=%d \t p=%e\n',j,p);
           end
      end
      
    4. tune threshold if results are not satisfactory

    5. Evaluate model using F-1 score(if F-1 score is 1, you did it right)
    6. Apply the model on test data
    7. done!