Tags: algorithm, language-agnostic, time-series, signal-processing, data-analysis

Peak signal detection in realtime timeseries data


I am analyzing the following data:

Plot of data

Raw data (separated by spaces):

1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1 1 1 1.1 0.9 1 1.1 1 1 0.9 1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1 1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1 1 3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1

You can clearly see that there are three large "global" peaks and some smaller "local" peaks. This data can be described as follows:

  1. There is basic noise with a general mean
  2. The distribution of the noise is in the class of exponential families of distributions
  3. There are data points that significantly deviate from the noise (peaks)
  4. The data is observed in real-time (a new data point is observed every period)

How can I identify the peaks in real-time while ignoring the general noise?


Solution

  • Robust peak detection algorithm (using z-scores)

    I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint is a given number of standard deviations (the z-score) away from a moving mean, the algorithm signals. The algorithm is very robust because it constructs a separate moving mean and deviation, such that previous signals do not corrupt the signaling threshold for future signals; the sensitivity of the algorithm is therefore unaffected by previous signals.

    The algorithm takes 3 inputs:

    1. Lag. The lag of the moving window that calculates the mean and standard deviation of historical data. A longer window takes more historical data into account. A shorter window makes the algorithm more adaptive, so it responds to new information more quickly. For example, a lag of 5 will use the last 5 observations to smooth the data.
    2. Threshold. The "z-score" at which the algorithm signals. Simply put, if the distance between a new datapoint and the moving mean is larger than the threshold multiplied by the moving standard deviation of the data, the algorithm provides a signal. For example, a threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean.
    3. Influence. The influence (between 0 and 1) of new signals on the calculation of the moving mean and moving standard deviation. For example, an influence parameter of 0.5 gives new signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold. An influence of 0 is therefore the most robust option (but assumes stationarity); putting the influence option at 1 is least robust. For non-stationary data, the influence option should therefore be put between 0 and 1.

    It works as follows:

    Pseudocode

    # Let y be a vector of timeseries data of at least length lag+2
    # Let mean() be a function that calculates the mean
    # Let std() be a function that calculates the standard deviation
    # Let absolute() be the absolute value function
    
    # Settings (these are examples: choose what is best for your data!)
    set lag to 5;          # average and std. are based on past 5 observations
    set threshold to 3.5;  # signal when data point is 3.5 std. away from average
    set influence to 0.5;  # between 0 (no influence) and 1 (full influence)
    
    # Initialize variables
    set signals to vector 0,...,0 of length of y;   # Initialize signal results
    set filteredY to y(1),...,y(lag)                # Initialize filtered series
    set avgFilter to null;                          # Initialize average filter
    set stdFilter to null;                          # Initialize std. filter
    set avgFilter(lag) to mean(y(1),...,y(lag));    # Initialize first value average
    set stdFilter(lag) to std(y(1),...,y(lag));     # Initialize first value std.
    
    for i=lag+1,...,t do                          # t = current number of observations
      if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
        if y(i) > avgFilter(i-1) then
          set signals(i) to +1;                     # Positive signal
        else
          set signals(i) to -1;                     # Negative signal
        end
        set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
      else
        set signals(i) to 0;                        # No signal
        set filteredY(i) to y(i);
      end
      set avgFilter(i) to mean(filteredY(i-lag+1),...,filteredY(i));
      set stdFilter(i) to std(filteredY(i-lag+1),...,filteredY(i));
    end
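
    For reference, here is a minimal Python/NumPy translation of the pseudocode above. This is only a sketch; the function name thresholding_algo is my own choice and not prescribed by the pseudocode.

    import numpy as np

    def thresholding_algo(y, lag, threshold, influence):
        # Batch translation of the pseudocode: returns the signals and the moving filters
        y = np.asarray(y, dtype=float)
        signals = np.zeros(len(y))
        filteredY = y.copy()
        avgFilter = np.zeros(len(y))
        stdFilter = np.zeros(len(y))
        avgFilter[lag - 1] = np.mean(y[0:lag])
        stdFilter[lag - 1] = np.std(y[0:lag])
        for i in range(lag, len(y)):
            if abs(y[i] - avgFilter[i - 1]) > threshold * stdFilter[i - 1]:
                signals[i] = 1 if y[i] > avgFilter[i - 1] else -1
                # Dampen the influence of the signal on the moving filters
                filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i - 1]
            else:
                signals[i] = 0
                filteredY[i] = y[i]
            # Recompute the filters over the last `lag` filtered observations
            avgFilter[i] = np.mean(filteredY[i - lag + 1:i + 1])
            stdFilter[i] = np.std(filteredY[i - lag + 1:i + 1])
        return signals, avgFilter, stdFilter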
    

    Rules of thumb for selecting good parameters for your data can be found below.


    Demo

    Demonstration of robust thresholding algorithm

    The Matlab code for this demo can be found here. To use the demo, simply run it and create a time series yourself by clicking on the upper chart. The algorithm starts working once lag observations have been drawn.


    Result

    For the original question, the algorithm gives the following output with the settings lag = 30, threshold = 5, influence = 0:

    Thresholding algorithm example
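
    As a usage sketch, assuming the hypothetical thresholding_algo function from the Python translation above, these settings would be applied to the question's data like this:

    # Raw data from the question
    y = [1, 1, 1.1, 1, 0.9, 1, 1, 1.1, 1, 0.9, 1, 1.1, 1, 1, 0.9, 1, 1, 1.1, 1, 1,
         1, 1, 1.1, 0.9, 1, 1.1, 1, 1, 0.9, 1, 1.1, 1, 1, 1.1, 1, 0.8, 0.9, 1, 1.2, 0.9,
         1, 1, 1.1, 1.2, 1, 1.5, 1, 3, 2, 5, 3, 2, 1, 1, 1, 0.9, 1, 1, 3, 2.6,
         4, 3, 3.2, 2, 1, 1, 0.8, 4, 4, 2, 2.5, 1, 1, 1]

    signals, avgFilter, stdFilter = thresholding_algo(y, lag=30, threshold=5, influence=0)
    print(signals)  # +1 marks a positive peak, -1 a negative peak, 0 no signal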


    Implementations of the algorithm in many different programming languages are linked from the original Stack Overflow answer (see the reference at the end).


    Rules of thumb for configuring the algorithm

    Lag. The lag parameter determines how much your data will be smoothed and how adaptive the algorithm is to changes in the long-term average of the data. The more stationary your data is, the more lags you should include (this should improve the robustness of the algorithm). If your data contains time-varying trends, you should consider how quickly you want the algorithm to adapt to these trends. I.e., if you put lag at 10, it takes 10 periods (data points) before the threshold is adjusted to any systematic changes in the long-term average. So choose the lag parameter based on the trending behavior of your data and how adaptive you want the algorithm to be.

    Influence. This parameter determines the influence of signals on the algorithm's detection threshold. If put at 0, signals have no influence on the threshold, such that future signals are detected based on a threshold that is calculated with a mean and standard deviation that is not influenced by past signals. If put at 0.5, signals have half the influence of normal data points. Another way to think about this is that if you put the influence at 0, you implicitly assume stationarity (i.e. no matter how many signals there are, you always expect the time series to return to the same average over the long term). If this is not the case, you should put the influence parameter somewhere between 0 and 1, depending on the extent to which signals can systematically influence the time-varying trend of the data. E.g., if signals lead to a structural break of the long-term average of the time series, the influence parameter should be put high (close to 1) so the threshold can react to structural breaks quickly.

    Threshold. The threshold parameter is the number of standard deviations from the moving mean above which the algorithm will classify a new datapoint as being a signal. For example, if a new datapoint is 4.0 standard deviations above the moving mean and the threshold parameter is set as 3.5, the algorithm will identify the datapoint as a signal. This parameter should be set based on how many signals you expect. For example, if your data is normally distributed, a threshold (or: z-score) of 3.5 corresponds to a signaling probability of 0.00047 (from this table), which implies that you expect a signal once every 2128 datapoints (1/0.00047). The threshold therefore directly influences how sensitive the algorithm is and thereby also determines how often the algorithm signals. Examine your own data and choose a sensible threshold that makes the algorithm signal when you want it to (some trial-and-error might be needed here to get to a good threshold for your purpose).
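
    As a rough aid for that trial-and-error, and assuming approximately normally distributed noise, the expected signal rate for a candidate threshold can be estimated with a few lines of Python (a back-of-the-envelope check only, not part of the algorithm itself):

    import math

    threshold = 3.5
    # Two-sided tail probability of a standard normal beyond +/- threshold
    p_signal = math.erfc(threshold / math.sqrt(2))   # ~0.000465 for threshold = 3.5
    # ~2150 datapoints per signal; the 2128 figure above comes from rounding the probability to 0.00047
    print(f"expect roughly one signal per {1 / p_signal:.0f} datapoints")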


    WARNING: The code above always loops over all datapoints every time it runs. When implementing this code, make sure to split the calculation of the signal into a separate function (without the loop). Then, when a new datapoint arrives, update filteredY, avgFilter and stdFilter once. Do not recalculate the signals for all data every time a new datapoint arrives (as in the example above); that would be extremely inefficient and slow in real-time applications.
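
    For example, a minimal streaming sketch in Python could keep the filter state in a small class and update it once per incoming datapoint (the class name RealTimeZScore and its update method are my own naming, not part of the original answer):

    import numpy as np

    class RealTimeZScore:
        # Keeps only the last `lag` filtered values and updates once per new datapoint

        def __init__(self, lag, threshold, influence):
            self.lag = lag
            self.threshold = threshold
            self.influence = influence
            self.filtered = []      # last `lag` filtered observations
            self.avg = None         # current moving mean
            self.std = None         # current moving standard deviation

        def update(self, value):
            # Feed one new datapoint; returns +1, -1 or 0 (no signal / still warming up)
            if len(self.filtered) < self.lag:
                # Warm-up phase: collect the first `lag` observations unfiltered
                self.filtered.append(value)
                if len(self.filtered) == self.lag:
                    self.avg = np.mean(self.filtered)
                    self.std = np.std(self.filtered)
                return 0

            if abs(value - self.avg) > self.threshold * self.std:
                signal = 1 if value > self.avg else -1
                # Dampen the influence of the signal on the moving filters
                filtered_value = self.influence * value + (1 - self.influence) * self.filtered[-1]
            else:
                signal = 0
                filtered_value = value

            # Slide the window and recompute the moving filters
            self.filtered.append(filtered_value)
            self.filtered.pop(0)
            self.avg = np.mean(self.filtered)
            self.std = np.std(self.filtered)
            return signal

    Each new observation is then handled with a single call, e.g. detector = RealTimeZScore(lag=30, threshold=5, influence=0) followed by detector.update(new_value) per datapoint. Recomputing the mean and standard deviation over the small window keeps each update at O(lag); running sums could reduce this further if needed.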

    Other ways to modify the algorithm (for potential improvements) are:

    1. Use median instead of mean
    2. Use a robust measure of scale, such as the median absolute deviation (MAD), instead of the standard deviation (a sketch of modifications 1 and 2 follows this list)
    3. Use a signal margin, so the signal doesn't switch too often
    4. Change the way the influence parameter works
    5. Treat up and down signals differently (asymmetric treatment)
    6. Create a separate influence parameter for the mean and std (as in this Swift translation)
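
    As an illustration of modifications 1 and 2, a sketch of a median/MAD replacement for the moving filters could look like this (the helper name robust_center_and_scale is hypothetical; the 1.4826 factor makes the MAD comparable to a standard deviation under normality):

    import numpy as np

    def robust_center_and_scale(window):
        # Median and MAD-based replacement for the moving mean and standard deviation
        window = np.asarray(window, dtype=float)
        center = np.median(window)
        # 1.4826 * MAD is a consistent estimate of the standard deviation for normal data
        scale = 1.4826 * np.median(np.abs(window - center))
        return center, scale

    Inside the loop, avgFilter(i) and stdFilter(i) would then be computed with this helper instead of mean() and std(). Note that the MAD can be exactly zero for a flat window, so in practice a small lower bound on the scale may be needed to avoid signaling on tiny fluctuations.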

    Known academic citations of this Stack Overflow answer, other works and applications that use the algorithm, and links to other peak detection algorithms are listed in the original answer (see the reference below).


    How to reference this algorithm:

    Brakel, J.P.G. van (2014). "Robust peak detection algorithm using z-scores". Stack Overflow. Available at: https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 (version: 2020-11-08).

    BibTeX:

    @misc{brakel2014,
      author       = {Brakel, J.P.G. van},
      title        = {Robust peak detection algorithm using z-scores},
      journal      = {Stack Overflow},
      year         = {2014},
      url          = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362},
      urldate      = {2022-04-12},
      language     = {en},
      howpublished = {https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362}
    }


    If you use this function somewhere, please credit me by using the above reference. If you have any questions about the algorithm, post them in the comments below or contact me on LinkedIn.