r

Identify single outliers using `diff()`


I'm trying to implement a rather simple outlier check as proposed in Zahumenský (2004): Guidelines on Quality Control Procedures for Data from Automatic Weather Stations.

According to the article, when checking for the maximum allowed variability of measurements, values differing from the prior one by more than a specific limit fail the check and should be flagged as doubtful. I tried this with air temperature values, using a critical delta of 3 K per time step.

What I tried so far:

set.seed(42)
ta <- rnorm(60, mean = 12, sd = 0.2) |> round(1)

# generate outliers
ta[15] <- 39
ta[45] <- -19

# carry out step test
ta_delta <- c(NA, diff(ta))
ta_delta
#>  [1]    NA  -0.4   0.2   0.0   0.0  -0.1   0.3  -0.3   0.4  -0.4   0.3   0.2
#> [13]  -0.8   0.2  27.1 -26.9  -0.2  -0.4   0.0   0.8  -0.4  -0.3   0.4   0.2
#> [25]   0.2  -0.5   0.0  -0.3   0.5  -0.2   0.2   0.0   0.1  -0.3   0.2  -0.4
#> [37]   0.1   0.0  -0.3   0.5   0.0  -0.1   0.3  -0.3 -30.9  31.1  -0.3   0.5
#> [49]  -0.4   0.2   0.0  -0.3   0.5  -0.2  -0.1   0.1   0.0  -0.1  -0.6   0.7

length(ta_delta)
#> [1] 60

which(abs(ta_delta) > 3)
#> [1] 15 16 45 46

What I need in the end is a logical vector (check) of the same length as ta that is TRUE only at the outliers, i.e. at indices 15 and 45.

Currently, I flag not only the outliers but also the value following each of them. My idea was to narrow abs(ta_delta) > 3 down to the segments holding c(TRUE, TRUE) and pick the index of the first TRUE in each, which should give the desired output. But I'm pretty sure there are functions or approaches for this that I'm missing.

Thank you very much in advance!


Solution

  • Does this do what you want?

    log_vec <- diff(abs(ta_delta) > 3) == -1  # Or < 0
    print(log_vec)
    # [1]    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
    # [19] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    # [55] FALSE FALSE FALSE FALSE FALSE
    
    which(log_vec)
    # [1] 16 46 
    

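    Note that log_vec has length 59 and its TRUEs sit one position after the outliers (16 and 46 rather than 15 and 45). Shift the vector by one, pad it with FALSE to the length of ta, and replace the leading NA as needed.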
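
    As a rough sketch of that shifting and padding, using the toy data from the question: dropping the first element removes the NA and moves every flag one position to the left, and two trailing FALSE values bring the length back to 60. This assumes the outliers are isolated single values; a lasting jump to a new level would flag the value just before the jump.

    ```r
    # Same toy data as in the question.
    set.seed(42)
    ta <- round(rnorm(60, mean = 12, sd = 0.2), 1)
    ta[15] <- 39
    ta[45] <- -19

    ta_delta <- c(NA, diff(ta))
    log_vec <- diff(abs(ta_delta) > 3) == -1   # length 59, leading NA

    # Drop the leading NA (this shifts all flags one step to the left)
    # and pad with two FALSE so that check lines up with ta.
    check <- c(log_vec[-1], FALSE, FALSE)

    length(check)
    # [1] 60
    which(check)
    # [1] 15 45
    ```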

    The code makes sense as a form of numerical second-order differentiation used to find a local maximum.

    Edit:

    A more direct "numerical differentiation" would be:

    which(diff(abs(diff(ta))) < -3)
    # [1] 15 45
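
    If what you need is the full-length logical vector from the question, another possible sketch is to flag a value only when both the step into it and the step out of it exceed the limit. This is an assumption about what "single outlier" means, not something taken from the guideline, and the names limit and jump are made up for this example.

    ```r
    # Same toy data as in the question.
    set.seed(42)
    ta <- round(rnorm(60, mean = 12, sd = 0.2), 1)
    ta[15] <- 39
    ta[45] <- -19

    limit <- 3                       # critical delta per time step (assumed, in K)
    jump <- abs(diff(ta)) > limit    # jump[i]: step from ta[i] to ta[i + 1]

    # c(FALSE, jump)[i] is the step into ta[i], c(jump, FALSE)[i] the step out of it.
    check <- c(FALSE, jump) & c(jump, FALSE)

    length(check)
    # [1] 60
    which(check)
    # [1] 15 45
    ```

    With this padding the first and last values can never be flagged, because each of them has only one neighbour, and missing values in ta would propagate as NA into check.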