I have vectors of barometric pressure data, in millibars. It is common practice to display barometric trends over the past 3 hours, as symbols that split the series into two segments. These segments split the data into variations of falling, rising, and steady classifications (nine different classifications total, see https://static1.squarespace.com/static/561db783e4b0383208b79bb3/t/56833dac1115e07a056a8b53/1451441580880/Pressure+Tendency+%2810%29.pdf for more details)
Visually, it's easy to pick out these segments. Programmatically, I'm having trouble automatically detecting the beginning and end of statistically significant LOCAL trends in the data. Also, I would prefer to be able to use 3 or more segments if needed, rather than 2, as two segments is often not representative of the past 3 hours (common for pressure to rise, fall, become steady, or some other 3 or greater-segment pattern during this time).
Using this data and the following code, I have created a dataframe that contains the local maximums and minimums on a time resolution as high as 20 minutes. The find_peaks function was taken from: https://stats.stackexchange.com/questions/22974/how-to-find-local-peaks-valleys-in-a-series-of-data. Note that 210 minutes (3.5 hrs) of data is required as local trends are right-sided.
data <- c(966.6003, 966.5772, 966.6102, 966.5933, 966.5768, 966.565, 966.5139,
966.4826, 966.4626, 966.4595, 966.5143, 966.5732, 966.636, 966.6833, 966.7229,
966.7517, 966.7523, 966.7955, 966.8255, 966.7964, 966.8406, 966.9269, 966.9753,
966.9807, 967.001, 967.0519, 967.0389, 967.107, 967.098, 967.0935, 967.0296, 967.0562,
966.9952, 967.0031, 966.9976, 966.9888, 966.9832, 967.0474, 966.991, 967.0148,
967.0944, 967.0556, 967.0732, 967.051, 967.1252, 967.0908, 967.0746, 967.0633,
967.0821, 967.1274, 967.1229, 967.0838, 967.045, 967.0626, 967.0178, 967.024, 967.0806,
967.0135, 967.0191, 967.0537, 967.0317, 967.0933, 967.0366, 967.0532, 967.0474,
967.0831, 967.0982, 967.1099, 967.0803, 967.0351, 967.0497, 967.0833, 967.0707,
967.1096, 967.1709, 967.1645, 967.1979, 967.1406, 967.1523, 967.1288, 967.1345,
967.1495, 967.1555, 967.1013, 967.0684, 967.1175, 967.1157, 967.0471, 967.0621,
967.0117, 966.9926, 966.9926, 966.9672, 966.9619, 966.982, 966.9976, 966.9993,
967.001, 966.9885, 966.9634, 966.9919, 967.0289, 967.0289, 967.037, 967.0547, 967.0158,
967.0444, 966.9938, 966.9659, 966.9753, 966.9716, 966.9322, 966.9338, 966.8095,
966.7881, 966.7939, 966.7171, 966.738, 966.7165, 966.7402, 966.7237, 966.6966,
966.7085, 966.7268, 966.7099, 966.687, 966.6978, 966.6718, 966.6927, 966.7132,
966.6773, 966.5689, 966.5837, 966.4875, 966.4287, 966.4399, 966.4438, 966.4666,
966.4139, 966.3791, 966.3496, 966.3203, 966.2857, 966.2732, 966.2787, 966.2605,
966.2548, 966.2368, 966.2141, 966.2078, 966.1792, 966.1958, 966.1841, 966.1838,
966.1778, 966.127, 966.1378, 966.1152, 966.1035, 966.126, 966.0864, 966.1371, 966.1143,
966.0972, 966.1028, 966.1141, 966.1089, 966.1031, 966.1031, 966.1075, 966.051,
966.1406, 966.1064, 966.1511, 966.1337, 966.1617, 966.1443, 966.1216, 966.0759,
966.0251, 966.0699, 966.0417, 966.0471, 966.1147, 966.0694, 966.0919, 966.0938, 966.03,
966.0525, 966.0794, 966.0342, 966.0557, 966.033, 966.014, 966.0384, 966.0267, 965.981,
965.9422, 965.9099, 965.8817, 965.8833, 965.8833, 965.9128, 965.879, 965.8757,
965.8965, 965.9114, 965.8699, 965.8426, 965.8389)
plot(data, type = "l")
find_peaks <- function (x, m){
shape <- diff(sign(diff(x, na.pad = FALSE)))
pks <- sapply(which(shape < 0), FUN = function(i){
z <- i - m + 1
z <- ifelse(z > 0, z, 1)
w <- i + m + 1
w <- ifelse(w < length(x), w, length(x))
if(all(x[c(z : i, (i + 2) : w)] <= x[i + 1])) return(i + 1) else return(numeric(0))
})
pks <- unlist(pks)
pks}
# make data frame of local max and mins (threshold of 21 minutes width)
maximums_df <- data.frame(trend = 'max', index = find_peaks(data, m = 10))
minimums_df <- data.frame(trend = 'min', index = find_peaks(data, m = 10))
maxmin_df <- rbind(maximums_df, minimums_df)[order(rbind(maximums_df,
minimums_df)$index),]
I have tried many different ways of automatically defining appropriate segments in the data. There should be 3 in this data set. Plotting the data clearly shows the first one is rising, the second is steady, then the largest, and last, is falling pressure. The threshold for falling or rising is absolute value of > +- 0.1 mb/hr.
Performing a moving linear regression is plausible, but again, defining the beginning/end of a trend is difficult with this as matching the max/min's of the data series is difficult. There should be a way using just the maxmin_df, I'm thinking, but it's escaping me at the moment. Any help or ideas would be wonderful.
You could try segmented
regression which estimates breakpoints. You can eyeball starting values psi=c(50, 100, 150)
.
> fit <- lm(y ~ x, df)
> library(segmented)
> sfit <- segmented::segmented(fit, seg.Z=~x, psi=c(50, 100, 150),
+ control=seg.control(display=FALSE))
> summary(sfit)
***Regression Model with Segmented Relationship(s)***
Call:
segmented.lm(obj = fit, seg.Z = ~x, psi = c(50, 100, 150), control = seg.control(display = FALSE))
Estimated Break-Point(s):
Est. St.Err
psi1.x 29.531 1.120
psi2.x 104.665 1.156
psi3.x 149.889 1.647
Coefficients of the linear terms:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.664e+02 2.175e-02 44434.73 <2e-16 ***
x 2.216e-02 1.266e-03 17.50 <2e-16 ***
U1.x -2.231e-02 1.302e-03 -17.13 NA
U2.x -1.858e-02 7.221e-04 -25.73 NA
U3.x 1.367e-02 7.752e-04 17.63 NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05705 on 202 degrees of freedom
Multiple R-Squared: 0.9832, Adjusted R-squared: 0.9826
Boot restarting based on 6 samples. Last fit:
Convergence attained in 3 iterations (rel. change 4.2078e-13)
> plot(sfit, ylim=range(df$y))
> lines(df$y)
> legend('topright', legend=c('data', 'segmented fit'), lty=1, col=c(1, 'red'))
Data:
> df <- data.frame(y=data, x=seq_along(data)) ## from OP