I have a timeseries dataset with 'n' number of columns. In the dataset, I would like to filter and remove the hours for which the value in a column changed within the hour. In other words, I want to keep the hours that has unchanged value.
Some info about the data:
1-minute data with date in POSCIXT format (%Y-%m-%d %H:%M:%S)
The data is recorded for every 5 minutes, so for the time, for which the data is not recorded is filled with NA
data1<- structure(list(PDT = structure(c(1668927600, 1668927660,
1668927720,
1668927780, 1668927840, 1668927900, 1668927960, 1668928020, 1668928080,
1668928140, 1668928200, 1668928260, 1668928320, 1668928380, 1668928440,
1668928500, 1668928560, 1668928620, 1668928680, 1668928740, 1668928800,
1668928860, 1668928920, 1668928980, 1668929040, 1668929100, 1668929160,
1668929220, 1668929280, 1668929340, 1668929400, 1668929460, 1668929520,
1668929580, 1668929640, 1668929700, 1668929760, 1668929820, 1668929880,
1668929940, 1668930000, 1668930060, 1668930120, 1668930180, 1668930240,
1668930300, 1668932640, 1668932700, 1668932760, 1668932820, 1668932880,
1668932940, 1668933000, 1668933060, 1668933120, 1668933180, 1668933240,
1668933300, 1668933360, 1668933720, 1668933780, 1668933840, 1668933900,
1668933960, 1668934020, 1668934080, 1668934140, 1668934200, 1668934260,
1668934320, 1668934380, 1668934440, 1668934500, 1668934560, 1668934620,
1668934680, 1668934740, 1668934800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Date = structure(c(1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400, 1668902400, 1668902400,
1668902400, 1668902400, 1668902400, 1668902400), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), DayOfWeek = c("Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday",
"Sunday", "Sunday", "Sunday", "Sunday", "Sunday", "Sunday"),
Month = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11), Day = c(20, 20, 20, 20, 20, 20,
20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20), Hour_hr = c(7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
9), Minute = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 0), ColA = c(NA, NA, NA, 65.5, NA, NA, NA,
NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA,
NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5,
NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA,
NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5,
NA, NA, NA, NA, 65.5, NA, NA, NA, NA, 65.5, NA, NA, NA, NA,
65.7, NA, NA)), row.names = 3041:3118, class = "data.frame")
Expected output:
In the above example, I want to exclude hour 8 from my dataset, as the value in ColA is not constant.
I have a feeling that group_by()
and filter()
from dplyr might do the job, but I am not sure about the function to find the unchanged values within an hour.
Any help regarding this is much appreciated. Thanks.
This does it:
data1 %>% group_by(Hour_hr) %>% filter(n_distinct(ColA) < 3)
Checking results:
count(data1, Hour_hr)
Hour_hr n
<dbl> <int>
1 7 46
2 9 1
This will keep colA if there's only one numerical value or no numerical values (NA), keeping hour 7 and 9.
Equivalently you could do:
data1 %>% group_by(Hour_hr) %>% filter(n_distinct(ColA, na.rm = T) < 2)