I have sensor data as below where, under the Data column, there are 6 rows containing the value 45 in between preceding and following rows containing the value 50. The requirement is to clean this data and impute 50 (the previous value) into the New_data column. Moreover, the noise records (shown as 45 in the table) may vary both in how many there are and in where they occur.
Case 1 (sample data):
Sl.no | Timestamp | Data | New_data |
---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 |
2 | 1/1/2021 0:15:00 | 50 | 50 |
3 | 1/1/2021 0:30:00 | 50 | 50 |
4 | 1/1/2021 0:45:00 | 50 | 50 |
5 | 1/1/2021 1:00:00 | 50 | 50 |
6 | 1/1/2021 1:15:00 | 50 | 50 |
7 | 1/1/2021 1:30:00 | 50 | 50 |
8 | 1/1/2021 1:45:00 | 50 | 50 |
9 | 1/1/2021 2:00:00 | 50 | 50 |
10 | 1/1/2021 2:15:00 | 50 | 50 |
11 | 1/1/2021 2:30:00 | 45 | 50 |
12 | 1/1/2021 2:45:00 | 45 | 50 |
13 | 1/1/2021 3:00:00 | 45 | 50 |
14 | 1/1/2021 3:15:00 | 45 | 50 |
15 | 1/1/2021 3:30:00 | 45 | 50 |
16 | 1/1/2021 3:45:00 | 45 | 50 |
17 | 1/1/2021 4:00:00 | 50 | 50 |
18 | 1/1/2021 4:15:00 | 50 | 50 |
19 | 1/1/2021 4:30:00 | 50 | 50 |
20 | 1/1/2021 4:45:00 | 50 | 50 |
21 | 1/1/2021 5:00:00 | 50 | 50 |
22 | 1/1/2021 5:15:00 | 50 | 50 |
23 | 1/1/2021 5:30:00 | 50 | 50 |
I am thinking I need to group these records ordered by timestamp ascending (as below), and then have a condition that checks group by group across the full data set: if group 1 is the same as group 3, replace group 2's values with group 1's.
Sl.no | Timestamp | Data | New_data | group |
---|---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 | 1 |
2 | 1/1/2021 0:15:00 | 50 | 50 | 1 |
3 | 1/1/2021 0:30:00 | 50 | 50 | 1 |
4 | 1/1/2021 0:45:00 | 50 | 50 | 1 |
5 | 1/1/2021 1:00:00 | 50 | 50 | 1 |
6 | 1/1/2021 1:15:00 | 50 | 50 | 1 |
7 | 1/1/2021 1:30:00 | 50 | 50 | 1 |
8 | 1/1/2021 1:45:00 | 50 | 50 | 1 |
9 | 1/1/2021 2:00:00 | 50 | 50 | 1 |
10 | 1/1/2021 2:15:00 | 50 | 50 | 1 |
11 | 1/1/2021 2:30:00 | 45 | 50 | 2 |
12 | 1/1/2021 2:45:00 | 45 | 50 | 2 |
13 | 1/1/2021 3:00:00 | 45 | 50 | 2 |
14 | 1/1/2021 3:15:00 | 45 | 50 | 2 |
15 | 1/1/2021 3:30:00 | 45 | 50 | 2 |
16 | 1/1/2021 3:45:00 | 45 | 50 | 2 |
17 | 1/1/2021 4:00:00 | 50 | 50 | 3 |
18 | 1/1/2021 4:15:00 | 50 | 50 | 3 |
19 | 1/1/2021 4:30:00 | 50 | 50 | 3 |
20 | 1/1/2021 4:45:00 | 50 | 50 | 3 |
21 | 1/1/2021 5:00:00 | 50 | 50 | 3 |
22 | 1/1/2021 5:15:00 | 50 | 50 | 3 |
23 | 1/1/2021 5:30:00 | 50 | 50 | 3 |
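The group column above can be derived with a standard gaps-and-islands technique: flag each row where the value changes from the previous row, then take a running sum of the flags. A sketch, assuming a table `sensors(sl_no, timestamp, data)` (names assumed):

```sql
-- Assign a group number that increments whenever data changes
-- from one row to the next (gaps-and-islands).
SELECT sl_no, timestamp, data,
       SUM(CASE WHEN data IS DISTINCT FROM prev THEN 1 ELSE 0 END)
           OVER (ORDER BY timestamp) AS grp
FROM (
    SELECT sl_no, timestamp, data,
           LAG(data) OVER (ORDER BY timestamp) AS prev
    FROM sensors
) t;
```

`IS DISTINCT FROM` treats the first row (where `LAG` is NULL) as a change, so the numbering starts at 1.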
Moreover, there is also a need for an exception: if the next group shows the same pattern, do not change the data but retain it as is. For example: if group 1 and group 3 are the same, impute group 2 with group 1's value. But if group 2 and group 4 are the same, do not change group 3; retain the same data in New_data.
Case 2:
Sl.no | Timestamp | Data | New_data | group |
---|---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 | 1 |
2 | 1/1/2021 0:15:00 | 50 | 50 | 1 |
3 | 1/1/2021 0:30:00 | 50 | 50 | 1 |
4 | 1/1/2021 0:45:00 | 50 | 50 | 1 |
5 | 1/1/2021 1:00:00 | 50 | 50 | 1 |
6 | 1/1/2021 1:15:00 | 50 | 50 | 1 |
7 | 1/1/2021 1:30:00 | 50 | 50 | 1 |
8 | 1/1/2021 1:45:00 | 50 | 50 | 1 |
9 | 1/1/2021 2:00:00 | 50 | 50 | 1 |
10 | 1/1/2021 2:15:00 | 50 | 50 | 1 |
11 | 1/1/2021 2:30:00 | 45 | 50 | 2 |
12 | 1/1/2021 2:45:00 | 45 | 50 | 2 |
13 | 1/1/2021 3:00:00 | 45 | 50 | 2 |
14 | 1/1/2021 3:15:00 | 45 | 50 | 2 |
15 | 1/1/2021 3:30:00 | 45 | 50 | 2 |
16 | 1/1/2021 3:45:00 | 45 | 50 | 2 |
17 | 1/1/2021 4:00:00 | 50 | 50 | 3 |
18 | 1/1/2021 4:15:00 | 50 | 50 | 3 |
19 | 1/1/2021 4:30:00 | 50 | 50 | 3 |
20 | 1/1/2021 4:45:00 | 50 | 50 | 3 |
21 | 1/1/2021 5:00:00 | 50 | 50 | 3 |
22 | 1/1/2021 5:15:00 | 50 | 50 | 3 |
23 | 1/1/2021 5:30:00 | 50 | 50 | 3 |
24 | 1/1/2021 5:45:00 | 45 | 45 | 4 |
25 | 1/1/2021 6:00:00 | 45 | 45 | 4 |
26 | 1/1/2021 6:15:00 | 45 | 45 | 4 |
27 | 1/1/2021 6:30:00 | 45 | 45 | 4 |
28 | 1/1/2021 6:45:00 | 45 | 45 | 4 |
29 | 1/1/2021 7:00:00 | 45 | 45 | 4 |
30 | 1/1/2021 7:15:00 | 45 | 45 | 4 |
31 | 1/1/2021 7:30:00 | 45 | 45 | 4 |
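One way I can picture this is to collapse each run into a single row per group and expose each group's neighbours, so the noise rule can be evaluated group by group. A sketch, with the same assumed table `sensors(timestamp, data)`:

```sql
-- One row per group, with the previous and next group's values
-- alongside, to reason about which groups are noise candidates.
WITH flagged AS (
    SELECT timestamp, data,
           LAG(data) OVER (ORDER BY timestamp) AS prev
    FROM sensors
), grouped AS (
    SELECT timestamp, data,
           SUM(CASE WHEN data IS DISTINCT FROM prev THEN 1 ELSE 0 END)
               OVER (ORDER BY timestamp) AS grp
    FROM flagged
), per_group AS (
    SELECT grp, MIN(data) AS data
    FROM grouped
    GROUP BY grp
)
SELECT grp, data,
       LAG(data)  OVER (ORDER BY grp) AS prev_group_data,
       LEAD(data) OVER (ORDER BY grp) AS next_group_data
FROM per_group
ORDER BY grp;
```

A group looks like noise when `prev_group_data = next_group_data`, but that test alone is not enough for Case 2: once group 2 is corrected to 50, group 3 matches the *corrected* previous value and must be retained, even though its raw neighbours (groups 2 and 4) are equal. So the final decision has to be made sequentially in timestamp order, e.g. with a recursive CTE or a procedural loop.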
I am reaching out for help coding this in PostgreSQL. Please feel free to suggest alternative approaches to solve the above problem.
The query below should answer the need.
The first query identifies the rows that correspond to a change of data.
The second query groups the rows between two successive changes of data and sets up the corresponding range of timestamps.
The third query is a recursive query which calculates new_data iteratively, in timestamp order.
The last query displays the expected result.
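For reproduction, the query assumes a table along these lines (column names inferred from the query; types assumed):

```sql
CREATE TABLE sensors (
    no        integer,
    timestamp timestamp,
    data      integer
);
```

It also assumes a fixed 15-minute sampling interval, since consecutive ranges are joined via `upper(p.range) + interval '15 minutes'`.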
```sql
WITH RECURSIVE list AS
(
  SELECT no
       , timestamp
       , lag(data)  OVER w AS previous
       , data
       , lead(data) OVER w AS next
       , data IS DISTINCT FROM lag(data)  OVER w AS first
       , data IS DISTINCT FROM lead(data) OVER w AS last
    FROM sensors
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
  SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
       , previous
       , data
       , lead(next) OVER w AS next
       , first
    FROM list
   WHERE first OR last
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
  SELECT range
       , previous
       , data
       , next
       , data
       , array[range]
    FROM range_list
   WHERE previous IS NULL
  UNION ALL
  SELECT c.range
       , p.data
       , c.data
       , c.next
       , CASE
           WHEN p.new_data IS NOT DISTINCT FROM c.next
           THEN p.data
           ELSE c.data
         END
       , p.arr || c.range
    FROM rec_list AS p
   INNER JOIN range_list AS c
      ON lower(c.range) = upper(p.range) + interval '15 minutes'
   WHERE NOT array[c.range] <@ p.arr
     AND first
)
SELECT s.*, r.new_data
  FROM sensors AS s
 INNER JOIN rec_list AS r
    ON r.range @> s.timestamp
 ORDER BY timestamp;
```
See the test result in dbfiddle.