I have sensor data as below where, under the Data column, there are 6 rows containing the value 45 in between preceding and following rows containing the value 50. The requirement is to clean this data and impute 50 (the previous value) into the New_data column. Moreover, the noise records (shown as 45 in the table) may vary both in how many there are and in where they occur.
Case 1 (sample data):
Sl.no | Timestamp | Data | New_data |
---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 |
2 | 1/1/2021 0:15:00 | 50 | 50 |
3 | 1/1/2021 0:30:00 | 50 | 50 |
4 | 1/1/2021 0:45:00 | 50 | 50 |
5 | 1/1/2021 1:00:00 | 50 | 50 |
6 | 1/1/2021 1:15:00 | 50 | 50 |
7 | 1/1/2021 1:30:00 | 50 | 50 |
8 | 1/1/2021 1:45:00 | 50 | 50 |
9 | 1/1/2021 2:00:00 | 50 | 50 |
10 | 1/1/2021 2:15:00 | 50 | 50 |
11 | 1/1/2021 2:30:00 | 45 | 50 |
12 | 1/1/2021 2:45:00 | 45 | 50 |
13 | 1/1/2021 3:00:00 | 45 | 50 |
14 | 1/1/2021 3:15:00 | 45 | 50 |
15 | 1/1/2021 3:30:00 | 45 | 50 |
16 | 1/1/2021 3:45:00 | 45 | 50 |
17 | 1/1/2021 4:00:00 | 50 | 50 |
18 | 1/1/2021 4:15:00 | 50 | 50 |
19 | 1/1/2021 4:30:00 | 50 | 50 |
20 | 1/1/2021 4:45:00 | 50 | 50 |
21 | 1/1/2021 5:00:00 | 50 | 50 |
22 | 1/1/2021 5:15:00 | 50 | 50 |
23 | 1/1/2021 5:30:00 | 50 | 50 |
I am thinking I need to group these records ordered by timestamp ascending (as below), and then have a condition that checks group by group across the full data set: if group 1 is the same as group 3, replace group 2's values with group 1's.
Sl.no | Timestamp | Data | New_data | group |
---|---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 | 1 |
2 | 1/1/2021 0:15:00 | 50 | 50 | 1 |
3 | 1/1/2021 0:30:00 | 50 | 50 | 1 |
4 | 1/1/2021 0:45:00 | 50 | 50 | 1 |
5 | 1/1/2021 1:00:00 | 50 | 50 | 1 |
6 | 1/1/2021 1:15:00 | 50 | 50 | 1 |
7 | 1/1/2021 1:30:00 | 50 | 50 | 1 |
8 | 1/1/2021 1:45:00 | 50 | 50 | 1 |
9 | 1/1/2021 2:00:00 | 50 | 50 | 1 |
10 | 1/1/2021 2:15:00 | 50 | 50 | 1 |
11 | 1/1/2021 2:30:00 | 45 | 50 | 2 |
12 | 1/1/2021 2:45:00 | 45 | 50 | 2 |
13 | 1/1/2021 3:00:00 | 45 | 50 | 2 |
14 | 1/1/2021 3:15:00 | 45 | 50 | 2 |
15 | 1/1/2021 3:30:00 | 45 | 50 | 2 |
16 | 1/1/2021 3:45:00 | 45 | 50 | 2 |
17 | 1/1/2021 4:00:00 | 50 | 50 | 3 |
18 | 1/1/2021 4:15:00 | 50 | 50 | 3 |
19 | 1/1/2021 4:30:00 | 50 | 50 | 3 |
20 | 1/1/2021 4:45:00 | 50 | 50 | 3 |
21 | 1/1/2021 5:00:00 | 50 | 50 | 3 |
22 | 1/1/2021 5:15:00 | 50 | 50 | 3 |
23 | 1/1/2021 5:30:00 | 50 | 50 | 3 |
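The group column above can be derived with a standard gaps-and-islands technique: flag each row where the value changes from the previous row, then take a running sum of the flags. A sketch, assuming a table `sensors(sl_no, timestamp, data)` (names assumed):

```sql
-- Assign a group number that increments whenever data changes
-- from one row to the next (gaps-and-islands).
SELECT sl_no, timestamp, data,
       SUM(CASE WHEN data IS DISTINCT FROM prev THEN 1 ELSE 0 END)
           OVER (ORDER BY timestamp) AS grp
FROM (
    SELECT sl_no, timestamp, data,
           LAG(data) OVER (ORDER BY timestamp) AS prev
    FROM sensors
) t;
```

`IS DISTINCT FROM` treats the first row (where `LAG` is NULL) as a change, so the numbering starts at 1.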
Moreover, there is also a need for an exception: if the next group shows the same pattern, do not change the data but retain it as is. For example: if group 1 and group 3 are the same, impute group 2 with group 1's value. But if group 2 and group 4 are the same, do not change group 3; retain the same data in New_data.
Case 2:
Sl.no | Timestamp | Data | New_data | group |
---|---|---|---|---|
1 | 1/1/2021 0:00:00 | 50 | 50 | 1 |
2 | 1/1/2021 0:15:00 | 50 | 50 | 1 |
3 | 1/1/2021 0:30:00 | 50 | 50 | 1 |
4 | 1/1/2021 0:45:00 | 50 | 50 | 1 |
5 | 1/1/2021 1:00:00 | 50 | 50 | 1 |
6 | 1/1/2021 1:15:00 | 50 | 50 | 1 |
7 | 1/1/2021 1:30:00 | 50 | 50 | 1 |
8 | 1/1/2021 1:45:00 | 50 | 50 | 1 |
9 | 1/1/2021 2:00:00 | 50 | 50 | 1 |
10 | 1/1/2021 2:15:00 | 50 | 50 | 1 |
11 | 1/1/2021 2:30:00 | 45 | 50 | 2 |
12 | 1/1/2021 2:45:00 | 45 | 50 | 2 |
13 | 1/1/2021 3:00:00 | 45 | 50 | 2 |
14 | 1/1/2021 3:15:00 | 45 | 50 | 2 |
15 | 1/1/2021 3:30:00 | 45 | 50 | 2 |
16 | 1/1/2021 3:45:00 | 45 | 50 | 2 |
17 | 1/1/2021 4:00:00 | 50 | 50 | 3 |
18 | 1/1/2021 4:15:00 | 50 | 50 | 3 |
19 | 1/1/2021 4:30:00 | 50 | 50 | 3 |
20 | 1/1/2021 4:45:00 | 50 | 50 | 3 |
21 | 1/1/2021 5:00:00 | 50 | 50 | 3 |
22 | 1/1/2021 5:15:00 | 50 | 50 | 3 |
23 | 1/1/2021 5:30:00 | 50 | 50 | 3 |
24 | 1/1/2021 5:45:00 | 45 | 45 | 4 |
25 | 1/1/2021 6:00:00 | 45 | 45 | 4 |
26 | 1/1/2021 6:15:00 | 45 | 45 | 4 |
27 | 1/1/2021 6:30:00 | 45 | 45 | 4 |
28 | 1/1/2021 6:45:00 | 45 | 45 | 4 |
29 | 1/1/2021 7:00:00 | 45 | 45 | 4 |
30 | 1/1/2021 7:15:00 | 45 | 45 | 4 |
31 | 1/1/2021 7:30:00 | 45 | 45 | 4 |
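One way I can picture this is to collapse each run into a single row per group and expose each group's neighbours, so the noise rule can be evaluated group by group. A sketch, with the same assumed table `sensors(timestamp, data)`:

```sql
-- One row per group, with the previous and next group's values
-- alongside, to reason about which groups are noise candidates.
WITH flagged AS (
    SELECT timestamp, data,
           LAG(data) OVER (ORDER BY timestamp) AS prev
    FROM sensors
), grouped AS (
    SELECT timestamp, data,
           SUM(CASE WHEN data IS DISTINCT FROM prev THEN 1 ELSE 0 END)
               OVER (ORDER BY timestamp) AS grp
    FROM flagged
), per_group AS (
    SELECT grp, MIN(data) AS data
    FROM grouped
    GROUP BY grp
)
SELECT grp, data,
       LAG(data)  OVER (ORDER BY grp) AS prev_group_data,
       LEAD(data) OVER (ORDER BY grp) AS next_group_data
FROM per_group
ORDER BY grp;
```

A group looks like noise when `prev_group_data = next_group_data`, but that test alone is not enough for Case 2: once group 2 is corrected to 50, group 3 matches the *corrected* previous value and must be retained, even though its raw neighbours (groups 2 and 4) are equal. So the final decision has to be made sequentially in timestamp order, e.g. with a recursive CTE or a procedural loop.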
I am reaching out for help coding this in PostgreSQL. Please feel free to suggest alternative approaches to solve the above problem.
The query below should answer the need.
The first query identifies the rows that correspond to a change of data.
The second query groups the rows between two successive changes of data and sets up the corresponding range of timestamps.
The third query is a recursive query which calculates new_data iteratively, in timestamp order.
The last query displays the expected result.
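For reproduction, the query assumes a table along these lines (column names inferred from the query; types assumed):

```sql
CREATE TABLE sensors (
    no        integer,
    timestamp timestamp,
    data      integer
);
```

It also assumes a fixed 15-minute sampling interval, since consecutive ranges are joined via `upper(p.range) + interval '15 minutes'`.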
```sql
WITH RECURSIVE list AS
(
  SELECT no
       , timestamp
       , lag(data)  OVER w AS previous
       , data
       , lead(data) OVER w AS next
       , data IS DISTINCT FROM lag(data)  OVER w AS first
       , data IS DISTINCT FROM lead(data) OVER w AS last
    FROM sensors
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
), range_list AS
(
  SELECT tsrange(timestamp, lead(timestamp) OVER w, '[]') AS range
       , previous
       , data
       , lead(next) OVER w AS next
       , first
    FROM list
   WHERE first OR last
  WINDOW w AS (ORDER BY timestamp ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
), rec_list (range, previous, data, next, new_data, arr) AS
(
  SELECT range
       , previous
       , data
       , next
       , data
       , array[range]
    FROM range_list
   WHERE previous IS NULL
  UNION ALL
  SELECT c.range
       , p.data
       , c.data
       , c.next
       , CASE
           WHEN p.new_data IS NOT DISTINCT FROM c.next
           THEN p.data
           ELSE c.data
         END
       , p.arr || c.range
    FROM rec_list AS p
   INNER JOIN range_list AS c
      ON lower(c.range) = upper(p.range) + interval '15 minutes'
   WHERE NOT array[c.range] <@ p.arr
     AND first
)
SELECT s.*, r.new_data
  FROM sensors AS s
 INNER JOIN rec_list AS r
    ON r.range @> s.timestamp
 ORDER BY timestamp;
```
See the test result in dbfiddle.