I have a kind of log table that I would like to transform using SQL Server - adding a validity interval (valid-from, valid-to fields) to records.
The source table is rather large (100K+ records), so an optimal solution would be required. It is acutally a table that tracks content (status) changes in a field for individual records.
Illustration of the source table:
RecordId | DateStamp | OldValue | Newvalue |
---|---|---|---|
1 | 2021.01.01. | A Value | Another Value |
1 | 2021.02.01 | Another Value | Yet Another Value |
1 | 2021.03.01 | Yet Another Value | A Value |
1 | 2021.04.01 | A Value | Yet Another Value |
4 | 2021.01.01 | B Value | Next Value |
4 | 2021.02.01 | Next Value | Second Next Value |
4 | 2021.03.01 | Second Next Value | Final Value |
RecordId
refers to the record on which the field change has been made. OldValue
and NewValue
refers to the value of the field before and after the change has been made.
Changes are consecutive. RecordId
and OldValue
and NewValue
fields are arbitrary.
Value
fields can switch back and forth between statuses.
Definition of the source table:
CREATE TABLE Log (
RecordId INT
, DateStamp DATE
, OldValue VARCHAR(255)
, NewValue VARCHAR(255)
);
Example data for table:
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (1, '2021-01-01', 'A Value', 'Another Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (1, '2021-02-01', 'Another Value', 'Yet Another Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (1, '2021-03-01', 'Yet Another Value', 'A Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (1, '2021-04-01', 'A Value', 'Yet Another Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (4, '2021-01-01', 'B Value', 'Next Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (4, '2021-02-01', 'Next Value', 'Second Next Value');
INSERT INTO Log (RecordId, DateStamp, OldValue, NewValue) VALUES (4, '2021-03-01', 'Second Next Value', 'Final Value');
Illustration of the desired output table with date slicing - so a querying for a status on a given date is possible using the validity dates:
RecordId | Value | ValidFromDate | ValidToDate |
---|---|---|---|
1 | A Value | 1900.01.01. (or null) | 2020.12.31. |
1 | Another Value | 2021.01.01. | 2021.01.31. |
1 | Yet Another Value | 2022.02.01 | 2022.02.28. |
1 | A Value | 2022.03.01 | 2022.03.31. |
1 | Yet Another Value | 2022.04.01 | 9999.12.31 (or null) |
4 | B Value | 1900.01.01 (or null) | 2021.12.31 |
4 | Next Value | 2022.01.01 | 2022.01.31. |
4 | Second Next Value | 2022.02.01 | 2022.02.28. |
4 | Final | 2022.03.01 | 9999.12.31 (or null) |
Is there an efficient way to do this?
One method would be to unpivot the first row, to get 2 rows for the "old" and "new" value, and then you can use LEAD
to get the valid to date. You can determine the first row, by using ROW_NUMBER
in a CTE (or derived table), and use the WHERE
to only return the 1st row, or the "new" rows:
WITH CTE AS(
SELECT RecordId,
DateStamp,
OldValue,
NewValue,
ROW_NUMBER() OVER (PARTITION BY RecordID ORDER BY DateStamp) AS RN
FROM (VALUES(1,CONVERT(date,'20210101'),'A Value ','Another Value'),
(1,CONVERT(date,'20210201'),'Another Value ','Yet Another Value'),
(1,CONVERT(date,'20210301'),'Yet Another Value ','A Value'),
(1,CONVERT(date,'20210401'),'A Value ','Yet Another Value'),
(4,CONVERT(date,'20210101'),'B Value ','Next Value'),
(4,CONVERT(date,'20210201'),'Next Value ','Second Next Value'),
(4,CONVERT(date,'20210301'),'Second Next Value ','Final Value'))V(RecordId,DateStamp,OldValue,NewValue))
SELECT C.RecordId,
V.Value AS Value,
CASE V.ValueType WHEN 'New' THEN C.DateStamp END AS ValidFromDate,
CASE V.ValueType WHEN 'New' THEN LEAD(DATEADD(DAY, -1, C.DateStamp)) OVER (PARTITION BY C.RecordID ORDER BY C.DateStamp) ELSE DATEADD(DAY, -1, C.DateStamp) END AS ValidToDate
FROM CTE C
CROSS APPLY (VALUES(C.OldValue, 'Old'),(C.NewValue, 'New'))V(Value,ValueType)
WHERE RN = 1
OR ValueType = 'New';