I have a base table like below:
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
Upd_dt URL Score
2019-07-26 A x
2019-07-26 B alpha
2019-08-01 A y
2019-08-01 B beta
2019-08-03 A z
2019-08-03 B gamma
And I want to create a table in daily-url level, using most previous date's value for the new rows, result should look like below:
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-07-27','A','x'
UNION ALL SELECT DATE '2019-07-27','B','alpha'
UNION ALL SELECT DATE '2019-07-28','A','x'
UNION ALL SELECT DATE '2019-07-28','B','alpha'
UNION ALL SELECT DATE '2019-07-29','A','x'
UNION ALL SELECT DATE '2019-07-29','B','alpha'
UNION ALL SELECT DATE '2019-07-30','A','x'
UNION ALL SELECT DATE '2019-07-30','B','alpha'
UNION ALL SELECT DATE '2019-07-31','A','x'
UNION ALL SELECT DATE '2019-07-31','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-02','A','y'
UNION ALL SELECT DATE '2019-08-02','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
UNION ALL SELECT DATE '2019-08-04','A','z'
UNION ALL SELECT DATE '2019-08-04','B','gamma'
UNION ALL SELECT DATE '2019-08-05','A','z'
UNION ALL SELECT DATE '2019-08-05','B','gamma'
)
Which looks like:
Upd_dt URL Score
2019-07-26 A x
2019-07-26 B alpha
2019-07-27 A x
2019-07-27 B alpha
2019-07-28 A x
2019-07-28 B alpha
2019-07-29 A x
2019-07-29 B alpha
2019-07-30 A x
2019-07-30 B alpha
2019-07-31 A x
2019-07-31 B alpha
2019-08-01 A y
2019-08-01 B beta
2019-08-02 A y
2019-08-02 B beta
2019-08-03 A z
2019-08-03 B gamma
2019-08-04 A z
2019-08-04 B gamma
2019-08-05 A z
2019-08-05 B gamma
.
.
.
Current process is: I built a daily dimension table since 7/26/2019 till today by:
/* SELECT CAST(slice_time AS DATE) dates FROM testcalendar mtc TIMESERIES slice_time as '1 day' OVER (ORDER BY CAST(mtc.dates as TIMESTAMP)); */
so I get:
Dates
2019-07-26
2019-07-27
2019-07-28
2019-07-29
.
.
.
2019-10-12 (today)
I'm thinking if I can use function such as "interpolate previous value" to join my first table by dates, to generate missing days by using values from most previous date data, while it failed.
The result didn't generate rows for missing days.
Please let me know if anyone has any better idea on this.
Thanks!
As a starting warning : only store a "daily photograph" when it really, really is necessary. In my past, I once ended up having 364 rows too many per year, as the values only changed once a year. In Vertica, that costs license, and CPU and clock time for joining and grouping ...
But, for the rest - Good start.
But you could apply the TIMESERIES without having to build a calendar.
The trick is to "extrapolate" manually what you can INTERPOLATE
automatically.
Add an in-line 'padding' table, which contains the newest value per URL, but give it CURRENT_DATE
instead of the newest actual date - using Vertica's peculiar analytic limit clause LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
.
UNION SELECT that padding table with your input, and apply the TIMESERIES clause to that UNION SELECT.
Like so:
WITH
-- your input ...
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
-- real WITH clause would start here ...
,
-- newest row per Url, just with current date
pad_newest AS (
SELECT
CURRENT_DATE
, url
, score
FROM score_upd
LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
)
,
with_newest AS (
SELECT
*
FROM score_upd
UNION ALL
SELECT *
FROM pad_newest
)
SELECT
ts_dt::DATE AS upd_dt
, url AS url
, TS_FIRST_VALUE(score) AS score
FROM with_newest
TIMESERIES ts_dt AS '1 day' OVER (
PARTITION BY url ORDER BY upd_dt::TIMESTAMP
)
ORDER BY 1,2
;