I have a dataset which looks like below:
ITEM CITY START_Y START_W FIRST_USE_Y FIRST_USE_W VALUE
A NEW YORK 2023 30 2023 32 15000
A LONDON 2024 2 2024 2 12000
A LONDON 2024 2 2024 5 50000
B NEW YORK 2023 49 2024 1 19540
B MADRID 2023 10 2023 11 15444
First, the combination of ITEM and CITY need to be grouped. Then for each group, I want to resample weekly up to 5 datapoints and fill the 'VALUE' column with zero where there is no value for the combination of FIRST_USE_Y and FIRST_USE_W columns. START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).
I tried with pandas and a for loop; it worked. But as it is a very large dataset with many millions of rows and I am bound to do with SQL(where I am a newbie). This is the code I tried:
WITH RECURSIVE weekly_intervals AS (
SELECT MIN(start_w) AS start_w, MAX(start_w) AS end_w
FROM citywise_values
UNION ALL
SELECT start_w + INTERVAL 1 WEEK, end_w
FROM weekly_intervals
WHERE start_w + INTERVAL 1 WEEK <= end_w
),
filled_values AS (
SELECT
w.item,
w.city,
w.start_y,
w.start_w,
COALESCE(cv.value, 0) AS value
FROM
(SELECT
item,
city,
start_y,
start_w
FROM
citywise_values
GROUP BY
item, city) w
LEFT JOIN
citywise_values cv ON w.item = cv.item
AND w.city = cv.city
AND w.start_y = cv.start_y
AND w.start_w = cv.start_w
)
SELECT
item,
city,
start_y,
start_w,
COALESCE(value, LAG(value) OVER (PARTITION BY item, city, start_y ORDER BY start_w)) AS value
FROM
filled_values
RIGHT JOIN
weekly_intervals
ON
filled_values.start_w = weekly_intervals.start_w
ORDER BY
item, city, start_y, start_w
Then I tried with a cross join and was able to produce the result only for one single combination of the ITEM and CITY. But I could not find how to do for the whole dataset.
I am not sure that I could explain it well or not. So, I am posting the desired output which I created manually.
ITEM CITY START_Y START_W FIRST_USE_Y FIRST_USE_W VALUE
A NEW YORK 2023 30 2023 30 0
A NEW YORK 2023 30 2023 31 0
A NEW YORK 2023 30 2023 32 15000
A NEW YORK 2023 30 2023 33 0
A NEW YORK 2023 30 2023 34 0
A LONDON 2024 2 2024 2 12000
A LONDON 2024 2 2024 3 0
A LONDON 2024 2 2024 4 0
A LONDON 2024 2 2024 5 50000
A LONDON 2024 2 2024 6 0
B NEW YORK 2023 49 2023 49 0
B NEW YORK 2023 49 2023 50 0
B NEW YORK 2023 49 2023 51 0
B NEW YORK 2023 49 2023 52 0
B NEW YORK 2023 49 2024 1 19540
B MADRID 2023 10 2023 10 0
B MADRID 2023 10 2023 11 15444
B MADRID 2023 10 2023 12 0
B MADRID 2023 10 2023 13 0
B MADRID 2023 10 2023 14 0
Any help will be appreciated.
There are three items here that make this query tricky:
generate_series()
in SQL Server 2022 or a recursive CTE)I also see this:
START_W and FIRST_USE_W are the number of week of the year(value can be from 1 to 52).
There are more than 52 weeks in a year!
Every year will have a partial week number 53 with at least one or two days. You need to be able to account for this.
I came up with this, which even used a CROSS JOIN:
WITH ItemCity As (
SELECT Item, City, MIN( DATEADD(day, Start_W*7, DATEFROMPARTS(Start_Y, 1, 1)) ) As StartWeek
FROM Data
GROUP BY Item, City
),
ItemCityWeeks As (
SELECT Item,City, StartWeek
,Year(StartWeek) As Start_Y,datepart(week, StartWeek)-1 As Start_W
,YEAR(DATEADD(day, Weeks.num*7, StartWeek)) As First_Use_Y
,DATEPART(dayofyear, DATEADD(day, Weeks.num*7, StartWeek))/7 As First_Use_W
FROM ItemCity
CROSS JOIN ( VALUES (0), (1), (2), (3), (4)) Weeks(num)
)
SELECT icw.Item, icw.City
, icw.Start_Y, icw.Start_W, icw.First_Use_Y, icw.First_Use_W
, coalesce(d.value, 0) as Value
FROM ItemCityWeeks icw
LEFT JOIN Data d ON d.Item = icw.Item AND d.City = icw.City
and d.First_Use_Y = icw.First_Use_Y and d.First_Use_W = icw.First_Use_W
ORDER BY Item, City DESC
See it work here:
Also note my First Use weeks are off by the one for the first city. I believe this is an error in the manually-created sample results, as the other cities all start with the same week as the Start_W, and this one city starts one week later.