sql-servert-sqlquery-optimization

Fastest way to find which hours have 0 rows in a table with a month's data?


I have a Very Large/slow table (35gb data, 70gb indexes) on an Azure SQL MI that's roughly this:

create table a (id int identity, mydate datetime, fielda varchar(10), fieldb money, fieldc char(50), fieldd datetime)
create index ncidx__a__mydate on a (mydate)

I want to get a list of which hours have 0 rows, though which hours have any rows is close enough (bang it against a numbers table and you can get a list). I was trying to think of the bare minimum needed.

I could do something like this, but would need 30*24 of them for each month. Seems like something a function with a cross-apply could do.

SELECT * FROM (SELECT '20240703 1am test' res) a WHERE EXISTS (SELECT * FROM dbo.mytable WHERE mydate >='20240701 01:00:00' AND mydate <'20240701 02:00:00')

I came up with this (based off an Itzik Ben-Gan query), which... works, and is good enough for me for now. (3 minutes for a month, 200m rows, 34gb data, 70gb of indexes including the one you see. It does 715992 logical reads).

SELECT orderday, orderhour, MIN(mydate) AS mindate
FROM mytable
CROSS APPLY (VALUES(day(mydate), datepart(HOUR,mydate)))
               AS D(orderday, orderhour)
WHERE mydate >='20240501' AND mydate <'20240601'
GROUP BY orderday, orderhour

But is there a faster way? I don't need a min date, I just need to know if there are ANY rows in there. It seems like it could be almost instant, but the logic of doing that is beyond me. Thanks.


Solution

  • You are on the right track with the mydate index. Most solutions would likely involve either a loop join and an index seek (probe) into that index, or a range scan of the relevant portion of the index (better than a table scan) that is fed to logic that looks for gaps.

    The most obvious and possibly the most efficient solution would be to generate a range of data/time values that is then filtered by a where not exists(...) clause. Something like:

    declare @FromDate datetime = '2024-07-01'
    declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
    declare @DiffHours int = datediff(hour, @FromDate, @ToDate)
    
    -- Using not exists
    -- Loop join and index seek (probe)
    select h.Hour
    from generate_series(0, @DiffHours - 1) s
    cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
    where not exists (
        select *
        from a
        where a.mydate >= h.Hour
        and a.mydate < dateadd(hour, 1, h.Hour)
    )
    order by h.Hour
    

    For a one month range, the index would be probed by the exists clause about 750 times.

    Another approach would be to scan the date range, reduce it to a set of distinct hours values and then subtract those values from a generated list of all hours. The subtract operation could be performed using except or with a left join ... where right-side is null construct.

    declare @FromDate datetime = '2024-07-01'
    declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
    declare @DiffHours int = datediff(hour, @FromDate, @ToDate)
    
    -- Using except
    -- Index seek (range scan) and merge join (or spool + loop join)
    select h.Hour
    from generate_series(0, @DiffHours - 1) s
    cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
    except
    select distinct datetrunc(hour, a.mydate) as Hour
    from a
    where a.mydate >= @FromDate
    and a.mydate < @ToDate
    order by Hour
    
    -- Using left join
    -- Index seek (range scan) and merge join (or spool + loop join)
    -- (Slighly different from the "except" plan)
    select h.Hour
    from generate_series(0, @DiffHours - 1) s
    cross apply (select dateadd(hour, s.value, @FromDate) as Hour) h
    left join (
        select distinct datetrunc(hour, a.mydate) as Hour
        from a
        where a.mydate >= @FromDate
        and a.mydate < @ToDate
    ) h2
        on h2.Hour = h.Hour
    where h2.Hour is null
    order by H.Hour
    

    Finally, an ordered scan of the date range can be performed, checking for gaps using the LAG() window function to compare the current value to the prior. A UNION ALL is included to ensure that we have the end-cases (first and last hour) covered.

    There might be a slight performance gain if the UNION operations are factoring out to then next outer query level, since data is already sorted, but the cost is two more index seeks. Both versions follow.

    declare @FromDate datetime = '2024-07-01'
    declare @ToDate datetime = dateadd(month, 1, @FromDate) -- Exclusive
    
    -- Index seek (range scan), using lag() to check for gaps.
    -- A union-all is used to ensure we have the end-cases covered.
    select
        dd.MissingHours,
        dateadd(hour, 1, dp.PriorHour) as FromHour,
        dateadd(hour, -1, Hour) as ThruHour
    from (
        select
            d.Hour,
            lag(d.Hour) over(order by d.Hour) as PriorHour 
        from (
            select dateadd(hour, -1, @FromDate) as Hour
            union all 
            select @ToDate as Hour
            union all 
            select datetrunc(hour, a.mydate) as Hour
            from a
            where a.mydate >= @FromDate
            and a.mydate < @ToDate
        ) d
    ) dp
    cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
    where dd.MissingHours >= 1
    order by dp.PriorHour
    
    -- Index seek (range scan), using lag() to check for gaps.
    -- Factoring out this union /might/ yield a slight performance gain,
    -- since data is already sorted, but the cist is two more index seeks.
    select
        dd.MissingHours,
        dateadd(hour, 1, dp.PriorHour) as FromHour,
        dateadd(hour, -1, Hour) as ThruHour
    from (
        select
            d.Hour,
            lag(d.Hour) over(order by d.Hour) as PriorHour 
        from (
            select datetrunc(hour, a.mydate) as Hour
            from a
            where a.mydate >= @FromDate
            and a.mydate < @ToDate
        ) d
        union all
        select
            datetrunc(hour, min(a.mydate)) as Hour,
            dateadd(hour, -1, @FromDate) as PriorHour
        from a
        where a.mydate >= @FromDate
        and a.mydate < @ToDate
        union all
        select
            @ToDate as Hour,
            datetrunc(hour, max(a.mydate)) as PriorHour
        from a
        where a.mydate >= @FromDate
        and a.mydate < @ToDate
    ) dp
    cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
    where dd.MissingHours >= 1
    order by dp.PriorHour
    

    The above yields ranges instead of individual rows for each missing hour. This might be desirable. If not, the above can be modified with a GENERATE_SERIES() and DATEADD() to expand each resulting range.

    select dateadd(hour, s.value, dp.PriorHour) as Hour
    from (
        ...
    ) dp
    cross apply (select datediff(hour, dp.PriorHour, dp.Hour) - 1 as MissingHours) dd
    cross apply generate_series(1, dd.MissingHours) s
    where dd.MissingHours >= 1
    order by Hour
    

    See this db<>fiddle for a demo of each of the above with execution plans using some generated test data having 1-, 3-, and 5-hour gaps.

    You can run performance tests against your live data to see what works best for you.

    Sample results (discrete hours and hour ranges):

    Hour
    2024-07-05 04:00
    2024-07-11 10:00
    2024-07-11 11:00
    2024-07-11 12:00
    2024-07-11 13:00
    2024-07-22 21:00
    2024-07-22 22:00
    2024-07-22 23:00
    2024-07-23 00:00
    2024-07-23 01:00
    2024-07-23 02:00
    MissingHours FromHour ThruHour
    1 2024-07-05 04:00 2024-07-05 04:00
    3 2024-07-11 10:00 2024-07-11 12:00
    5 2024-07-22 21:00 2024-07-23 01:00

    OLDER SQL SERVER VERSIONS

    The above SQL solutions use the DATE_TRUNC() and/or GENERATE_SERIES() functions available in SQL Server 2022 and later.

    For earlier versions of SQL Server, the DATE_TRUNC(hour, xxx) function can be replaced with the calculation dateadd(hour, datediff(hour, 0, xxx), 0). The GENERATE_SERIES() function can be replaced with a number generator wrapped up in a CTE or cross apply. There are many techniques out there, but for this application, you can use a variation of the following:

    cross apply (
        select top (n) -- 1..n
            row_number() over(order by (select null)) as value
        from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n1(n) -- Up to 10
        cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n2(n) -- Up to 100
        cross join (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) n3(n) -- Up to 1000
    ) s
    

    The row_number() function produces numbers starting at 1. Subtracting 1 and adjusting the top (n) value is needed for zero-based or other-based ranges. Add more cross joins if more than 1000 values are needed.

    See this db<>fiddle for a demo that works for earlier SQL Server versions back to at least 2014.