sqlsql-servert-sqlsql-server-2016

Efficiently comparing DATE and DATETIME columns in a SQL query


I've got tables with millions (possibly billions) of rows, so I really need to be efficient with the query.

In this query, I'm joining multiple tables. The segment in question is:

LEFT JOIN 
    dbo.GCSOCTPS dbo_GCSOCTPS ON (GC_TBMED.MED_CLASS_NUM = dbo_GCSOCTPS.CLASS_NUM) 
                              AND (GC_TBMED.MED_SOC_NUM = dbo_GCSOCTPS.SOC_NUM)
                              AND (GC_TBMED.MED_EFF_DATE = dbo_GCSOCTPS.EFF_DATE)
                              AND (GC_TBMED.MED_CANC_DATE = dbo_GCSOCTPS.CANC_DATE)) 

GC_TBMED has dates in DATE format, dbo_GCSOCTPS has dates in DATETIME format. Sadly, due to the way our company uses data, I can't change that.

What would be the most efficient way to compare these columns? CAST? CONVERT? I've seen people convert to Text format and compare that way. I'm hoping someone could guide me based on their experience with very large datasets.

If I run this block of code, I get a value for HDHPQ:

SELECT TOP 200 
    HDHPQ,
    SOC_NUM, 
    EFF_DATE,
    CLASS_NUM,
    CANC_DATE
FROM
    dbo.GCSOCTPS
WHERE
    SOC_NUM = '25521'
    AND CLASS_NUM = '37'
    AND CANC_DATE IS NULL;

This is returned:

HDHPQ    SOC_NUM    EFF_DATE                  CLASS_NUM    CANC_DATE 
N         25521    2025-01-01 00:00:00.000      37         NULL

If I run this block of code, I get data returned as well:

SELECT TOP 200 
    MED_SOC_NUM,
    MED_EFF_DATE,
    MED_CLASS_NUM,
    MED_CANC_DATE
FROM
    [dbo].[AS_tblTBMED] GC_TBMED
WHERE 
    GC_TBMED.MED_SOC_NUM = '25521'
    AND GC_TBMED.MED_CLASS_NUM = '37'
    AND GC_TBMED.MED_CANC_DATE IS NULL;

This gets returned:

MED_SOC_NUM    MED_EFF_DATE    MED_CLASS_NUM    MED_CANC_DATE
25521        2025-01-01        37                NULL

Each of those returns one row. I need to join them so that I get all of the data from the second query and the value of HDHPQ from the first query.

So I run this query:

SELECT DISTINCT TOP 200 
    dbo_GCSOCTPS.HDHPQ,
    dbo_GCSOCTPS.SOC_NUM, 
    dbo_GCSOCTPS.EFF_DATE,
    dbo_GCSOCTPS.CLASS_NUM,
    dbo_GCSOCTPS.CANC_DATE,
    GC_TBMED.MED_SOC_NUM,
    GC_TBMED.MED_EFF_DATE,
    GC_TBMED.MED_CLASS_NUM,
    GC_TBMED.MED_CANC_DATE
FROM
    [dbo].[AS_tblTBMED] GC_TBMED
LEFT JOIN 
    dbo.GCSOCTPS dbo_GCSOCTPS ON (GC_TBMED.MED_CLASS_NUM = dbo_GCSOCTPS.CLASS_NUM) 
                              AND (GC_TBMED.MED_SOC_NUM = dbo_GCSOCTPS.SOC_NUM)
                              AND (GC_TBMED.MED_EFF_DATE = CAST(dbo_GCSOCTPS.EFF_DATE as DATE))
                              AND (GC_TBMED.MED_CANC_DATE = CAST(dbo_GCSOCTPS.CANC_DATE as DATE)) 
WHERE
    GC_TBMED.MED_SOC_NUM = '25521'
    AND GC_TBMED.MED_CLASS_NUM = '37'
    AND GC_TBMED.MED_CANC_DATE IS NULL
    AND dbo_GCSOCTPS.EFF_DATE >= '2025-01-01';

And an empty recordset is returned. If I comment out the two dates in the join, I get data. So, my assumption is that the date fields aren't properly equating, since they are equal and therefore I should get data if they're included in the query.


Solution

  • You have (at least) two problems here:

    1. LEFT JOIN and WHERE on dbo_GCSOCTPS make things into INNER JOIN
    2. You're joining on NULL fields, but <anything> = NULL will never be true so the join will fail.

    You probably need something along the lines of:

    select DISTINCT top 200 
    gcs.HDHPQ,
    gcs.SOC_NUM, 
    gcs.EFF_DATE,
    gcs.CLASS_NUM,
    gcs.CANC_DATE,
    tbm.MED_SOC_NUM,
    tbm.MED_EFF_DATE,
    tbm.MED_CLASS_NUM,
    tbm.MED_CANC_DATE
    from [dbo].[AS_tblTBMED] tbm
    LEFT JOIN dbo.GCSOCTPS gcs 
        ON tbm.MED_CLASS_NUM = gcs.CLASS_NUM
        AND tbm.MED_SOC_NUM = gcs.SOC_NUM
        AND tbm.MED_EFF_DATE = CAST(gcs.EFF_DATE as DATE)
        AND gcs.CANC_DATE IS NULL
    where tbm.MED_SOC_NUM = '25521'
    and tbm.MED_CLASS_NUM = '37'
    and tbm.MED_CANC_DATE is NULL
    and tbm.MED_EFF_DATE >= '20250101';
    

    There are a couple of more suspicious lines though:

    1. CAST(gcs.EFF_DATE as DATE) if EFF_DATE doesn't contain hours etc, you can skip the cast probably
    2. tbm.MED_CLASS_NUM = '37' why do you use '37' and not just 37 for columns which seems to be a number?
    3. DISTINCT and TOP is seldom correct combination, if you have duplicates, you probably need to rethink your query