sqlsql-serverperformancesql-server-2008-r2sql-tuning

High performance query to get datetime when value was last changed


The data I am working with

Consider the following 2 database tables:

CREATE TABLE [dbo].[Contact](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Contact_UID] [uniqueidentifier] NOT NULL CONSTRAINT [DF_Contact_Contact_UID]  DEFAULT (newsequentialid()),
    [Name] [nvarchar](255) NOT NULL,
    [ContactStatus] [nvarchar](255) NOT NULL)

CREATE TABLE [dbo].[Contact_Log](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [LogDate] [datetimeoffset](7) NOT NULL CONSTRAINT [DF_Contact_Log_LogDate]  DEFAULT (sysdatetimeoffset()),
    [Contact_UID] [uniqueidentifier] NOT NULL CONSTRAINT [DF_Contact_Log_Contact_UID]  DEFAULT (newsequentialid()),
    [Name] [nvarchar](255) NOT NULL,
    [ContactStatus] [nvarchar](255) NOT NULL)

The Contact table is the main table for contact records. It stores the name of the contact and the status (e.g. "Alive", "Dead" or whatever).

The Contact_Log table stores all changes made to the Contact table.

So here is some sample data:

Contact:

+----+--------------------------------------+------+---------------+
| ID | Contact_UID                          | Name | ContactStatus |
+----+--------------------------------------+------+---------------+
| 1  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          |
+----+--------------------------------------+------+---------------+
| 2  | F7844037-2FF5-47B9-874D-C0920E7DC092 | Jane | Alive         |
+----+--------------------------------------+------+---------------+

Contact_Log:

+----+--------------------------------------+------+---------------+------------+
| ID | Contact_UID                          | Name | ContactStatus | LogDate    |
+----+--------------------------------------+------+---------------+------------+
| 1  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Alive         | 2019-01-01 |
+----+--------------------------------------+------+---------------+------------+
| 2  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          | 2019-01-02 |
+----+--------------------------------------+------+---------------+------------+
| 3  | 62918AC1-1C6C-4DEB-B7F8-5D5EF913F667 | John | Dead          | 2019-01-03 |
+----+--------------------------------------+------+---------------+------------+
| 4  | F7844037-2FF5-47B9-874D-C0920E7DC092 | Jane | Alive         | 2019-01-04 |
+----+--------------------------------------+------+---------------+------------+

NOTE: At this point I haven't added any indexes or anything like that on that tables.

Test Scenario

The above is just some sample data. The data I am testing on has the following row counts:

Contact: ~10,000 rows

Contact_Log: ~3,000,000 rows

I am currently using SQL Server 2008 R2 for testing. So a solution that is supported in that and later is preferred.

What I am trying to achieve

Basically I am trying to formulate a query that can tell me the LogDate for when the ContactStatus field was last changed, for a specific Contact_UID, taken from the Contact_Log table.

For example, if the record I was interested in was "John", then the result should be "2019-01-02". As this is the date that John's ContactStatus last changed (i.e. it changed from "Alive" to "Dead").

Ultimately, I want to put this query into a function. A function that can be called by passing in the Contact_UID and the name of the field I want to check. This function could then be called as part of a more general query. For example:

SELECT Name, MyFunction('62918AC1-1C6C-4DEB-B7F8-5D5EF913F667', 'ContactStatus') AS StatusLastChanged FROM Contact

What I have tried so far

Well, I have tried a few things, and although I can get the results I want. My attempts are really struggling with performance problems.

NOTE: Although I only really want a single datetimeoffset result. Some attempts include more data/fields simply to try and validate the data is accurate.

Attempt 1:

SELECT TOP(1) a.LogDate
FROM Contact_Log AS a
WHERE a.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND a.ContactStatus <>
(
SELECT TOP(1) b.ContactStatus
FROM Contact_Log AS b
WHERE b.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND a.LogDate > b.LogDate
ORDER BY b.LogDate DESC
)
ORDER BY LogDate DESC

Problem 1: Too slow. I had to stop the query after nearly an hour of waiting with no results.

Attempt 2:

SELECT A.LogDate
FROM (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) A
LEFT JOIN (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) B
ON A.rnum = B.rnum-1
WHERE 
(B.rnum IS NULL
OR (A.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND B.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND A.ContactStatus != B.ContactStatus))
ORDER BY A.rnum

Problem 2: This works and gives me the correct data set. However it takes 6 seconds which is too slow. Remember it will need to work as a function in a more general query (with ~10,000 rows).

Attempt 3: Now this is basically the same as attempt 2, expect I tried to apply TOP(1) so that I can get the result that I actually want.

SELECT TOP(1) A.LogDate
FROM (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) A
LEFT JOIN (SELECT ROW_NUMBER() OVER (ORDER BY LogDate DESC, ID DESC) AS rnum, ID, LogDate, Contact_UID, ContactStatus FROM Contact_Log) B
ON A.rnum = B.rnum-1
WHERE 
(B.rnum IS NULL
OR (A.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667' 
AND B.Contact_UID = '62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
AND A.ContactStatus != B.ContactStatus))
ORDER BY A.rnum

Problem 3: To my surprise, this takes a lot longer than attempt 2, even though all I did was add TOP(1) at the start. This took over 5 minutes so I stopped the query and gave up.

Question

How can I do what I want in "What I am trying to achieve", but with a reasonable amount of performance? (I would be happy getting it under 1 second at this stage).

Remember, I just want a single datetimeoffset as the result so it can be used in a function.

So far I have no specific indexes created. I am happy to consider suggestions for those as a suitable answer, if no improvement to the query are possible. Or any changes to the schema that is appropriate.

Bottom Line

I am looking for a query that will produce 1 result, with 1 datetimeoffset field. It needs to take less than 1 second to run.


Solution

  • You want to select the smallest date right after the highest date that does not equal the current ContactStatus. That would be something like this:

    select
    min(LogDate)
    from Contact_Log
    where
    Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
    and ContactStatus = (
            select top 1
            ContactStatus
            from Contact_Log where
            Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
            order by Log_Date desc
            )
    and LogDate > (
        select max(LogDate)
        from Contact_Log
        where Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
        and ContactStatus != (
            select top 1
            ContactStatus
            from Contact_Log where
            Contact_UID='62918AC1-1C6C-4DEB-B7F8-5D5EF913F667'
            order by Log_Date desc
            )
        );