sql-server database-design database-performance sql-server-2014 database-tuning

Do gaps in the Identity (primary key clustered) table affects performance of database?

Alright here comes the million dollar question

Assume that i have the following table

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

SET ANSI_PADDING ON
GO

CREATE TABLE [dbo].[tblUsersProfile](
    [personId] [int] IDENTITY(1,1) NOT NULL,
    [personName] [varchar](16) NOT NULL,
    [personSurName] [varchar](16) NOT NULL,
 CONSTRAINT [PK_tblUsersProfile] PRIMARY KEY CLUSTERED 
(
    [personId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING OFF
GO

Now lets say this table has 2 million records So the next identity id is 2,000,001

However i am making mass delete and the number of records becomes 45,321 However next identity id will be 2,000,001 still

So reordering table with recreation would make any difference

In the first case there will be 45,321 records however identity id will be 2,000,001

In the second case there will be again 45,321 records however identity id will be 45,322

So would there be any performance, storage etc difference between these 2 cases?

Thank you

SQL Server 2014

Solution

Expanding on my comment to explain a bit further. For clarity the comment was:

No, there will be no impact on performance. Since this is your clustering key, whether or not the seed is 45,322, or 2,000,0001, the record will still be entered onto the next free space on the clustered index after the record 45,321. The value of an Identity column is intended to be meaningless, if it is not you are probably not using correctly. After a big delete like that you may end up with some index fragmentation, but the identity seed is completely unrelated to this.

With regard to the fragmentation, on a very simplified example you might have a table with 5 pages, with 100 records per page:

Page 1 (IDs 1 - 100)
Page 2 (IDs 101 - 200)
Page 3 (IDs 201 - 300)
Page 4 (IDs 301 - 400)
Page 5 (IDs 401 - 500)

Now, if you do your delete, and remove all records where the last digit is not 1, and all records where the ID is over 300 you get:

Page 1 (IDs 1, 11, 21, 31, 41, 51, 61, 71, 81, 91)
Page 2 (IDs 11, 111, 121, 131, 141, 151, 161, 171, 181, 191)
Page 3 (IDs 21, 211, 221, 231, 241, 251, 261, 271, 281, 291)
Page 4 (Empty)
Page 5 (Empty)

When we now insert to this table, whether the next identity is 291, or 501, it doesn't change anything. The pages must remain in the correct order, so the highest ID is 291, so the next record must be inserted after that, on the same page if there is space, otherwise a new page is created. In this case there are 9 empty slots on the 3rd page, so the next record can be inserted there. Since both 292, and 500 are higher than 291, the behaviour is the same.

In both cases the issue remains that after the delete you have 3 pages with lots of free space (only 10% full), you now only have 30 records, which will happily fit on one page, so you could rebuild your index to do this, so that now you only need to read a single page to get all your data.

Again, I stress, that this is a very simple example, and I would not suggest rebuilding a clustered index to free up 2 pages!

It is also important to stress that this behaviour is because the ID column is the clustering key, NOT the primary key. They are not necessarily one and the same, however, if you were clustering on something other than your identity column it would still make no difference to performance if you were to reseed it or not after the delete. Identity columns are there for purely that, identity, the actual value is irrelevant as long as you can uniquely identify a row.

SAMPLE TEST CODE

-- CREATE TABLE AND FILL WITH 100,000 ROWS
IF OBJECT_ID(N'dbo.DefragTest', 'U') IS NOT NULL
    DROP TABLE dbo.DefragTest;

CREATE TABLE dbo.DefragTest (ID INT IDENTITY(1, 1) PRIMARY KEY, Filler CHAR(1) NULL);
INSERT dbo.DefragTest (Filler)
SELECT TOP 100000 NULL
FROM sys.all_objects AS a, sys.all_objects AS b;


-- CHECK PAGE STATISTICS
SELECT  Stage = 'After Initial Insert',
        IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
        p.rows, 
        a.total_pages,
        a.data_pages,
        AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM    sys.partitions AS p
        LEFT JOIN sys.allocation_units AS a
            ON a.container_id = p.partition_id
WHERE   p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND     p.index_id IN (0, 1); -- CLUSTERED OR HEAP

-- DELETE RECORDS
DELETE  dbo.DefragTest
WHERE   ID % 10 != 1
OR      ID > 50000;

-- CHECK PAGE STATISTICS
SELECT  Stage = 'After Delete',
        IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
        p.rows, 
        a.total_pages,
        a.data_pages,
        AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM    sys.partitions AS p
        LEFT JOIN sys.allocation_units AS a
            ON a.container_id = p.partition_id
WHERE   p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND     p.index_id IN (0, 1); -- CLUSTERED OR HEAP  

-- RESEED (REMOVED FOR ONE RUN)
DBCC CHECKIDENT ('dbo.DefragTest', RESEED, 50000);

--INSERT ROWS TO SEE EFFECT ON PAGE
INSERT dbo.DefragTest (Filler)
SELECT TOP 10000 NULL
FROM sys.all_objects AS a;

-- CHECK PAGE STATISTICS
SELECT  Stage = 'After Second Insert',
        IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
        p.rows, 
        a.total_pages,
        a.data_pages,
        AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM    sys.partitions AS p
        LEFT JOIN sys.allocation_units AS a
            ON a.container_id = p.partition_id
WHERE   p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND     p.index_id IN (0, 1); -- CLUSTERED OR HEAP  

-- CHECK READS REQUIRED FOR FULL TABLE SCAN
SET STATISTICS IO ON;
SELECT COUNT(Filler)
FROM dbo.DefragTest;

-- REBUILD INDEX
ALTER INDEX PK_DefragTest__ID ON dbo.DefragTest REBUILD;

-- CHECK PAGE STATISTICS
SELECT  Stage = 'After Index Rebuild',
        IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
        p.rows, 
        a.total_pages,
        a.data_pages,
        AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM    sys.partitions AS p
        LEFT JOIN sys.allocation_units AS a
            ON a.container_id = p.partition_id
WHERE   p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND     p.index_id IN (0, 1); -- CLUSTERED OR HEAP  

-- CHECK READS REQUIRED FOR FULL TABLE SCAN

SELECT COUNT(Filler)
FROM dbo.DefragTest;

SET STATISTICS IO OFF;

Output with Reseed:

Stage                   IdentitySeed    rows    total_pages     data_pages  AvgRecordsPerPage
After Initial Insert    100000          100000  178             174         574.71
After Delete            100000          5000    90              87          57.47
After Second Insert     52624           7624    98              91          83.78
After Index Rebuild     52624           7624    18              14          544.57

Table 'DefragTest'. Scan count 1, logical reads 93 (Count before rebuild)

Table 'DefragTest'. Scan count 1, logical reads 16 (Count after rebuild)

Output without Reseed:

Stage                   IdentitySeed    rows    total_pages     data_pages  AvgRecordsPerPage
After Initial Insert    100000          100000  178             174         574.71
After Delete            100000          5000    90              87          57.47
After Second Insert     102624          7624    98              91          83.78
After Index Rebuild     52624           7624    18              14          544.57

Table 'DefragTest'. Scan count 1, logical reads 93 (Count before rebuild)

Table 'DefragTest'. Scan count 1, logical reads 16 (Count after rebuild)

As you can see, in each case there is no difference, in the way the data is stored or read, it is only the value of IDENT_INCR() that changes, and in both cases rebuilding the clustered index drastically reduces the number of pages, which in turn improves query performance since there are less logical reads to get the same amount of data.