I have a data set that I want to place in a clustered column store and optimize it for segment elimination when accessing it by a column called SubjectId
which is defined as uniqueidentifier
type.
I'm using an Azure SQL database with a compatibility level of 16. I have confirmed this same behavior in SQL Server Developer Edition 2022. According to the documentation, this version should support segment elimination based on a uniqueidentifier column.
Following various best practices, I first take my data set and create a rowstore clustered index ordered by that SubjectId
:
CREATE CLUSTERED INDEX [MyData_CCI]
ON [dbo].[MyData_CCS] (SubjectId)
WITH (MAXDOP = 1);
I then create the clustered document store using the DROP EXISTING
option:
CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
ON [dbo].[MyData_CCS]
WITH (DROP_EXISTING = ON, MAXDOP = 1);
The resulting segments don't appear to be aligned whatsoever to this SubjectId
:
And when I try and query the data using a single SubjectId
in the where clause, the statistics show zero segment elimination:
(44 rows affected)
Table 'MyData_CCS'. Scan count 1, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 2769, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0. Table 'MyData_CCS'. Segment reads 77, segment skipped 0.
What am I doing wrong here? Isn't SQL Server 2022+ (and therefore SQL Azure?) supposed to support uniqueidentifier
for segment elimination and predicate push down? As I said, this also happens with SQL Dev Edition 2022, so I don't think it's an Azure thing.
You can repro this yourself with the following schema/data generation script:
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyData]') AND type in (N'U'))
DROP TABLE [dbo].[MyData]
GO
CREATE TABLE [dbo].[MyData](
[SubjectId] [uniqueidentifier] NOT NULL,
[SomeDateTime] [datetime] NOT NULL,
[SomeInteger] [int] NOT NULL,
[SomeString] [nvarchar](max) NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
-- Create a temporary table to store unique SubjectId values
DROP TABLE #SubjectIds
CREATE TABLE #SubjectIds (
SubjectId UNIQUEIDENTIFIER
);
GO
WITH NumberSeries AS (
SELECT 1 AS Number
UNION ALL
SELECT Number + 1
FROM NumberSeries
WHERE Number < 42000
)
INSERT INTO #SubjectIds (SubjectId)
SELECT NEWID()
FROM NumberSeries
OPTION (MAXRECURSION 0);
-- Insert 21 million rows using cross join
INSERT INTO MyData (SubjectId, SomeDateTime, SomeInteger, SomeString)
SELECT
s.SubjectId,
DATEADD(SECOND, ABS(CHECKSUM(NEWID()) % 31536000), '2020-01-01'), -- Random date within a year
ABS(CHECKSUM(NEWID()) % 1000000), -- Random integer
REPLICATE(N'A', 100) -- Example string, adjust as needed
FROM
#SubjectIds s
CROSS JOIN (SELECT TOP (500) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master..spt_values) AS x;
CREATE CLUSTERED INDEX [MyData_CCI]
ON [dbo].[MyData] (SubjectId)
WITH (MAXDOP = 1, DROP_EXISTING = ON);
GO
CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
ON [dbo].[MyData]
WITH (DROP_EXISTING = ON, MAXDOP = 1);
--Show Segment data
select s.Name as SchemaName,
t.Name as TableName,
i.Name as IndexName,
c.name as ColumnName,
c.column_id as ColumnId,
cs.segment_id as SegmentId,
cs.min_data_id as MinValue,
cs.max_data_id as MaxValue
from sys.schemas s
join sys.tables t
on t.schema_id = s.schema_id
join sys.partitions as p
on p.object_id = t.object_id
join sys.indexes as I
on i.object_id = p.object_id
and i.index_id = p.index_id
join sys.index_columns as ic
on ic.[object_id] = I.[object_id]
and ic.index_id = I.index_id
join sys.columns c
on c.object_id = t.object_id
and c.column_id = ic.column_id
join sys.column_store_segments cs
on cs.hobt_id = p.hobt_id
and cs.column_id = ic.index_column_id
WHERE t.Name = 'MyData' AND c.Name = 'SubjectId'
ORDER BY cs.segment_id
This was resolved via discovering that the min/max id meta data is only for numeric types. For GUIDs, strings, etc. post SQL 2022, you need to look at the min_deep_data
and max_deep_data
columns. In my case, these were not populated despite having created these column stores brand new on a level 16 compatibility DB in SQL Azure. I do not know why that's the case, but it is.
I was forced to do a REBUILD
with MAXDOP 1
on the affected column stores to populate this meta data, after which I saw more appropriate segment elimination. Specifically:
ALTER INDEX [MyData_CCI] ON [dbo].[MyData] REBUILD WITH (MAXDOP = 1);
I dunno, this worked fine for me.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[MyData]') AND type in (N'U'))
DROP TABLE [dbo].[MyData]
GO
CREATE TABLE [dbo].[MyData](
[SubjectId] [uniqueidentifier] NOT NULL,
[SomeDateTime] [datetime] NOT NULL,
[SomeInteger] [int] NOT NULL,
[SomeString] [nvarchar](max) NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
-- Create a temporary table to store unique SubjectId values
DROP TABLE #SubjectIds
CREATE TABLE #SubjectIds (
SubjectId UNIQUEIDENTIFIER
);
GO
WITH NumberSeries AS (
SELECT 1 AS Number
UNION ALL
SELECT Number + 1
FROM NumberSeries
WHERE Number < 42000
)
INSERT INTO #SubjectIds (SubjectId)
SELECT NEWID()
FROM NumberSeries
OPTION (MAXRECURSION 0);
-- Insert 21 million rows using cross join
INSERT INTO MyData (SubjectId, SomeDateTime, SomeInteger, SomeString)
SELECT
s.SubjectId,
DATEADD(SECOND, ABS(CHECKSUM(NEWID()) % 31536000), '2020-01-01'), -- Random date within a year
ABS(CHECKSUM(NEWID()) % 1000000), -- Random integer
REPLICATE(N'A', 100) -- Example string, adjust as needed
FROM
#SubjectIds s
CROSS JOIN (SELECT TOP (500) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM master..spt_values) AS x;
CREATE CLUSTERED INDEX [MyData_CCI]
ON [dbo].[MyData] (SubjectId)
WITH (MAXDOP = 1);
GO
CREATE CLUSTERED COLUMNSTORE INDEX [MyData_CCI]
ON [dbo].[MyData]
WITH (DROP_EXISTING = ON, MAXDOP = 1);
set statistics io on
select *
from Mydata
where SubjectId = '15A4815B-DDD3-48BB-84DE-A578663D8D91'
/*
Table 'MyData'. Scan count 4, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 51224, lob physical reads 9, lob page server reads 0, lob read-ahead reads 80541, lob page server read-ahead reads 0.
Table 'MyData'. Segment reads 10, segment skipped 11.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
*/
SELECT @@version
Microsoft SQL Server 2022 (RTM) - 16.0.1000.6 (X64)
Maybe the segment thingy depends on which GUIDs you are selecting?
Edit, had to do a new batch:
select *
from Mydata
where SubjectId = 'B52D5BC0-33B6-44F9-87C3-1CAFD7FDDEA5'
Output:
SchemaName | TableName | IndexName | ColumnName | ColumnId | SegmentId | MinValue | MaxValue | min_deep_data | max_deep_data | encoding_type | primary_dictionary_id | secondary_dictionary_id | row_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
dbo | MyData | MyData_CCI | SubjectId | 1 | 0 | 26 | 24899 | 100031AAB70512CE9244B9092BFA4B2EF4B1 | 100035979016BB401242B0796DC167CB1F9E | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 1 | 27 | 25837 | 100035979016BB401242B0796DC167CB1F9E | 1000203A7982506BAF45A5CA963EBE4637AF | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 2 | 24 | 26749 | 1000203A7982506BAF45A5CA963EBE4637AF | 10004FD8C586BF0F5D4BB109D296FDD032B3 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 3 | 25 | 27685 | 10004FD8C586BF0F5D4BB109D296FDD032B3 | 1000EBD81ADB9A00F94DA589FE85EC39C5C3 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 4 | 22 | 28601 | 1000B8640879660E6541A25E04E74363E39C | 1000F8CC2A1D96B30B4C8687FF1DAC082C2C | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 5 | 19 | 29482 | 10003F8ACDB56E2CBB4E9C7935F65E78D3A9 | 1000CD9C58D6438CB845A3CE79EC437123E9 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 6 | 20 | 30368 | 1000CD9C58D6438CB845A3CE79EC437123E9 | 1000006AB21D01564048B767D68755ED8FFF | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 7 | 21 | 31299 | 1000006AB21D01564048B767D68755ED8FFF | 1000F151931B7AA63844A40FE37382B66DA4 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 8 | 17 | 32227 | 10002998463928C0034F8DC7067F49C9C803 | 1000E8AF680F3BA2D340A76AFFFE92C5D579 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 9 | 18 | 33119 | 1000C14ABF7C819F784E952E3BD91E0D4059 | 10008733320FD434364890D9946020C5621E | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 10 | 15 | 33983 | 10008733320FD434364890D9946020C5621E | 10000A7AFD6104DE574C9561F4634A3D7AEE | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 11 | 13 | 34899 | 1000C6A9B411DD37C941A17E00057ABE0045 | 1000750EC694DA9E964CBC4CF5956C3450BE | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 12 | 14 | 35785 | 10000F82C113010FAB4699210D5873CEF401 | 1000A2093746B569CE43A5DD4E611EC2232B | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 13 | 11 | 36677 | 1000A2093746B569CE43A5DD4E611EC2232B | 1000ECAD306ED4688C4686719B0832A5C345 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 14 | 12 | 37569 | 1000ECAD306ED4688C4686719B0832A5C345 | 1000D8255C9755EB7E41BD5AA804D4D9DF14 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 15 | 9 | 38474 | 1000A155D38E3D1A8540884B1115D3C54DD2 | 100012931F70BB180A4EBD61C7D33663A9F6 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 16 | 7 | 39324 | 1000EC8501B92CE62847875B143C5C24915B | 10006FC55B8556D70E4D84D85BC913E5F067 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 17 | 8 | 40220 | 10006FC55B8556D70E4D84D85BC913E5F067 | 10001528C2B94ED06E40B365ADB168364FC4 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 18 | 5 | 41104 | 10001528C2B94ED06E40B365ADB168364FC4 | 1000A4772319590F5E48BCBCBA2771741A25 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 19 | 4 | 41981 | 10005578600DA8DCA34A801A16570EDAEBB5 | 100012931F70BB180A4EBD61C7D33663A9F6 | 3 | 0 | 1 | 1048576 |
dbo | MyData | MyData_CCI | SubjectId | 1 | 20 | 532 | 42006 | 1000DDEAACB5C5E57D4C9B131CAF92E77395 | 1000AFF3D2B670ACBB46B04E1CEB27BFFAD0 | 3 | 0 | 1 | 28480 |
Table 'MyData'. Scan count 4, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 10751, lob physical reads 0, lob page server reads 0, lob read-ahead reads 10, lob page server read-ahead reads 0.
Table 'MyData'. Segment reads 8, segment skipped 13.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.