I've been researching a solution to a performance problem with a system I'm responsible for, and I think at least part of the problem is due to database query performance. We use stored procedures to query "pages" of data in a pretty standard way. However, this paging appears to get more costly as the dataset grows.
Given this simple table populated with sample data:
create table Data (
    Value uniqueidentifier not null,
    constraint PK_Data primary key clustered (Value)
)

insert into Data
-- SeedTable has ~2M rows
select newid() from SeedTable
And this stored procedure to return paged data (this requires SQL Server 2012 apparently, though the SQL Server 2008 style of using ROW_NUMBER() behaves the same way):
create proc GetDataPage
    @Offset int,
    @Count int
as
select Value
from Data
order by Value
offset @Offset rows
fetch next @Count rows only
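For reference, a single page deep into the table would be fetched like so (the parameter values here are just examples):

exec GetDataPage @Offset = 100000, @Count = 50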
I then test the performance of this sproc with this C# code:
const int PageSize = 50;
const int MaxCount = 50000;

using (var conn = new SqlConnection("Data Source=.;Initial Catalog=TestDB;Integrated Security=true;")) {
    conn.Open();
    int a = 0; // total rows read so far
    for (int i = 0; ; i += PageSize) {
        using (var cmd = conn.CreateCommand()) {
            cmd.CommandType = System.Data.CommandType.StoredProcedure;
            cmd.CommandText = "GetDataPage";
            var offset = cmd.CreateParameter();
            offset.ParameterName = "@Offset";
            offset.Value = i;
            cmd.Parameters.Add(offset);
            var count = cmd.CreateParameter();
            count.ParameterName = "@Count";
            count.Value = PageSize;
            cmd.Parameters.Add(count);
            var sw = Stopwatch.StartNew();
            int c = 0; // rows read from this page
            using (var reader = cmd.ExecuteReader()) {
                while (reader.Read()) {
                    c++;
                }
            }
            sw.Stop();
            a += c;
            Console.WriteLine(sw.ElapsedTicks + "\t" + a);
            if (c < PageSize || a >= MaxCount)
                break;
        }
    }
}
When I chart the output of this code I get the following:

[chart omitted: per-page elapsed ticks vs. total rows read, growing linearly]
I would have expected paging like this in SQL to run in constant time, or logarithmic time at worst, but it is pretty clear from the chart that performance is linear (presumably because OFFSET still has to read and discard all of the skipped rows).
Are there any special tricks (hints) to make this work better?
Is there another approach to this that might be faster?
Do other databases behave the same way?
Changing the experimental code to use the "page from" technique that Kevin Suchlicki suggests results in the following:

[chart omitted: per-page elapsed ticks stay essentially flat as the offset grows]
Very impressive. This performance looks much more like what I would expect/want. Now I just need to figure out whether I can apply this to my real problem. The potential issue is that it doesn't allow "random access" to the data, only forward-only, cursor-like access. I'm aware that what I'm doing must look like it violates every notion of good database design.
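For anyone curious, a seek-style version of the sproc can look like this (a minimal sketch; the GetDataPageFrom name and @PreviousValue parameter are illustrative, and passing NULL fetches the first page):

create proc GetDataPageFrom
    @PreviousValue uniqueidentifier,
    @Count int
as
-- Seek directly past the previous page's last key
-- instead of scanning and discarding @Offset rows
select top (@Count) Value
from Data
where @PreviousValue is null or Value > @PreviousValue
order by Value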
The most obvious possibility is in the app design itself. Offer your users filter criteria. Users usually have some idea of what they are looking for and would rather not page through 1000 pages of returned results. How often do you get past page 10 of a Google search?
Having said that, you could try storing the id (clustered index value) of the last row returned on the previous page and using that in your SQL where clause. If you need to allow sorting on different keys (e.g. last name), then store both the clustered index id value and the last name from the final row of the previous page. Then write your SQL like this (you always need to order on your key field plus the clustered id value, so that records with duplicate key values are ordered deterministically):
select top (@count) Id, LastName, FirstName
from Data
where LastName > @previousLastName
   or (LastName = @previousLastName and Id > @previousId)
order by LastName, Id
You would also want to index all the fields that could be sort keys. I'm not sure exactly how the above would perform, but I would expect the seek on indexed fields to be O(log n), as sketched below.
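For example, something like this would support the query above (assuming Id is the clustered primary key, so it rides along in the nonclustered index automatically):

-- Supports the (LastName, Id) seek and sort; the clustered key Id
-- is implicitly included in every nonclustered index
create index IX_Data_LastName on Data (LastName)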
Another option might be to persist the full list, in order, with a row number, behind the scenes every time the source data changes, and have the app pull pages from the persisted table.
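Something along these lines (a sketch only; the DataOrdered name is illustrative, and the rebuild would need to be triggered whenever Data changes):

create table DataOrdered (
    RowNum int not null,
    Value uniqueidentifier not null,
    constraint PK_DataOrdered primary key clustered (RowNum)
)

-- Rebuild whenever Data changes
truncate table DataOrdered
insert into DataOrdered (RowNum, Value)
select row_number() over (order by Value), Value
from Data

-- A page is then a clustered index range seek,
-- i.e. O(log n) random access by offset:
select Value
from DataOrdered
where RowNum > @Offset and RowNum <= @Offset + @Count
order by RowNum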
Good question... Let us know how it turns out please!