azure azure-cosmosdb azure-cosmosdb-sqlapi

Inconsistencies when using PartitionKeyBuilder with Azure Cosmos .NET SDK and Hierarchical Partition Keys


I have noticed this on a few projects since using hierarchical partition keys.

When building out my queries (not point reads) and wanting to use a subset of my partition keys (for example PartitionKey 1 and 2 out of 3) I'll use the PartitionKeyBuilder like so:

PartitionKeyBuilder partitionKeyBuilder = new PartitionKeyBuilder().Add("Key1").Add("Key2");

And it works when I use a query such as this:

QueryDefinition query = new QueryDefinition($"SELECT * FROM d");

QueryRequestOptions options = new QueryRequestOptions
{
    PartitionKey = partitionKeyBuilder.Build(),
};

var feedIterator = _container.GetItemQueryIterator<dynamic>(query, null, options);

I will get back only results that have "Key1" as the value for the first partition key and "Key2" for the second as well as any value for the third partition key.

New entities are added over time and all goes well. No code changes are made, and then all of a sudden certain partitions are no longer returned by the very same query through the SDK.

They will work through the portal and other methods but just not when called through the SDK using the PartitionKeyBuilder with QueryRequestOptions.

If I change the query to no longer use the PartitionKeyBuilder (essentially moving the partition key lookups into the query) like so:

QueryDefinition query = new QueryDefinition($"SELECT * FROM d WHERE d.PartitionKey1='Key1' AND d.PartitionKey2='Key2'");

var feedIterator = _container.GetItemQueryIterator<dynamic>(query, null);

It works as expected!

I cannot find a rhyme or reason for it. There are no code changes. The keys are strings and are always strings. It just seems to happen to certain groups of entities or partitions in a container at some point in time. Once it starts, no future entities can be queried using the PartitionKeyBuilder in the SDK, and I have to rewrite my queries to include the keys in the main query to resolve the issue.

Here are some more patterns I have found over time:

On a project from a few weeks ago, one of the libraries referenced an older version of the Cosmos DB SDK, and when I updated each library to the latest SDK with hierarchical partition key support the issue was resolved. However, I am now finding that this is happening on a new project I started from scratch with no such legacy references! The issue happens both in my local development environment and in a CI/CD build system where the NuGet packages should be freshly pulled with no local references to older SDKs. I even explicitly set the SDK version to 3.53.1 and use the --no-cache flag on NuGet restore, and it still happens after the build.

I have also noticed that (if the query is running inside an API) it will sometimes fail on one of my servers (for example -dev) but work on another (-prd). The code is exactly the same and the documents are very similar, but each environment uses a different Cosmos DB account and container. Again, moving the keys into the query makes both behave the same, but this is starting to get strange. Could the issue somehow originate at the database level, so that it just stops accepting queries whose partition keys are supplied through QueryRequestOptions? Aside from one account using dev data and the other using production data, there are no differences.

Has anyone else seen this happen? Is there something more I should look at? I really feel like there must be some sort of bug on the Cosmos DB side, as I have exhausted every path I can think of to find a resolution. Could it be that an older SDK is somehow being swapped in on some build system? As mentioned above, I am using Azure DevOps, explicitly set the SDK version to 3.53.1, and use the --no-cache flag on NuGet restore, and it still happens, but perhaps there is another method I should consider? Could the Azure resource itself all of a sudden have an issue with hierarchical keys inside of QueryRequestOptions?

Note: When this issue occurs it is not that the keys are ignored - it simply returns NO results at all.

Update 1: Screenshot of partitions & query example

Here is a screenshot of the data in my Cosmos container:

[Screenshot: partitions]

Here is a sample query where I am getting a distinct list of /Day partitions to use for my reporting front end:

[Screenshot: query]

A typical /Day partition will have 300-1,200 very small documents (only 8-10 JSON properties each), so an entire /Month would be about 30x that and the entire /Year 365x that (so maybe 428k documents).

I checked my metrics and the entire database is below 950 MB in size, so I think everything should still be sitting in a single physical partition. Here is a screenshot of that as well:

[Screenshot: storage]


Solution

  • The problem is the way that you read the query results. As your container grows in size and/or RU/s, it can be split across more physical partitions.

    When that happens, your query, which uses a partial partition key, can become a cross-partition query: it may reach out to multiple physical partitions to get all the results, which normally means the results come back in multiple pages. Even in a single-partition scenario, there are other reasons a query can execute in multiple pages.

    In summary, your code should keep reading from the iterator until HasMoreResults is false, as the documentation describes.

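    using (var feedIterator = _realTimePOSLineItemsContext.Container.GetItemQueryIterator<string>(query, null, options))
    {
      // Keep reading until the service reports that no pages remain.
      while (feedIterator.HasMoreResults)
      {
        // Each call returns one page of results.
        var feedResponse = await feedIterator.ReadNextAsync();

        // Assuming values is a List<string>; FeedResponse<T> is enumerable,
        // so it can be appended directly.
        values.AddRange(feedResponse);
      }
    }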
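
To tie this back to the query in the question, here is a minimal sketch of the same loop wrapped in a reusable helper. The names PartialKeyQuery and ReadAllAsync are hypothetical and only used for illustration; the SDK calls are the same ones used above (GetItemQueryIterator, HasMoreResults, ReadNextAsync). It assumes the Microsoft.Azure.Cosmos v3 SDK with hierarchical partition key support.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical helper class, not part of the SDK.
public static class PartialKeyQuery
{
    // Runs a query scoped to a full or partial (prefix) hierarchical
    // partition key and reads every page before returning.
    public static async Task<List<T>> ReadAllAsync<T>(
        Container container,
        QueryDefinition query,
        PartitionKey partitionKey)
    {
        var options = new QueryRequestOptions { PartitionKey = partitionKey };
        var results = new List<T>();

        using FeedIterator<T> iterator =
            container.GetItemQueryIterator<T>(query, requestOptions: options);

        // Reading only the first page is not enough: keep going until
        // HasMoreResults is false, then return everything collected.
        while (iterator.HasMoreResults)
        {
            FeedResponse<T> page = await iterator.ReadNextAsync();
            results.AddRange(page);
        }

        return results;
    }
}
```

With the builder from the question, a call could look like `await PartialKeyQuery.ReadAllAsync<dynamic>(_container, new QueryDefinition("SELECT * FROM d"), new PartitionKeyBuilder().Add("Key1").Add("Key2").Build())`. If only the first page was being read before, a first page that happens to come back empty would look exactly like "no results at all", which may be what the question describes.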