entity-framework-coreazure-cosmosdb

How do I find bad data in CosmosDB with EF Core?


I'm trying to pull all entries from my DbSet in CosmosDb like this:

var list = context.MarketplaceTransactions.ToList();

This fails with:

info: 8/23/2024 12:44:05.527 CosmosEventId.ExecutingSqlQuery[30100] (Microsoft.EntityFrameworkCore.Database.Command)

Executing SQL query for container 'StrangeCloudDbContext' in partition '(null)' [Parameters=[]]
SELECT c
FROM root c
WHERE (c["Discriminator"] = "MarketplaceTransaction")
info: 8/23/2024 12:44:07.788 CosmosEventId.ExecutedReadNext[30102] (Microsoft.EntityFrameworkCore.Database.Command)
Executed ReadNext (2088.0046 ms, 39.43 RU) ActivityId='cdfdac63-7605-4746-a90d-9669cf864d0e',
Container='StrangeCloudDbContext', Partition='(null)', Parameters=[]
SELECT c
FROM root c
WHERE (c["Discriminator"] = "MarketplaceTransaction")
fail: 8/23/2024 12:44:07.802 CoreEventId.QueryIterationFailed[10100] (Microsoft.EntityFrameworkCore.Query)

An exception occurred while iterating over the results of a query for context type 'StrangeCloud.Api.Data.StrangeCloudDbContext'.

System.InvalidOperationException: Nullable object must have a value.

at lambda_method14(Closure, QueryContext, JObject)
at Microsoft.EntityFrameworkCore.Cosmos.Query.Internal.CosmosShapedQueryCompilingExpressionVisitor.QueryingEnumerable`1.Enumerator.MoveNext()

I've turned on detailed error reporting and sensitive logging.

Presumably one of the entries in the database has a null entry for a property that isn't null (maybe?). However, there are millions of entries in the database, and dozens of properties, and the error tells me neither the property nor entry that is the problem.

How can I find and fix the problem?


Solution

  • Instead of using ToList(), which tries to materialize all entities at once, you can use AsEnumerable() and then enumerate through the results. This way, you can catch the exception for each problematic entity:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.EntityFrameworkCore;
    
    async Task ProcessMarketplaceTransactions(StrangeCloudDbContext context)
    {
        var problematicEntities = new List<(int index, Exception exception)>();
        var validEntities = new List<MarketplaceTransaction>();
        int totalProcessed = 0;
    
        try
        {
            // First, attempt to execute the query without materializing entities
            var query = context.MarketplaceTransactions.AsNoTracking();
            
            // Use a small Take() to test if the query executes at all
            await query.Take(1).ToListAsync();
    
            // If we get here, the query itself is valid. Now let's process entities.
            await foreach (var entity in query.AsAsyncEnumerable())
            {
                try
                {
                    // Attempt to access all properties to force full materialization
                    var temp = new 
                    {
                        entity.Id,
                        // List all other properties here
                    };
                    validEntities.Add(entity);
                }
                catch (Exception ex)
                {
                    problematicEntities.Add((totalProcessed, ex));
                }
                totalProcessed++;
    
                // Optional: Add a break condition if you want to limit processing
                // if (totalProcessed >= 1000000) break;
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Query execution failed: {ex.Message}");
            return; // Exit if we can't even start the query
        }
    
        Console.WriteLine($"Total entities processed: {totalProcessed}");
        Console.WriteLine($"Valid entities: {validEntities.Count}");
        Console.WriteLine($"Problematic entities: {problematicEntities.Count}");
    
        foreach (var (index, exception) in problematicEntities)
        {
            Console.WriteLine($"Error at index {index}: {exception.Message}");
        }
    }
    
    // Usage
    await ProcessMarketplaceTransactions(context);
    
    1. First attempts to execute the query with a Take(1) to check if the query itself is valid. This will catch any exceptions that occur during query execution.
    2. If the query is valid, it uses AsAsyncEnumerable() to stream the results, which is more efficient for large datasets.
    3. It then attempts to materialize each entity individually, catching exceptions for problematic entities.
    4. The outer try-catch will handle any exceptions that occur during query execution, while the inner try-catch handles exceptions during entity materialization.