azureazure-cosmosdbazure-cosmosdb-sqlapi

How To Check For Duplicate Objects in a CosmosDB Document


I took over a service at work. This service looks at some data and creates a new version entry in CosmosDB when the data changes.

A psuedo-example of the data is below. The premise is that we have data for every single airport's flight paths in a JSON array of objects which contain flight data such as start airport, end airport, airplane model, distance, etc. When the flights available at an airport change, we publish a new entry to cosmos with a new unique ID, same airport name, and the cosmosdb has updated paths in the document.

Upon checking Cosmos, what I see is two properties: a unique ID (you can ignore this for the question) and a name which refers to an airport code (SJC = San Jose Airport, LAX = Los Angeles Airport).

When I click on an entry such as 1001; SJC, then it brings up the associated document which has the json structure below

| Id | Name | 
| - | -|
| 1001 | SJC |
| 1002 | SJC |
| 1003 | LAX |
| 1004 | LAG |
| 1005 | SEA |
{
  "Id" : "1001",
  "Name" : "SJC",
  "CreatedAt" "2025-03-20..."
  "Flights" : [
    {
      "StartCity" : "SanJose",
      "EndCity" : "LosAngeles",
      "Airline" : "Southwest",
      "Airplane" : "SmallCarier",
      "Distance" : "450",
    },
    {
      "StartCity" : "Tokyo",
      "EndCity" : "SanJose",
      "Airline" : "ZipAir",
      "Airplane" : "Boeing747",
      "Distance" : "5200",
    },
    {
      "StartCity" : "SanJose",
      "EndCity" : "LosAngeles",
      "Airline" : "Southwest",
      "Airplane" : "SmallCarier",
     "Distance" : "450",
    },
...
  ]
}

Now, the issue is that the Flights array has a duplicate entry at index 2 (duplicate of index 0) due to a code bug and we need to find all affected documents.

I need a Cosmos SQL query that can go through every document in CreatedAt timerange (issue was Feb 10-11 specfically) and do a duplicate check on the internal objects of Flights. If the document has even one duplicate, then return the Id,Name and even better would be returning the actual duplicate too.

I asked this exact thing to ChatGPT while filtering for just one Name (Airport) and got some random answer which pairs objects together. This causes false positives (it forcibly matches entries together) so my suggestion to chatgpt was subtract the matches by the total number of entries in the array. If that number was > 0, then besides the false positive we would see some number of duplicates. Everytime the query becomes complex, chatgpt does something where it uses "HAVING" or an "ARRAY" operator which Cosmos doesn't support.

SELECT f.name, f.version
FROM f
JOIN l1 IN f.Flights
JOIN l2 IN f.Flights
WHERE f.name = "SJC"
AND l1 = l2
AND l1 != {} -- Ensures non-empty objects

I am trying to write a command line solution with C# or python which I can do. However, I would love a way to query once without application level intervention just so everything is in one place (have other things to check/compare on cosmos) and editing it is easier.


Solution

  • You were on the right track with HAVING except that in cosmos we have to use a sub query that has the GROUP BY expression that covers all the fields in the Flight object and a COUNT. Then we just need to filter the results to only records that have more than 1 records.

    SELECT *
    FROM 
    (
        SELECT f.Id, f.Name, Count(1) as 'Count', l.StartCity, l.EndCity, l.Airline, l.Airplane, l.Distance
        FROM f 
        JOIN l IN f.Flights
        GROUP BY c.InvoiceNumber, l.StartCity, l.EndCity, l.Airline, l.Airplane, l.Distance
    ) a
    WHERE a.Count > 1
    

    This will return results similar to this:

    [
        {
            "Id" : "1001",
            "Name" : "SJC",
            "Count" : 2
            "StartCity" : "SanJose",
            "EndCity" : "LosAngeles",
            "Airline" : "Southwest",
            "Airplane" : "SmallCarier",
            "Distance" : "450"
        },
        ...
    ]
    

    This works by flattening out the Flights array using a self join so that we can access the Id and Name from the parent node in the same record as the flight details, from there it is a simple SQL duplicate record query using GROUP BY