azureazure-cosmosdbazure-cosmosdb-sqlapi

How to find duplicates in a nested array in cosmos db without GROUP BY and COUNT


I am trying to find duplicates in a nested object in a collection. In ye olde SQL, I would do this with some sort of GROUP BY and a COUNT. Cosmos DB doesn't support GROUP BY (as far as I can see) so I am trying to find a work around. One limitation is that I only have access to the data explorer in the azure portal (Don't ask).

To explain in more detail, suppose you have a collection like the following. Note that the first item has a duplicate in the "stuff" collection:

[
    {
        "id": "1",
        "Name": "Item with duplicate stuff",
        "stuff" : [
            {
                "name" : "A",
            },
            {
                "name" : "B",
            },
            {
                "name" : "A"
            }  
        ]
    },
    {
        "id": "2",
        "Name": "Item with unique stuff",
        "stuff" : [
            {
                "name" : "A",
            },
            {
                "name" : "B",
            },
            {
                "name" : "C"
            }  
        ]
    }    

I want to find all the items in my collection that have duplicates in the "stuff" property. So in this case it would return the item with id "1". Something like this would do nicely:

[
    {
        "id": "1"
    } 
] 

Nothing I have tried has worked and is unfit to show here.


Solution

  • I ran into the exact same issue. It might not be the most elegant solution, but it will get you exactly what you want. Basically the query gets the count and the distinct count in c["stuff"], compares the two counts, then finally shows the id of the record whose count and distinct count are not the same.

    SELECT 
        d.id
    FROM
    (
        SELECT c.id,
        (SELECT VALUE COUNT(f.name) FROM f IN c["stuff"]) totalCount,
        ARRAY(SELECT DISTINCT VALUE f.name FROM f IN c["stuff"]) noDupArray
        FROM c
    ) d
    WHERE 
        d.totalCount <> (SELECT VALUE COUNT(1) FROM f IN d.noDupArray)