[SOLVED] Get distinct pattern match results (substrings) from PyMongo

Get distinct pattern match results (substrings) from PyMongo

What is the most efficient way to query a MongoDB collection for unique/distinct substrings in a given field?

Example documents:

{"_id": "1234.abc.test1", "some_key": "some_value"}
{"_id": "1234.abc.test2", "some_key": "some_value"}
{"_id": "0420.def.test3", "some_key": "some_value"}

The document IDs above follow an internal namespacing convention. I need to know what all of the distinct first elements of that namespacing are.

The desired output of the query on the above collection:

1234
0420

I am trying to avoid getting the entire dataset back only to do row['_id'].split('.')[0] on each row afterwards. Ideally, the query should return only the distinct list of those substrings.

Solution

The idea is actually the same as yours(i.e. splitting by . and get the first element), then $group them to get distinct records.

db.collection.aggregate([
  {
    $project: {
      first: {
        "$arrayElemAt": [
          {
            "$split": [
              "$_id",
              "."
            ]
          },
          0
        ]
      }
    }
  },
  {
    "$group": {
      "_id": "$first"
    }
  }
])

Here is the Mongo playground for your reference.

Here is the PyMongo implementation of the above query:

pipeline = [
    {"$project": {"first": {"$arrayElemAt": [{"$split": ["$_id", "."]}, 0]}}},
    {"$group": {"_id": "$first"}}
]
result = self.collection.aggregate(pipeline=pipeline, allowDiskUse=False)