pythonmongodbmongodb-querypymongo-3.x

Get distinct pattern match results (substrings) from PyMongo


What is the most efficient way to query a MongoDB collection for unique/distinct substrings in a given field?

Example documents:

{"_id": "1234.abc.test1", "some_key": "some_value"}
{"_id": "1234.abc.test2", "some_key": "some_value"}
{"_id": "0420.def.test3", "some_key": "some_value"}

The document IDs above follow an internal namespacing convention. I need to know what all of the distinct first elements of that namespacing are.

The desired output of the query on the above collection:

1234
0420

I am trying to avoid getting the entire dataset back only to do row['_id'].split('.')[0] on each row afterwards. Ideally, the query should return only the distinct list of those substrings.


Solution

  • The idea is actually the same as yours(i.e. splitting by . and get the first element), then $group them to get distinct records.

    db.collection.aggregate([
      {
        $project: {
          first: {
            "$arrayElemAt": [
              {
                "$split": [
                  "$_id",
                  "."
                ]
              },
              0
            ]
          }
        }
      },
      {
        "$group": {
          "_id": "$first"
        }
      }
    ])
    

    Here is the Mongo playground for your reference.

    Here is the PyMongo implementation of the above query:

    pipeline = [
        {"$project": {"first": {"$arrayElemAt": [{"$split": ["$_id", "."]}, 0]}}},
        {"$group": {"_id": "$first"}}
    ]
    result = self.collection.aggregate(pipeline=pipeline, allowDiskUse=False)