What is the most efficient way to query a MongoDB collection for unique/distinct substrings in a given field?
Example documents:
{"_id": "1234.abc.test1", "some_key": "some_value"}
{"_id": "1234.abc.test2", "some_key": "some_value"}
{"_id": "0420.def.test3", "some_key": "some_value"}
The document IDs above follow an internal namespacing convention. I need to know what all of the distinct first elements of that namespacing are.
The desired output of the query on the above collection:
1234
0420
I am trying to avoid getting the entire dataset back only to do row['_id'].split('.')[0]
on each row afterwards. Ideally, the query should return only the distinct list of those substrings.
The idea is actually the same as yours(i.e. splitting by .
and get the first element), then $group
them to get distinct records.
db.collection.aggregate([
{
$project: {
first: {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
"$group": {
"_id": "$first"
}
}
])
Here is the Mongo playground for your reference.
Here is the PyMongo implementation of the above query:
pipeline = [
{"$project": {"first": {"$arrayElemAt": [{"$split": ["$_id", "."]}, 0]}}},
{"$group": {"_id": "$first"}}
]
result = self.collection.aggregate(pipeline=pipeline, allowDiskUse=False)