I’m working with a MongoDB compound index like this:
db.users.createIndex({ gender: 1, city: 1, age: 1 })
Suppose city has only a few possible values, e.g., "Chicago", "LA", "NY".
I can query either by leaving city out:
db.users.find({ gender: "M", age: { $gt: 25 } })
or by explicitly specifying all possible cities in $in:
db.users.find({ gender: "M", city: { $in: ["Chicago","LA","NY"] }, age: { $gt: 25 } })
From what I understand:
Leaving city out makes MongoDB internally use [MinKey, MaxKey] for the index scan.
Using $in with all possible values essentially covers the same range.
My questions:
If $in includes all possible values for a field, is there any performance difference compared to leaving the field out?
Does the number of possible values matter in this case, or will MongoDB treat both queries the same in terms of index scanning?
Are there any cases (e.g., very large datasets, range queries) where one approach might be preferable?
I want to understand if manually specifying all values in $in is ever beneficial when the field is fully covered.
Explain with the "allPlansExecution" option is very helpful for determining this.
Setting up the playground with the database containing 100 documents with 3 fields:
"city": {
"type": "enum",
"values": ["Chicago", "LA", "NY"]
},
"gender": {
"type": "enum",
"values": ["M","F"]
},
"age": {"type": "int", "min": 13, "max": 95}
and an index on
{ "gender":1, "city":1, "age":1 }
Running this query:
{$match: {
"gender": "M",
"city": {$in: ["Chicago", "LA", "NY"]},
"age": {$lte: 26}
}}
shows the executor needed 3 index seeks, examined 15 index keys, to return 12 documents.
Playground - Note that the data is pseudo random, so the above numbers may vary in your run.
Running the query without the city
predicate shows the executor needed 4 index seeks, examined 15 index keys, to return 12 documents.
This suggests that without the city
clause, the executor first did a seek for
="M",>MinKey,>MinKey
to determine that the min value was "Chicago".
Then did a seek for
="M",="Chicago",>26
and then scanned forward until a key didn't match.
Then it did a seek for ="M",>"Chicago",>26
, found "LA" and scanned forward.
Then it did a seek for ="M",>"LA",>26
, found "NY" and and scanned forward.
When the city
clause was included, the executor was able to skip the seek for ="M",>MinKey,>MinKey
.
In my opinion that is such an infinitesimally small difference that it is probably outweighed by the serialization, deserialization, and network transmit time required due to the size difference for the query to include the array of all possible values.