mongodbmongodb-queryaggregation-frameworkmongodb-indexes

In MongoDB, is there a performance difference between $in covering all possible values vs leaving the field out in a compound index?


I’m working with a MongoDB compound index like this:

db.users.createIndex({ gender: 1, city: 1, age: 1 })

Suppose city has only a few possible values, e.g., "Chicago", "LA", "NY".

I can query either by leaving city out:

db.users.find({ gender: "M", age: { $gt: 25 } })

or by explicitly specifying all possible cities in $in:

db.users.find({ gender: "M", city: { $in: ["Chicago","LA","NY"] }, age: { $gt: 25 } })

From what I understand:

My questions:

  1. If $in includes all possible values for a field, is there any performance difference compared to leaving the field out?

  2. Does the number of possible values matter in this case, or will MongoDB treat both queries the same in terms of index scanning?

  3. Are there any cases (e.g., very large datasets, range queries) where one approach might be preferable?

I want to understand if manually specifying all values in $in is ever beneficial when the field is fully covered.


Solution

  • Explain with the "allPlansExecution" option is very helpful for determining this.

    Setting up the playground with the database containing 100 documents with 3 fields:

          "city": {
            "type": "enum",
            "values": ["Chicago", "LA", "NY"]
          },
          "gender": {
            "type": "enum",
            "values": ["M","F"]
          },
          "age": {"type": "int", "min": 13, "max": 95}
    

    and an index on

    { "gender":1, "city":1, "age":1 }
    

    Running this query:

    {$match: {
          "gender": "M",
          "city": {$in: ["Chicago", "LA", "NY"]},
          "age": {$lte: 26}
    }}
    

    shows the executor needed 3 index seeks, examined 15 index keys, to return 12 documents.

    Playground - Note that the data is pseudo random, so the above numbers may vary in your run.

    Running the query without the city predicate shows the executor needed 4 index seeks, examined 15 index keys, to return 12 documents.

    Playground

    This suggests that without the city clause, the executor first did a seek for ="M",>MinKey,>MinKey to determine that the min value was "Chicago".
    Then did a seek for ="M",="Chicago",>26 and then scanned forward until a key didn't match.
    Then it did a seek for ="M",>"Chicago",>26, found "LA" and scanned forward.
    Then it did a seek for ="M",>"LA",>26, found "NY" and and scanned forward.

    When the city clause was included, the executor was able to skip the seek for ="M",>MinKey,>MinKey.

    In my opinion that is such an infinitesimally small difference that it is probably outweighed by the serialization, deserialization, and network transmit time required due to the size difference for the query to include the array of all possible values.