Tags: mongodb, mongodb-query, aggregation-framework, mongodb-indexes

How to get MongoDB to use a compound index when one of the fields is not involved?


compound index:

{
  A: 1,
  B: 1,
  C: 1
}
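
For reference, an index with that key pattern would typically have been created like so:

db.col.createIndex({ A: 1, B: 1, C: 1 })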

query:

col.aggregate([
  {
    $match: {
     B: "ssome_value",
     C: "some_other_value,
    }
  }
])

The query does not involve A, so the compound index is not used. I can trick it into using the index like this:

col.aggregate([
  {
    $match: {
     A: { $exists: true },
     B: "ssome_value",
     C: "some_other_value,
    }
  }
])

But the problem is that the results would be inaccurate: the $exists filter excludes documents that lack an A field. The query should return all matching documents regardless of whether A exists.


Solution

  • I'm going to take a different approach from the one in the existing answer. While that answer does address the "how", which is what you directly asked about, this answer goes into the "why" of the behavior. Hopefully that knowledge will help you reason about how to approach this situation. Along the way we will incidentally see an alternative syntax that satisfies your original request, though whether you should use it is a separate question.

    The reason the database doesn't want to use that index in this situation is that doing so probably isn't very efficient. An index is a well-ordered data structure, sorted by its fields in the order they appear in the index definition. Since your index is ordered first by A and your query retrieves documents regardless of their value of A, this means:

    1. The items that are relevant to your query are scattered all throughout the index.
    2. The database could end up doing a significant amount of work jumping all around the index to find those relevant entries.

    By contrast, an index whose leading key is used by the query contains all of the relevant entries in one contiguous section of the index. The database knows this, and it is precisely what makes such indexes so effective at supporting selective queries.
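
    If this query shape is common, the more direct fix is an index that leads with the queried fields, e.g. (a suggestion on my part, not from the original post):

    // With B and C leading, all matching entries sit in one contiguous
    // section of the index, so the planner will use it without any tricks.
    db.col.createIndex({ B: 1, C: 1 })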

    In order to explore this further, we can use this alternative syntax to trigger index usage (or at least eligibility):

    { 
        A: { '$gte': MinKey() }, 
        B: 'some_value', 
        C: 'some_other_value' 
    }
    

    Here we are just telling the system that we want documents with any value for A by using the MinKey BSON type (see also the spec), e.g.:

    > db.foo.aggregate([{$match:{A: { '$gte': MinKey() }, B: 'some_value', C: 'some_other_value' }}])
    [
      { _id: 0, B: 'some_value', C: 'some_other_value' },
      { _id: 2, A: 123, B: 'some_value', C: 'some_other_value' },
      { _id: 1, A: true, B: 'some_value', C: 'some_other_value' }
    ]
    

    This syntax generates the following boundaries on the noted index:

            indexBounds: {
              A: [ '[MinKey, MaxKey]' ],
              B: [ '["ssome_value", "ssome_value"]' ],
              C: [ '["some_other_value", "some_other_value"]' ]
            }
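
    In case you want to inspect those bounds yourself, they live in the explain output. Here is a hypothetical mongosh sketch (using the equivalent find() query, whose explain output is a bit easier to walk):

    // Hypothetical: descend the winning plan to its leaf stage (an IXSCAN here)
    // and print the index bounds it will use.
    const plan = db.foo.find({ A: { $gte: MinKey() }, B: 'some_value', C: 'some_other_value' })
                       .explain().queryPlanner.winningPlan;
    let stage = plan;
    while (stage.inputStage) stage = stage.inputStage;
    printjson(stage.indexBounds);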
    

    Be careful here though. Just because you can do this does not necessarily mean that you should. I mentioned above that there is a good reason that the optimizer avoids using this index by default. It is possible that this index could be more effective than doing a full collection scan. But it is also possible that using the index would be worse!

    This mostly comes down to how many distinct values of A there are in the collection. When A has very few distinct values, the index will probably be quite helpful.
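
    To make that concrete, suppose we pad the collection with more (non-matching) documents that reuse just the three values of A seen above. A seeding script along these lines would do it (a hypothetical sketch; only the resulting counts below are from the actual test):

    // Hypothetical seed: bring the collection to 100 docs while A keeps
    // only 3 distinct values; the 3 original docs remain the only matches.
    const As = [null, 123, true];
    for (let i = 3; i < 100; i++) {
      db.foo.insertOne({ A: As[i % 3], B: 'other', C: 'other' });
    }

    With the data in that shape, we can observe the following: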

    > db.foo.distinct('A').length
    3
    > db.foo.count()
    100
    > db.foo.aggregate([{$match:{A: { '$gte': MinKey() }, B: 'some_value', C: 'some_other_value' }},{$count:"count"}])
    [ { count: 3 } ]
    > db.foo.aggregate([{$match:{A: { '$gte': MinKey() }, B: 'some_value', C: 'some_other_value' }}]).explain("executionStats").executionStats
    {
      nReturned: 3,
      totalKeysExamined: 4,
      totalDocsExamined: 3,
      ...
    

    But at the other egregious extreme, consider the case where the values of A are unique across the documents in the collection.
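
    To simulate that, one might re-seed so that every document carries a distinct A value (again a hypothetical sketch, not from the original post):

    // Hypothetical re-seed: give each of the 100 documents a unique A
    // (reusing each document's _id as the value).
    db.foo.find().forEach(doc => {
      db.foo.updateOne({ _id: doc._id }, { $set: { A: doc._id } });
    });

    Here the database will scan the full index: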

    > db.foo.distinct('A').length
    100
    > db.foo.count()
    100
    > db.foo.aggregate([{$match:{A: { '$gte': MinKey() }, B: 'some_value', C: 'some_other_value' }},{$count:"count"}])
    [ { count: 3 } ]
    > db.foo.aggregate([{$match:{A: { '$gte': MinKey() }, B: 'some_value', C: 'some_other_value' }}]).explain("executionStats").executionStats
    {
      nReturned: 3,
      totalKeysExamined: 100,
      totalDocsExamined: 3,
      ...
    

    While scanning the index is often faster than scanning the collection directly, consider also that the database re-traversed the index from the root node for almost every key, as opposed to scanning sequentially across the leaf nodes. We can tell this by looking at the metrics inside the IXSCAN stage:

          indexBounds: {
            A: [ '[MinKey, MaxKey]' ],
            B: [ '["ssome_value", "ssome_value"]' ],
            C: [ '["some_other_value", "some_other_value"]' ]
          },
          keysExamined: 100,
          seeks: 98,
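
    The same stage-descent trick from the earlier sketch works against executionStats if you want those counters programmatically (again hypothetical, via the equivalent find() query):

    // Hypothetical: walk executionStages to the leaf (IXSCAN) stage and
    // print the counters discussed above.
    const es = db.foo.find({ A: { $gte: MinKey() }, B: 'some_value', C: 'some_other_value' })
                     .explain('executionStats').executionStats;
    let s = es.executionStages;
    while (s.inputStage) s = s.inputStage;
    printjson({ stage: s.stage, keysExamined: s.keysExamined, seeks: s.seeks });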
    

    That seek-heavy access pattern is not what makes index scans efficient, so it is unclear whether this approach would actually be faster in a situation where the A field has high cardinality.

    So in short: this is a situation where you may be in a better position than the database to understand which approach to this particular operation is most efficient. If so, and if it doesn't make sense to create a different index, then you may have to help the database along, either by adjusting the query (as we've done here with MinKey) or by hinting the index.
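
    For completeness, the hint route looks like this; aggregate() accepts a hint option that can name an index by its key pattern (collection and field names as in the examples above):

    // Force the { A: 1, B: 1, C: 1 } index without adding any predicate on A;
    // the results are unaffected, only the access path changes.
    db.foo.aggregate(
      [ { $match: { B: 'some_value', C: 'some_other_value' } } ],
      { hint: { A: 1, B: 1, C: 1 } }
    )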