mongodb pymongo knn vector-search

Filter on MongoDB Vector Search doesn't work as expected


I'm building an aggregation pipeline in mongodb and I'm encountering some unexpected behaviour.

The pipeline is as follows:

[{
   "$search":{
      "index":"vector_index",
      "knnBeta":{
         "vector":[
            -0.30345699191093445,
            0.6833441853523254,
            1.2565147876739502,
            -0.6364057064056396
         ],
         "path":"embedding",
         "k":10,
         "filter":{
            "compound":{
               "filter":[
                  {
                     "text":{
                        "path":"my.field.name",
                        "query":[
                           "value1",
                           "value2",
                           "value3",
                           "value4"
                        ]
                     }
                  },
                  {
                     "text":{
                        "path":"my.field.name2",
                        "query":"something_else"
                     }
                  }
               ]
            }
         }
      }
   }
},
    {
   "$project":{
      "score":{
         "$meta":"searchScore"
      },
      "embedding":0
   }
}

]

The pipeline should run a vector search using vector_index, embedding, and the query vector, and that part seems to work correctly. The filter should then restrict the vector search to documents where my.field.name equals value1 or value2 or ... and my.field.name2 equals something_else.

Instead, only the second filter works, or at least it seems to (the value of the second filter is a single letter).

I also tried using a must clause in place of the filter clause inside compound, but the outcome remains the same.

I also tried removing the second filter (the one without the list), and I still get unfiltered results.

Am I doing something wrong? How can I filter correctly?
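
For reference, here is a minimal sketch of how the same pipeline could be assembled in Python, since the question is tagged pymongo. The helper name build_pipeline and its parameters are made up for illustration; the index name, paths, and values are the ones from the pipeline above.

```python
def build_pipeline(query_vector, names, name2, k=10):
    """Assemble the $search/knnBeta pipeline with two text filter clauses."""
    text_filters = [
        # First clause: my.field.name should match one of several values.
        {"text": {"path": "my.field.name", "query": list(names)}},
        # Second clause: my.field.name2 should match a single value.
        {"text": {"path": "my.field.name2", "query": name2}},
    ]
    search_stage = {
        "$search": {
            "index": "vector_index",
            "knnBeta": {
                "vector": list(query_vector),
                "path": "embedding",
                "k": k,
                "filter": {"compound": {"filter": text_filters}},
            },
        }
    }
    project_stage = {
        "$project": {"score": {"$meta": "searchScore"}, "embedding": 0}
    }
    return [search_stage, project_stage]


pipeline = build_pipeline(
    [-0.30345699191093445, 0.6833441853523254,
     1.2565147876739502, -0.6364057064056396],
    ["value1", "value2", "value3", "value4"],
    "something_else",
)
# With pymongo this would be run as:
#   results = list(collection.aggregate(pipeline))
# where `collection` is assumed to be an existing pymongo Collection.
```

Building the stages as plain dicts keeps the brace structure checked by the Python parser, which makes nesting mistakes in the filter list easier to spot.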


Solution

  • OK, I think I have found the reason for this behaviour and how to solve it.

    By default, MongoDB Atlas Search uses the Standard Analyzer as the search analyser for fields that are not vectors. In JSON:

        {
          "mappings": {
            "fields": {
              "title": {
                "type": "string",
                "analyzer": "lucene.standard"
              }
            }
          }
        } 
    

    The standard analyser "divides text into terms based on word boundaries".

    As a consequence, if the search term contains a space, it is split on the spaces and the search matches ANY of the resulting words.

    To avoid this behaviour, it is necessary to use the Keyword Analyser, which instead treats the whole string as a single search term.

    In the end, the index definition should look like this:

    {
      "mappings": {
        "dynamic": true,
        "fields": {
          "embedding": {
            "dimensions": 768,
            "similarity": "cosine",
            "type": "knnVector"
          },
          "my.field.name": {
            "analyzer": "lucene.keyword",
            "type": "string"
          }
        }
      }
    }
    

    In particular, the first part is the definition of the (custom) vector search, while

    "my.field.name": { "analyzer": "lucene.keyword", "type": "string" }

    specifies that we want to use the keyword analyser.
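
To make the difference between the two analysers concrete, here is a rough Python simulation. It is not Lucene's actual tokenizer: standard_tokens only approximates lucene.standard by lowercasing and splitting on non-word characters, keyword_tokens approximates lucene.keyword by keeping the whole string as one term, and the sample values are invented for illustration.

```python
import re


def standard_tokens(text):
    """Rough stand-in for lucene.standard: lowercase, split on non-word chars."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def keyword_tokens(text):
    """Rough stand-in for lucene.keyword: the whole string is a single term."""
    return [text]


def text_matches(field_value, query, tokenize):
    """A text query matches if ANY query term equals ANY indexed term."""
    indexed = set(tokenize(field_value))
    return any(term in indexed for term in tokenize(query))


docs = ["red apple", "green apple", "red car"]  # hypothetical field values

# Standard-style analysis: the query "red apple" becomes ["red", "apple"],
# so every value sharing at least one of those words matches.
standard_hits = [d for d in docs if text_matches(d, "red apple", standard_tokens)]

# Keyword-style analysis: only the exact string matches.
keyword_hits = [d for d in docs if text_matches(d, "red apple", keyword_tokens)]

print(standard_hits)  # ['red apple', 'green apple', 'red car']
print(keyword_hits)   # ['red apple']
```

With the standard-style tokens the filter lets through far more documents than intended, which looks like "unfiltered results"; with the keyword-style tokens the filter behaves like an exact match.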