Dense vector array and cosine similarity


I would like to store an array of dense_vector values in my document, but this does not work the way it does for other data types. For example:

PUT my_index
{
  "mappings": {
    "properties": {
      "my_vectors": {
        "type": "dense_vector",
        "dims": 3  
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_text" : "text1",
  "my_vectors" : [[0.5, 10, 6], [-0.5, 10, 10]]
}

returns:

'1 document(s) failed to index.',
    {'_index': 'my_index', '_type': '_doc', '_id': 'some_id', 'status': 400, 'error': 
      {'type': 'mapper_parsing_exception', 'reason': 'failed to parse', 'caused_by': 
        {'type': 'parsing_exception', 
         'reason': 'Failed to parse object: expecting token of type [VALUE_NUMBER] but found [START_ARRAY]'
        }
      }
    }

How do I achieve this? Different documents will have a variable number of vectors but never more than a handful.

Also, I would then like to query it by performing a cosineSimilarity for each value in that array. The code below is how I normally do it when I have only one vector in the doc.

"script_score": {
    "query": {
        "match_all": {}
    },
    "script": {
        "source": "(1.0+cosineSimilarity(params.query_vector, doc['my_vectors']))",
        "params": {"query_vector": query_vector}
    }
}

Ideally, I would like the closest similarity or an average.


Solution

  • The dense_vector datatype expects one array of numeric values per document like so:

    PUT my_index/_doc/1
    {
      "my_text" : "text1",
      "my_vectors" : [0.5, 10, 6]
    }
    

    To store any number of vectors, you could make the my_vectors field a "nested" type that holds an array of objects, each of which contains one vector:

    PUT my_index
    {
      "mappings": {
        "properties": {
          "my_vectors": {
            "type": "nested",
            "properties": {
              "vector": {
                "type": "dense_vector",
                "dims": 3  
              }
            }
          },
          "my_text" : {
            "type" : "keyword"
          }
        }
      }
    }
    
    PUT my_index/_doc/1
    {
      "my_text" : "text1",
      "my_vectors" : [
        {"vector": [0.5, 10, 6]}, 
        {"vector": [-0.5, 10, 10]}
      ]
    }
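Since each document carries a variable number of vectors, it can help to build the nested body programmatically. The following is a minimal Python sketch, not part of the original answer: `build_doc` and its `dims` parameter are hypothetical names, and the result is a plain dict that you would pass to whatever client you use for indexing.

```python
def build_doc(text, vectors, dims=3):
    """Wrap each vector in an object so it fits the nested "my_vectors" mapping.

    `dims` is assumed to match the "dims" value declared in the index mapping.
    """
    for v in vectors:
        if len(v) != dims:
            raise ValueError(f"expected {dims} dimensions, got {len(v)}")
    return {
        "my_text": text,
        # One {"vector": [...]} object per vector, as in the PUT request above.
        "my_vectors": [{"vector": [float(x) for x in v]} for v in vectors],
    }


doc = build_doc("text1", [[0.5, 10, 6], [-0.5, 10, 10]])
```

Checking the vector length up front is only a convenience; it lets a malformed vector fail in your own code instead of at index time.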
    

    EDIT

    Then, to query the documents, you can use the following (as of ES v7.6.1):

    {
      "query": {
        "nested": {
          "path": "my_vectors",
          "score_mode": "max", 
          "query": {
            "function_score": {
              "script_score": {
                "script": {
                  "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
                  "params": {"query_vector": query_vector}
                }
              }
            }
          }
        }
      }
    }
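The question asks for either the closest similarity or an average, which corresponds to the nested query's score_mode being "max" or "avg". Below is a hedged Python sketch that builds the same query body as a dict and, separately, computes locally what score a document should roughly receive. The helper names (`nested_cosine_query`, `cosine`, `expected_score`) are hypothetical, and the local arithmetic only approximates what Elasticsearch does, since Elasticsearch scores with single-precision floats.

```python
import math


def nested_cosine_query(query_vector, score_mode="max"):
    """Build the nested script-score query body shown above as a Python dict."""
    if score_mode not in ("max", "avg"):
        raise ValueError("this sketch only handles 'max' and 'avg'")
    return {
        "query": {
            "nested": {
                "path": "my_vectors",
                "score_mode": score_mode,
                "query": {
                    "function_score": {
                        "script_score": {
                            "script": {
                                "source": "(1.0+cosineSimilarity(params.query_vector, 'my_vectors.vector'))",
                                "params": {"query_vector": list(query_vector)},
                            }
                        }
                    }
                },
            }
        }
    }


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def expected_score(query_vector, vectors, score_mode="max"):
    """Local estimate: 1 + cosine per nested vector, then max or mean."""
    scores = [1.0 + cosine(query_vector, v) for v in vectors]
    return max(scores) if score_mode == "max" else sum(scores) / len(scores)


body = nested_cosine_query([0.5, 10, 6], score_mode="avg")
best = expected_score([0.5, 10, 6], [[0.5, 10, 6], [-0.5, 10, 10]], "max")
mean = expected_score([0.5, 10, 6], [[0.5, 10, 6], [-0.5, 10, 10]], "avg")
```

With the example document above and a query vector equal to its first vector, the "max" estimate is approximately 2.0 (an exact match), and the "avg" estimate is slightly lower because the second vector points in a somewhat different direction.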
    

    A few things to note: