I am using Elasticsearch to compute the cosine similarity between paragraphs and search queries. However, the tutorials I find online seem to indicate that you can only have one vector per indexed document. That is unfortunate for my use case, since each document contains multiple paragraphs and thus has multiple vectors associated with it.
So, when using similarity metrics like cosine similarity or k-nearest neighbors, what is the best way to deal with this? Do I add multiple JSON documents to the index, one per vector?
Or, is there a smarter way to do this?
Since version 8.11, Elasticsearch supports storing multiple vectors per document using nested `dense_vector` fields. It is explained in this blog post.
Example of index creation (copied from the source):
PUT my-long-text-index
{
  "mappings": {
    "properties": {
      "my_long_text_field": {
        "type": "nested", // because there can be multiple vectors per doc
        "properties": {
          "vector": {
            "type": "dense_vector" // the vector used for ranking
          },
          "text_chunk": {
            "type": "text" // the text from which the vector was created
          }
        }
      }
    }
  }
}
Example of data ingestion:
PUT my-long-text-index/_doc/1
{
  "my_long_text_field": [
    {
      "vector": [23, 14, 8],
      "text_chunk": "doc 1 chunk 1"
    },
    {
      "vector": [34, 95, 17],
      "text_chunk": "doc 1 chunk 2"
    }
  ]
}