I have the following schema:
schema embeddings {
    document embeddings {
        field id type int {}
        field text_embedding type tensor<double>(d0[960]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
        }
    }
    rank-profile closeness {
        num-threads-per-search: 1
        inputs {
            query(query_embedding) tensor<double>(d0[960])
        }
        first-phase {
            expression: closeness(field, text_embedding)
        }
    }
}
And the following services.xml:
...
<container id="query" version="1.0">
<search/>
<nodes>
<node hostalias="query" />
</nodes>
</container>
<content id='mind' version='1.0'>
<redundancy>1</redundancy>
<documents>
<document type='embeddings' mode="index"/>
</documents>
<nodes>
<node hostalias="content1" distribution-key="0"/>
</nodes>
</content>
...
Then I have a number of queries, all of the same format:
{
    'yql': 'select * from embeddings where ({approximate:false, targetHits:100} nearestNeighbor(text_embedding, query_embedding));',
    'timeout': 5,
    'hits': 100,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'closeness',
    },
}
which are then run via app.query_batch(test_queries)
The problem is that some responses look like this (and contain the integer id I inserted):
{'id': 'id:embeddings:embeddings::786559', 'relevance': 0.5703559830732123, 'source': 'mind', 'fields': {'sddocname': 'embeddings', 'documentid': 'id:embeddings:embeddings::786559'}}
and others look like this (neither containing the integer id I inserted, nor keeping the format of the previous example):
{'id': 'index:mind/0/b0dde169c545ce11e8fd1a17', 'relevance': 0.49024561522459087, 'source': 'mind'}
How can I make all responses look like the first one? Why are they different at all?
Some of your hits are filled with content and some are not, I suppose because the request timed out. Check the coverage info in the response, and run with traceLevel=3 to see more details.
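For example, here is a minimal pyvespa sketch for doing that, assuming a client like the app behind your app.query_batch call (the URL and port below are placeholders) and one of your query dicts as test_queries[0]:

from vespa.application import Vespa

app = Vespa(url='http://localhost', port=8080)  # assumption: point this at your own endpoint

# Re-run one of the queries with tracing enabled and look at the coverage info.
query_body = dict(test_queries[0], traceLevel=3)  # traceLevel=3 makes Vespa return trace output
response = app.query(body=query_body)

root = response.json['root']
print(root.get('coverage'))                      # how much of the corpus was searched
print(root.get('coverage', {}).get('degraded'))  # present if the query was degraded, e.g. by a timeout
print(response.json.get('trace'))                # detailed trace produced by traceLevel=3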
Some more background info on what's going on:
Searches are executed in two phases: First, minimal information on each hit is returned from each content node up to the issuing container. These partial lists are then merged to produce the final list of hits (of the requested length). For those we execute phase two, which is to fill the content of the final hits. This involves doing another request to each of the content nodes to get the relevant content.
If there's little time left, or lots of data, or expensive summary features to compute, or a slow disk subsystem or network, or a node in some kind of trouble, this fill may time out, leaving only some hits filled, which is what you are seeing.
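You can see this in your two example hits: the filled one has a fields object and the real document id, the unfilled one has neither. A quick way to check how many hits were left unfilled (just a sketch, assuming the app and test_queries from your question, and that app.query_batch returns one response per query, each with a .hits list):

responses = app.query_batch(test_queries)
for query, response in zip(test_queries, responses):
    # hits that went through the fill phase have a 'fields' object; unfilled hits do not
    unfilled = [hit for hit in response.hits if 'fields' not in hit]
    if unfilled:
        print(f'{len(unfilled)} of {len(response.hits)} hits were not filled')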
Why are the ids not the true document id in these cases? The text string id is stored in the disk document blob but not in memory as an attribute, so it needs to be fetched in the fill phase too. If the hit is not filled, an internally generated unique id is used instead.
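So to get all responses to look like the first one, one query-side mitigation is to give the fill phase more time, for example by re-issuing only the queries that came back with unfilled hits using a larger timeout. Again just a sketch, reusing the responses list from the check above; 'timeout': 5 in your queries means 5 seconds, and 20 below is an arbitrary larger value:

# re-issue only the queries whose responses contained unfilled hits, with a larger timeout
retry_queries = [
    dict(query, timeout=20)
    for query, response in zip(test_queries, responses)
    if any('fields' not in hit for hit in response.hits)
]
retried = [app.query(body=query) for query in retry_queries]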