elasticsearch-6

Search for a phrase or words in documents with timestamped words


I've been trying to do this for a few days, so I guess it's time to ask for a little help.
I'm using Elasticsearch 6.6 (I believe it could be upgraded if needed) and NEST for C# (.NET 5).
The task is to create an index where the documents are the results of speech-to-text recognition, and every recognized word has a timestamp (so that the timestamp can be used to find where the word is spoken in the original file). There are 1000+ texts from media files, and every file is 4 hours long (which usually means 5000~15000 words).

The first idea was to split every text into 3-second segments, create a document from the words in each time segment, and index it so that it can be searched.
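To make the segment idea concrete, here is a minimal Python sketch of the splitting step. It assumes the recognizer output is already available as (StartMillisec, Word) pairs sorted by time; the "Text" field name and the helper name are hypothetical, not part of the mapping shown in this post.

```python
from collections import defaultdict

SEGMENT_MS = 3000  # 3-second segments, as described above


def split_into_segments(words, segment_ms=SEGMENT_MS):
    """Group (start_millisec, word) pairs into fixed-length time segments.

    Assumes `words` is sorted by time. Returns one dict per non-empty
    segment, ordered by time. "Text" is a hypothetical field name.
    """
    buckets = defaultdict(list)
    for start_ms, word in words:
        # Integer division assigns each word to the segment it starts in.
        buckets[start_ms // segment_ms].append(word)
    return [
        {"StartMillisec": index * segment_ms, "Text": " ".join(buckets[index])}
        for index in sorted(buckets)
    ]


words = [(0, "a"), (900, "bunch"), (2100, "of"), (3200, "things"), (7000, "here")]
segments = split_into_segments(words)
```

In this toy input the phrase "bunch of things" straddles the boundary at 3000 ms, so no single segment document contains it, which is the kind of weakness fixed segments have.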
I thought that would not work very well, so the next idea was to create a document for every window of 10~12 words, scanning the text and advancing 2 words at a time, so that a search could at least match a decent phrase and highlight the hits too.
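The sliding window could be sketched roughly as follows, again assuming sorted (StartMillisec, Word) pairs as input. The function name and the "Text" field are hypothetical; each window document keeps the timestamp of its first word so a hit can be mapped back to a position in the media file.

```python
def sliding_windows(words, size=10, step=2):
    """Build one document per window of `size` words, advancing `step` words.

    The last window may be shorter than `size` so that trailing words
    are still covered. "Text" is a hypothetical field name.
    """
    docs = []
    for i in range(0, len(words), step):
        chunk = words[i:i + size]
        docs.append({
            "StartMillisec": chunk[0][0],  # timestamp of the window's first word
            "Text": " ".join(word for _, word in chunk),
        })
        if i + size >= len(words):
            break  # this window already reaches the end of the text
    return docs


words = [(0, "this"), (400, "is"), (800, "a"), (1200, "bunch"),
         (1600, "of"), (2000, "things"), (2400, "here")]
docs = sliding_windows(words, size=4, step=2)
```

With size=10 and step=2 every word ends up in about five documents, so the index grows roughly fivefold compared with indexing each text once.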
Since that is still far from perfect, I thought it would be nice to index every whole text as a single document to keep it coherent; the problem is the timestamp associated with every word. To keep this relationship I tried to use nested objects in the document:

PUT index-tapes-nested
{
    "mappings" : {
        "_doc" : {
            "properties" : {
                "$type" : { "type" : "text" },
                "ContentId" : { "type" : "long" },
                "Inserted" : { "type" : "date" },
                "TrackId" : { "type" : "long" },
                "Words" : {
                    "type" : "nested",
                    "properties" : {
                      "StartMillisec" : { "type" : "integer" },
                      "Word": { "type" : "text" }
                    }
                }
            }
        }
    }
}

This kind of works, but I don't know exactly how to write the query to search the index.
A very basic query could be, for example:

GET index-tapes-nested/_search
{
  "query":{
    "nested":{
      "path":"Words",
      "score_mode":"avg",
      "query":{
        "match":{
          "Words.Word": "a bunch of things"
        }
      },
      "inner_hits": {}
    }
  }
}

but something like that, especially with avg scoring, gives low-quality results: the right document may be among the hits, but the query ignores word order, so the results are neither reliable nor clear.
As far as I understand, span_near should come in handy in these situations, but I get no results:

GET index-tapes-nested/_search
{
  "query": {
    "nested":{
      "path":"Words",
      "score_mode": "avg",
      "query": {
        "span_near": {
          "clauses": [
            { "span_term": { "Words.Word": "bunch" }},
            { "span_term": { "Words.Word": "of" }},
            { "span_term": { "Words.Word": "things" }}
          ],
          "slop": 2,
          "in_order": true
        }
      }
    }
  }
}

I don't know much about Elasticsearch. Maybe I should change approach and change the model, or maybe rewriting the query is enough; I don't know. This is pretty time-consuming, so any help is really appreciated (is this a fairly common task?). For the sake of brevity I'm leaving out some details and ideas, but I can provide data or other examples if needed.
I also had problems managing the nested index with the C# NEST client, but that is another story.


Solution

  • This could be interpreted in a few ways, I guess: as something like an "alternative stream" for a field, as metadata for every word, and so on. What I needed was this: https://github.com/elastic/elasticsearch/issues/5736 but it's not done yet, so for now I think I'll go with the annotated_text plugin or the 10-word window.
    I have no idea whether, when indexing single words, there can be a query that 'restores' the integrity of the original text (which means 1. grouping the words by an id and 2. ordering them) so that Elasticsearch can give the desired results.
    I'll keep searching the docs for something interesting, or for something I can hack to get what I need (like require_field_match or the intervals query).
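For reference, the "group by id, then order" step can at least be done on the client side once single-word hits have been fetched. The following Python sketch is not an Elasticsearch feature and not a query; it only assumes each hit carries ContentId, StartMillisec and Word as in the mapping above, and the function names are hypothetical.

```python
from collections import defaultdict


def restore_texts(word_docs):
    """Group single-word documents by ContentId and order them by time."""
    grouped = defaultdict(list)
    for doc in word_docs:
        grouped[doc["ContentId"]].append((doc["StartMillisec"], doc["Word"]))
    # Sorting the (StartMillisec, Word) tuples restores the spoken order.
    return {content_id: sorted(items) for content_id, items in grouped.items()}


def find_phrase(ordered_words, phrase):
    """Return the StartMillisec of each exact, in-order match of `phrase`."""
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for _, word in ordered_words]
    return [
        ordered_words[i][0]
        for i in range(len(tokens) - len(target) + 1)
        if tokens[i:i + len(target)] == target
    ]


word_docs = [
    {"ContentId": 1, "StartMillisec": 2100, "Word": "of"},
    {"ContentId": 2, "StartMillisec": 500, "Word": "things"},
    {"ContentId": 1, "StartMillisec": 900, "Word": "bunch"},
    {"ContentId": 1, "StartMillisec": 3200, "Word": "things"},
    {"ContentId": 1, "StartMillisec": 0, "Word": "a"},
]
texts = restore_texts(word_docs)
matches = find_phrase(texts[1], "bunch of things")
```

This only does exact, case-insensitive matching on whitespace tokens, with no analysis, slop or scoring, so it is a fallback for locating timestamps rather than a replacement for a proper phrase query.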