elasticsearch

ElasticSearch query with filters and occurrence number


I have an ES instance that I push logs into, and ES is then used to search those logs. This is not ideal and there are plans to change it, but it is what it is. Sorry for the long description; bear with me, the question is simple.

For now the search goes like this:

This gives me the first occurrence of a line matching a particular query (because the results are sorted, e.g. by timestamp). I also get the total hits, so I can present the user with:

So the user knows this is occurrence 1/300 and can prompt the UI to find the next one. The search stays the same, but when the user wants the next occurrence, I just pass from=1, from=2, etc. The performance of this is pretty okay, since I only have to download one line from ES.
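To make the paging scheme concrete, here is a minimal sketch of how the request body for the N-th occurrence could be built. The field names `line` and `line_no` and the helper name `occurrence_request` are assumptions for illustration; my real index sorts by timestamp.

```python
def occurrence_request(phrase: str, occurrence: int) -> dict:
    """Build a search body that fetches only the N-th match (1-based)."""
    if occurrence < 1:
        raise ValueError("occurrence is 1-based")
    return {
        "from": occurrence - 1,    # from=0 is the first occurrence
        "size": 1,                 # download a single line
        "track_total_hits": True,  # ask for an exact total instead of a capped one
        "query": {"match": {"line": phrase}},       # assumed text field
        "sort": [{"line_no": {"order": "asc"}}],    # assumed sort field
    }


if __name__ == "__main__":
    print(occurrence_request("content", 2))
```

One caveat with this approach: `from` + `size` is limited by `index.max_result_window` (10,000 by default), so very deep occurrences would likely need a different paging mechanism such as `search_after`.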

That's great. However, this is all on a website that shows the user the logs. When the user does the initial search (before going to the next/previous occurrence), I want to show them the first matching line "after their cursor position".

For example, the user sees:

58 foo
59 bar
60 baz
[...]

so I want to scroll them down to the first matching line after line 58, not before.

The problem is, I still want to display the 1/<something> occurrences found. In this case it could be that the initial search would return for example a fifth occurrence, i.e. 5/300. And the user could go to previous/next ones.

So, the solution is to download all the matching lines (without from= and size= in query). And then just do a for loop on them, find the line that has a line number higher than the one the user sees (i.e. 58), return it. And by doing that, I can also count "which occurrence" is that, so I'll know to display for example 5/300 on UI.
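A rough sketch of that client-side loop, assuming the matching line numbers have already been downloaded and are sorted in ascending order (the function name `first_match_after` is made up for this example):

```python
from typing import Optional, Sequence, Tuple


def first_match_after(
    line_numbers: Sequence[int], cursor: int
) -> Optional[Tuple[int, int, int]]:
    """Return (line_number, occurrence, total) for the first match after cursor.

    line_numbers must hold every matching line number in ascending order.
    Returns None when no match lies after the cursor.
    """
    total = len(line_numbers)
    for index, line_number in enumerate(line_numbers):
        if line_number > cursor:
            # enumerate is 0-based, occurrences shown to the user are 1-based
            return line_number, index + 1, total
    return None


if __name__ == "__main__":
    matches = [54, 55, 56, 57, 61, 62, 63]
    print(first_match_after(matches, 58))  # (61, 5, 7)
```

The loop itself is cheap; the cost is in transferring every matching line just to count how many come before the cursor.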

The problem with that is: I have to download all the lines from ES to do that. In case of indexes that have millions and millions of lines, that could be a huge performance hit. So what I want to know is: is there a way to tell Elastic to:

so for lines like:

54 content
55 content
56 content
57 content
58 foo
59 bar
60 baz
61 content
[...]

for the phrase "content", searching "from line 58", I'd get a response like:

{
  "line": {"line_number": 61, "content": "content"},
  "total_hits": 300,
  "occurrence": 5
}

Solution

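  • There are several different methods of achieving this, all based on the same principle. You need to perform three searches: one that returns the first matching line after the current position, one that counts all matching lines, and one that counts the matching lines at or before the current position.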

    This can be done with multi-search, with filter + top_hits aggregation, or with filter + global aggregation. Here is an example of how to achieve that using filter + global aggregation:

    DELETE test
    PUT test
    {
      "mappings": {
        "properties": {
          "line_no": {
            "type": "integer"
          },
          "line": {
            "type": "text"
          }
        }
      }
    }
    
    POST test/_bulk?refresh=true
    { "index": { "_id": "1" } }
    { "line_no": 54, "line": "content"}
    { "index": { "_id": "2" } }
    { "line_no": 55, "line": "content"}
    { "index": { "_id": "3" } }
    { "line_no": 56, "line": "content"}
    { "index": { "_id": "4" } }
    { "line_no": 57, "line": "content"}
    { "index": { "_id": "5" } }
    { "line_no": 58, "line": "foo"}
    { "index": { "_id": "6" } }
    { "line_no": 59, "line": "bar"}
    { "index": { "_id": "7" } }
    { "line_no": 60, "line": "baz"}
    { "index": { "_id": "8" } }
    { "line_no": 61, "line": "content"}
    { "index": { "_id": "9" } }
    { "line_no": 62, "line": "content"}
    { "index": { "_id": "10" } }
    { "line_no": 63, "line": "content"}
    
    
    
    POST test/_search?filter_path=hits.hits,aggregations.all.all_occurrences.doc_count,aggregations.all.all_occurrences.previous_occurrences.doc_count
    {
      "size": 1,
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "line_no": {
                  "gt": 59
                }
              }
            },
            {
              "match": {
                "line": "content"
              }
            }
          ]
        }
      },
      "sort": [
        {
          "line_no": {
            "order": "asc"
          }
        }
      ],
      "aggs": {
        "all": {
          "global": {},
          "aggs": {
            "all_occurrences": {
              "filter": {
                "match": {
                  "line": "content"
                }
              },
              "aggs": {
                "previous_occurrences": {
                  "filter": {
                    "range": {
                      "line_no": {
                        "lte": 59
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
    

    The result of this query will be:

    {
      "hits": {
        "hits": [
          {
            "_index": "test",
            "_id": "8",
            "_score": 1.3829923,
            "_source": {
              "line_no": 61,
              "line": "content"
            },
            "sort": [
              61
            ]
          }
        ]
      },
      "aggregations": {
        "all": {
          "all_occurrences": {
            "doc_count": 7,
            "previous_occurrences": {
              "doc_count": 4
            }
          }
        }
      }
    }
    

    In the result above, hits.hits[0] is the next line matching your query after line 59. aggregations.all.all_occurrences.doc_count is the number of lines that contain "content" (it was 300 in your theoretical example, but I reduced it to 7 to keep the example concise). Finally, aggregations.all.all_occurrences.previous_occurrences.doc_count is the number of occurrences at or before your current line. To get the current occurrence number, add 1 to it.
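As a rough sketch of how this could be wired up on the client side, the following Python builds the same request for an arbitrary phrase and cursor line, and then turns the response into the shape asked for in the question. The helper names `build_search` and `summarize` are made up for this example, the field names come from the test mapping above, and `summarize` assumes the response carries both doc_count values.

```python
from typing import Optional


def build_search(phrase: str, cursor_line: int) -> dict:
    """Build the filter + global aggregation request for a phrase and cursor."""
    match = {"match": {"line": phrase}}
    return {
        "size": 1,
        "query": {"bool": {"must": [
            {"range": {"line_no": {"gt": cursor_line}}},  # only lines after cursor
            match,
        ]}},
        "sort": [{"line_no": {"order": "asc"}}],
        "aggs": {"all": {
            "global": {},  # ignore the range restriction of the main query
            "aggs": {"all_occurrences": {
                "filter": match,
                "aggs": {"previous_occurrences": {
                    "filter": {"range": {"line_no": {"lte": cursor_line}}},
                }},
            }},
        }},
    }


def summarize(response: dict) -> Optional[dict]:
    """Convert a search response into line / total_hits / occurrence."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None  # no match after the cursor
    occurrences = response["aggregations"]["all"]["all_occurrences"]
    source = hits[0]["_source"]
    return {
        "line": {"line_number": source["line_no"], "content": source["line"]},
        "total_hits": occurrences["doc_count"],
        # matches at or before the cursor, plus one for the current hit
        "occurrence": occurrences["previous_occurrences"]["doc_count"] + 1,
    }
```

The body returned by `build_search` would be sent to `test/_search` with whatever client you already use. With the sample data above and a cursor at line 59, `summarize` would yield line 61 as occurrence 5 of 7.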