
Elasticsearch not returning data with a large page size


Size of data to get: approx. 20,000

Issue: I am searching Elasticsearch-indexed data from Python using the command below, but I am not getting any results back.

from pyelasticsearch import ElasticSearch
es_repo = ElasticSearch(settings.ES_INDEX_URL)
search_results = es_repo.search(
            query, index=advertiser_name, es_from=_from, size=_size)

If I give a size less than or equal to 10,000 it works fine, but not with 20,000. Please help me find an optimal solution to this.

PS: On digging deeper into ES, I found this error message:

Result window is too large, from + size must be less than or equal to: [10000] but was [19999]. See the scrolling API for a more efficient way to request large data sets.
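
The limit in that message applies to the sum of from and size, not to size alone. A minimal sketch of the check, assuming the index still uses the default index.max_result_window of 10000 (the function name fits_result_window is hypothetical and only for illustration):

```python
# Assumption: 10000 is the default index.max_result_window;
# an index can be configured with a different value.
DEFAULT_MAX_RESULT_WINDOW = 10000


def fits_result_window(es_from, size, max_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True if a from/size request stays inside the result window."""
    return es_from + size <= max_window


# A size of 10,000 from offset 0 is accepted; a size of 20,000 is not.
print(fits_result_window(0, 10000))
print(fits_result_window(0, 20000))
```

Anything that fails this check has to be fetched with search_after or the scroll API instead of plain from/size paging.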


Solution

  • For real-time use, the best solution is the search_after query. You need only a date field and another field that uniquely identifies a document; an _id or _uid field is enough. Try something like this. In my example I want to extract all the documents that belong to a single user, and the user field has a keyword datatype:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es_index = "your_index_name"
    documento = "your_doc_type"

    user = "Francesco Totti"

    # Count the matching documents so we know when to stop paging.
    count_body = {"query": {"term": {"user": user}}}
    res = es.count(index=es_index, doc_type=documento, body=count_body)
    size = res['count']

    # First page: sort on the date plus a unique tiebreaker field.
    # On newer Elasticsearch versions _uid no longer exists, so use
    # another unique field as the tiebreaker there.
    body = {
        "size": 10,
        "query": {"term": {"user": user}},
        "sort": [
            {"date": "asc"},
            {"_uid": "desc"}
        ]
    }

    result = es.search(index=es_index, doc_type=documento, body=body)

    # Following pages: pass the sort values of the last hit as search_after.
    while 0 < len(result['hits']['hits']) < size:
        last_sort = result['hits']['hits'][-1]['sort']
        body["search_after"] = [last_sort[0], str(last_sort[1])]
        res = es.search(index=es_index, doc_type=documento, body=body)
        if not res['hits']['hits']:
            break  # no more pages, e.g. documents were deleted meanwhile
        result['hits']['hits'].extend(res['hits']['hits'])

    All the documents are then appended to result['hits']['hits'].
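
The same paging idea can be wrapped in a generator so that the caller does not need the document count up front. This is only a sketch under stated assumptions: iter_search_after is a hypothetical helper, not part of the elasticsearch package, and search stands for any callable that takes a request body and returns a dict shaped like a search response.

```python
def iter_search_after(search, body, page_size=10):
    """Yield every hit for `body`, one page at a time, using search_after.

    `search` is any callable that takes a request body and returns a dict
    shaped like an Elasticsearch search response (hypothetical interface).
    `body` must already contain a "sort" clause ending in a unique field.
    """
    page = dict(body, size=page_size)
    while True:
        hits = search(page)["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield hit
        if len(hits) < page_size:
            return  # a short page means there is nothing left to fetch
        # Bookmark: the sort values of the last hit on this page.
        page = dict(page, search_after=hits[-1]["sort"])
```

With the client from the example above, it could be called as iter_search_after(lambda b: es.search(index=es_index, body=b), body). Stopping on an empty or short page avoids relying on a count that may change while paging.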

    If you would rather use a scroll query, you can go through helpers.scan:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()
    es_index = "your_index_name"

    user = "Francesco Totti"

    body = {"query": {"term": {"user": user}}}

    # helpers.scan wraps the scroll API and returns a generator over all hits.
    res = helpers.scan(
        client=es,
        scroll='2m',  # how long each scroll context is kept alive
        query=body,
        index=es_index)

    for hit in res:
        print(hit)
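
helpers.scan returns a generator, so hits are fetched lazily as the loop consumes them. As a minimal sketch for gathering only the document bodies, assuming each hit carries the usual _source key (collect_sources is a hypothetical helper, not part of the elasticsearch package):

```python
from itertools import islice


def collect_sources(hits, limit=None):
    """Return the _source of each hit, reading at most `limit` hits.

    `hits` can be any iterable of hit dicts, such as the generator
    returned by helpers.scan. With limit=None every hit is read.
    """
    return [hit["_source"] for hit in islice(hits, limit)]
```

For example, collect_sources(res, limit=1000) would read only the first 1,000 hits from the scan generator and leave the rest unfetched.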