Tags: python, elasticsearch, elasticsearch-query, update-by-query

Add the count of a doc in a Python list to a field in Elasticsearch


I need to update a field of a doc in Elasticsearch by adding the number of times that doc's id occurs in a list in my Python code. The weight field holds the count of the doc in a dataset. The dataset is updated from time to time, so the count of each document must be updated too. hashed_ids is a list of the document ids in the new batch of data. The weight of each matched id must be increased by the count of that id in hashed_ids. I tried the code below, but it does not work.

hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]
update_with_query_body = {
        "script": {
            "source": "ctx._source.content_completion.weight +=param.count",
            "lang": "painless",
            "param": {
                "count": hashed_ids.count("ctx.['_id']")
            }
        },
        "query": {
            "ids": {
                "values": hashed_ids
            }
        }
    }

For example, say a doc with id=d1b145716ce1b04ea53d1ede9875e05a and weight=5 is already present in the index, and the string d1b145716ce1b04ea53d1ede9875e05a appears three times in hashed_ids, so the update-by-query request shown above will match that doc. I need to add 3 to 5 and get 8 as the final weight.


Solution

  • I'm not familiar with Python, but here is an example-based solution with a few assumptions. Let's say the following hashed_ids list has been extracted:

    hashed_ids = ["id1","id1","id1","id2"]
    

    To use it in a terms query, we need only the unique list of ids, i.e.

    hashed_ids_unique = ["id1", "id2"]
    

    Let's assume the docs are indexed with the structure below:

    PUT test/_doc/1
    {
      "id": "id1",
      "weight":9
    }
    

    Now we can use update by query as below:

    POST test/_update_by_query
    {
      "query":{
        "terms": {
          "id":["id1","id2"]
        }
      },
      "script":{
        "source":"long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count(); ctx._source.weight += weightToAdd;",
        "params":{
          "hashed_ids":["id1","id1","id1","id2"]
        }
      }
    }
    

    Explanation for script:

    The following counts how many times the id of the currently matched doc occurs in the hashed_ids list.

    long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count();
    

    The following adds weightToAdd to the existing value of weight in the document.

    ctx._source.weight += weightToAdd;
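
For the Python side, here is a minimal sketch of how the request body above could be built from hashed_ids. Like the example, it assumes that each doc stores its hash in an id field and its count in a top-level weight field; if the hash is only the document _id, or the weight lives under content_completion, the query and script would need adjusting. build_update_by_query_body is a hypothetical helper name, not part of any library.

```python
from collections import Counter


def build_update_by_query_body(hashed_ids):
    """Build an _update_by_query body from a list of ids that may contain repeats.

    Hypothetical helper; assumes docs have an `id` field and a `weight` field.
    """
    # Count how often each id occurs in the new batch, e.g. {"id1": 3, "id2": 1}.
    counts = Counter(hashed_ids)
    return {
        "query": {
            # Each doc only needs to match once, so pass the distinct ids.
            "terms": {"id": sorted(counts)}
        },
        "script": {
            "lang": "painless",
            # Same Painless script as in the answer above.
            "source": (
                "long weightToAdd = params.hashed_ids.stream()"
                ".filter(idFromList -> ctx._source.id.equals(idFromList)).count(); "
                "ctx._source.weight += weightToAdd;"
            ),
            # The full list, repeats included, so the script can count them.
            "params": {"hashed_ids": list(hashed_ids)},
        },
    }


body = build_update_by_query_body(["id1", "id1", "id1", "id2"])
```

With the official Python client, the body would typically be sent with something like es.update_by_query(index="test", body=body), though the exact call depends on the client version. Note that the key must be "params" and the script must read params.hashed_ids; the "param" spelling in the question is one reason the original attempt fails, and hashed_ids.count("ctx.['_id']") is evaluated in Python before the request is sent, so it always counts a literal string that is not in the list.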