Tags: python, dataframe, multiprocessing, dask, local

Dask nanny memory error - Worker is too slow to terminate


I am trying to open a jsonl file with Dask. When I run the program I first get a warning that a worker is using more memory than allocated; eventually the nanny tries to terminate the worker but fails, and the whole program crashes, reporting that all 4 workers died trying to run it.

I tried asking ChatGPT and looking through the documentation, but I couldn't find the worker logs or any info on how to solve the issue. The only suggestion I saw was disabling the nanny, but I decided that wouldn't be a good idea since there must be something fundamentally wrong with my code, so I'm turning to Stack Overflow because I'm lost.

(Screenshots: Terminal output (1), Terminal output (2))

from dask.distributed import LocalCluster
import dask.dataframe as dd
from multiprocessing import freeze_support

if __name__ == '__main__':
    freeze_support()
    cluster = LocalCluster(n_workers=2, processes=True, threads_per_worker=200)
    client = cluster.get_client()

    df = dd.read_json("merged_en.jsonl")
    df.x.sum().compute()
    client.close()
    cluster.close()

Solution

  • What happens here is that you're trying to load the file into a single partition, and the file is too big to fit in the memory of one worker.

    If the file is JSON-lines data (as the jsonl extension suggests), you should pass the lines kwarg and specify a blocksize:

    df = dd.read_json("merged_en.jsonl", lines=True, blocksize="128 MiB")
    

    See the documentation for more information.
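
    For completeness, here is a minimal sketch of how the whole script could look with line-delimited reading and a more conventional worker layout. The n_workers/threads_per_worker values are only illustrative, not tuned for your machine, and the numeric column x is assumed from your snippet:

    from dask.distributed import LocalCluster
    import dask.dataframe as dd
    from multiprocessing import freeze_support

    if __name__ == '__main__':
        freeze_support()
        # A few processes with a couple of threads each is a more typical
        # layout than threads_per_worker=200 (illustrative values).
        cluster = LocalCluster(n_workers=4, processes=True, threads_per_worker=2)
        client = cluster.get_client()

        # lines=True treats the file as JSON-lines; blocksize splits it into
        # multiple partitions so no single worker has to hold the whole file.
        df = dd.read_json("merged_en.jsonl", lines=True, blocksize="128 MiB")
        print(df.npartitions)        # should be greater than 1 for a large file

        print(df.x.sum().compute())  # assumes a numeric column named x

        client.close()
        cluster.close()

    With multiple partitions, each worker only holds a blocksize-sized chunk at a time, so the nanny should no longer need to kill workers for exceeding their memory limit.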