daskdask-distributeddask-delayed

what is the default directory where dask workers store results or files.?


[mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.26.32.36:50930'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging
distributed.worker - INFO -       Start worker at:   tcp://172.26.32.36:41694
distributed.worker - INFO -          Listening to:   tcp://172.26.32.36:41694
distributed.worker - INFO -              bokeh at:          172.26.32.36:8789
distributed.worker - INFO -              nanny at:         172.26.32.36:50930
distributed.worker - INFO - Waiting to connect to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   33.52 GB
distributed.worker - INFO -       Local Directory: /home/mapr/latest_code_deepak/dask-worker-spa                                                                 ce/worker-AkBPtM
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------

what is the default directory where a dask-worker maintains the temporary files, such as task results, or the downloaded files which was uploaded using upload_file() method from the client.?

for example:-

def my_task_running_on_dask_worker():
    //fetch the file from hdfs
    // process the file
    //store the file back into hdfs

Solution

  • By default a dask worker places a directory in ./dask-worker-space/worker-####### where ###### is some random string for that particular worker.

    You can change this location using the --local-directory keyword to the dask-worker executable.

    The warning that you're seeing in this line

    distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging
    

    says that a Dask worker noticed that the directory for another worker wasn't cleaned up, presumably because it failed in some hard way. This worker is cleaning up the space left behind from the previous worker.

    Edit

    You can see which worker creates which directory either by looking at the logs of each worker (They print out their local directory)

    $ dask-worker localhost:8786
    distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36607
    ...
    distributed.worker - INFO -       Local Directory: /home/mrocklin/dask-worker-space/worker-ks3mljzt
    

    Or programatically by calling client.scheduler_info()

    >>> client.scheduler_info()
    {'address': 'tcp://127.0.0.1:34027',
     'id': 'Scheduler-bd88dfdf-e3f7-4b39-8814-beae779248f1',
     'services': {'bokeh': 8787},
     'type': 'Scheduler',
     'workers': {'tcp://127.0.0.1:33143': {'cpu': 7.7,
        ... 
       'local_directory': '/home/mrocklin/dask-worker-space/worker-8kvk_l81',
      },
    ...