Tags: python, python-3.x, dask, dask-distributed

dask persist behavior inconsistent


I've found some odd behavior with Dask's persist. If I comment out this line:

# client = Client(memory_limit='20GB',n_workers=1)  # Connect to distributed cluster and override default

and perform

dd_feature_009a013a_normalized_noneedshift = dd_feature_009a013a_normalized_noneedshift.head(1000000,compute=False).persist()

persist operates as expected: the result is computed and stored in memory, letting me access it instantaneously. However, if I uncomment

client = Client(memory_limit='20GB',n_workers=1)  # Connect to distributed cluster and override default

Then

dd_feature_009a013a_normalized_noneedshift = dd_feature_009a013a_normalized_noneedshift.head(1000000,compute=False).persist()
dd_feature_009a013a_normalized_noneedshift = client.persist(dd_feature_009a013a_normalized_noneedshift)  # second persist on an already-persisted collection

nothing seems to happen: a lazy dataframe is returned immediately. What should I do to get the same behavior when I turn on client = Client(memory_limit='20GB', n_workers=1)?


Solution

  • When we persist an object with client.persist, we get back a collection backed by futures that refer to the results of the computation. On a distributed cluster, persist returns immediately while the computation runs in the background; once computed, the results are stored on one or more workers, as appropriate. Running client.persist on an already-persisted collection just gives back another layer of futures — a reference to a reference to the computation — which is likely unnecessary.

    To get the result of a future, call .result() on the future itself. This blocks further commands until the future is computed and the result is returned.