I found some weird behavior with Dask's persist. If I comment out this line
# client = Client(memory_limit='20GB',n_workers=1) # Connect to distributed cluster and override default
and perform
dd_feature_009a013a_normalized_noneedshift = dd_feature_009a013a_normalized_noneedshift.head(1000000,compute=False).persist()
persist operates as expected: the result is computed and stored in memory, letting me access it instantaneously. However, if I uncomment
client = Client(memory_limit='20GB',n_workers=1) # Connect to distributed cluster and override default
Then
dd_feature_009a013a_normalized_noneedshift = dd_feature_009a013a_normalized_noneedshift.head(1000000,compute=False).persist()
dd_feature_009a013a_normalized_noneedshift = client.persist(dd_feature_009a013a_normalized_noneedshift)
does nothing: a lazy dataframe is returned immediately. What should I do to achieve the same behavior when I turn on client = Client(memory_limit='20GB', n_workers=1)?
When we persist an object with client.persist, we get back a future that refers to the results of the computation. Once computed, the result is stored on one or more workers, as appropriate. Running client.persist on something that has already been persisted just gives back another future, i.e. a reference to a reference to the computation, which is likely unnecessary.
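To illustrate, here is a minimal sketch using a small synthetic dataframe (the names ddf and the data are illustrative, since the original data isn't shown): client.persist returns immediately with a collection backed by futures, and wait from dask.distributed can be used to block until the data is actually held in worker memory.

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, wait

# Assumed setup: a synthetic frame standing in for the original data
client = Client(memory_limit='20GB', n_workers=1)
ddf = dd.from_pandas(pd.DataFrame({'x': range(1_000_000)}), npartitions=8)

# persist returns immediately; the computation runs in the background
# on the worker and the returned collection is backed by futures
ddf = client.persist(ddf.head(1_000_000, npartitions=-1, compute=False))

# Block until the underlying futures finish, so later operations on ddf
# read in-memory results instead of recomputing the graph
wait(ddf)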
To get the result of a future, call .result() on the future itself. This blocks further commands until the future has been computed and the result is returned.
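For example, continuing the sketch above, client.compute turns the persisted collection into a single future, and .result() blocks until a concrete pandas DataFrame comes back:

# client.compute returns one Future for the whole collection;
# .result() blocks and returns an ordinary pandas DataFrame
future = client.compute(ddf)
pdf = future.result()
print(len(pdf))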