I'm learning how to work with large datasets, so I'm using modin.pandas. I'm doing some aggregation, after which a 50GB dataset should shrink to something closer to 5GB, and now I need a check: if the DataFrame is small enough to fit in RAM, I want to cast it to pandas and enjoy a reliable, bug-free library.

So, naturally, the question is: how do I check that? `.memory_usage(deep=True).sum()` tells me how much the whole DataFrame uses, but from that one number I can't tell how much of it is in RAM and how much is in swap; in other words, how much free RAM I need in order to cast the DataFrame to pandas. Are there other ways? Am I even right to assume that some partitions live in RAM while others live in swap? How do I calculate how much data will flood the RAM when I call `._to_pandas()`? Is there a hidden `.__memory_usage_in_swap_that_needs_to_fit_in_ram()` of some sort?
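For concreteness, this is roughly the check I'm hoping to write. The psutil call, the 0.8 safety margin, and the file/column names are just my placeholders, and I don't even know whether comparing against available RAM is the right condition here:

```python
import modin.pandas as pd
import psutil

df = pd.read_csv("big_dataset.csv")        # hypothetical 50GB input
agg = df.groupby("some_key").sum()         # hypothetical aggregation, ~5GB

needed = int(agg.memory_usage(deep=True).sum())  # bytes the frame reports
available = psutil.virtual_memory().available    # bytes of free physical RAM

if needed < 0.8 * available:   # arbitrary safety margin
    agg = agg._to_pandas()     # materialize as a plain pandas DataFrame
```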
> Am I even right to assume that some partitions live in RAM while others live in swap?
Modin doesn't specify whether data should be in RAM or swap.
On Ray, Modin uses `ray.put` to store partitions. `ray.put` doesn't give any guarantees about where the data will go. Note that Ray spills data blocks to disk when they are too large for its in-memory object store. You can use the `ray memory` CLI command to get a summary of how much of each kind of storage Ray is using.
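If you want that summary from inside your script rather than a terminal, the simplest workaround I can suggest (my own sketch, not a Modin or Ray API for this) is to shell out to the same command; you would still need to parse its text output for an automated check:

```python
import subprocess

# Run the `ray memory` CLI from Python and print its report of
# object store usage on the running Ray cluster.
report = subprocess.run(
    ["ray", "memory"],
    capture_output=True,
    text=True,
    check=True,
)
print(report.stdout)
```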
On Dask, Modin uses `dask.Client.scatter`, which also doesn't give any guarantees about where the data will go, to store partition data. I don't know of any way to figure out how much of the stored data is really in RAM.
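One thing you can do on the Dask side is ask the worker processes themselves how much physical memory they are using. This is a sketch of my own, not anything Modin exposes: it assumes psutil is installed, that Modin has already started a Dask `Client` in this process (so `default_client()` finds it), and the numbers include each worker's own overhead, not just partition data:

```python
import psutil
from distributed.client import default_client

def resident_bytes():
    # Resident set size of the current (worker) process, in bytes.
    return psutil.Process().memory_info().rss

client = default_client()              # the Client Modin is already using
rss_per_worker = client.run(resident_bytes)
for worker, rss in rss_per_worker.items():
    print(f"{worker}: {rss / 1e9:.2f} GB resident")
```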