I'm a Python newbie who has been having fun working with data in Python.
Once I learned to use Pandas, Python's main data tool, I expected it to handle Excel files very quickly.
However, I was somewhat disappointed to see it take 1 to 2 minutes to load an .xlsx file with 470,000 rows, and while looking into this I found that using Modin with Ray (or Dask) should make things faster.
After learning the basic usage shown below, I compared it against plain Pandas (this time on a 100M-row CSV, about 5GB).
import ray
ray.init()  # start the Ray backend that Modin runs on

import modin.pandas as md  # drop-in replacement for the pandas API

%%time
TB = md.read_csv('train.csv')  # timed in its own cell with the %%time magic
TB
Reading the file took 1 minute 3 seconds with plain Pandas, but 1 minute 9 seconds with modin[ray]. I was disappointed to see that Modin was not just barely faster, but actually slower.
How can I make Modin faster than Pandas? Does it only pay off for complex operations such as groupby or merge? Is there little difference when simply reading data?
Other people report that Modin reads data faster, so is something wrong with my computer's settings? I'd like to know why.
Just in case it matters, here is how I installed the packages at the prompt:
!pip install modin[ray]
!pip install ray[default]
First off, to do a fair assessment you always need to use the %%timeit magic command, which gives you an average of multiple runs.
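For example, a minimal sketch (assuming the same train.csv from the question) would time each reader in its own cell, capping the repeats so a 5GB read isn't executed dozens of times:
%%timeit -n 1 -r 3
# baseline: plain pandas
import pandas as pd
TB = pd.read_csv('train.csv')

%%timeit -n 1 -r 3
# same file through Modin on Ray
import modin.pandas as md
TB = md.read_csv('train.csv')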
Modin generally works best when you have:
- a very large dataset (well beyond a few GB, ideally more than fits comfortably in memory), and/or
- a machine with many CPU cores to spread the work across.
The unimpressive performance in your case is, I believe, largely due to the multiprocessing management done by Ray/Dask, e.g. worker scheduling and all the setup that goes into parallelisation. When you meet at least one of the two criteria above (especially the first, given any current processor), the trade-off between that management overhead and the speed-up Modin gives you works in your favour; but neither a 5GB file nor 6 cores is large enough to tip the balance. Parallelisation is costly, and the task must be worth it.
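To see how your machine measures up against those criteria, a quick check like this (a sketch using only the standard library plus Ray's documented num_cpus argument) shows how many cores the Ray workers actually get:
import os
import ray

# Modin/Ray cannot parallelise beyond the number of logical cores available
print("logical cores:", os.cpu_count())

# initialise Ray explicitly so the size of the worker pool is under your control
ray.init(num_cpus=os.cpu_count(), ignore_reinit_error=True)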
If it is a one-off, 1-2 minutes is not an unreasonable amount of time at all for this sort of thing. If it is a file you are going to read and write repeatedly, I would recommend converting it to HDF5 or pickle format, which will improve your read/write performance far more than switching to Modin.
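For instance, a one-time conversion along these lines (a sketch in plain pandas; 'train.h5', 'train.pkl' and the key name are arbitrary choices) pays for itself on every later read:
import pandas as pd

# one-off: parse the CSV once and store it in binary formats
TB = pd.read_csv('train.csv')
TB.to_hdf('train.h5', key='train', mode='w')  # HDF5 output needs the 'tables' package
TB.to_pickle('train.pkl')

# later sessions: loading the binary copies skips CSV parsing entirely
TB = pd.read_hdf('train.h5', 'train')
TB = pd.read_pickle('train.pkl')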
Alternatively, Vaex is the fastest option around for reading a dataframe. That said, I personally think it is still quite incomplete and doesn't always live up to its promises beyond simple numerical operations, e.g. when you have large strings in your data.
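If you still want to try it, the usual pattern (a sketch; vaex.from_csv with convert=True is the documented route, and the file names just follow the question) is to convert the CSV to HDF5 once and memory-map it afterwards:
import vaex

# one-off: stream-convert the CSV to an HDF5 file next to the original
df = vaex.from_csv('train.csv', convert=True, chunk_size=5_000_000)

# later: open the converted file, which is memory-mapped rather than fully loaded
df = vaex.open('train.csv.hdf5')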