I'm using modin.pandas to scale pandas for a large dataset. However, when I use read_csv to load a 5 MB CSV file in a Jupyter notebook to compare the performance of modin.pandas and pandas, the execution times are unexpected: modin.pandas takes more time than pandas. Why?
Code-
import modin.pandas as mpd
df = mpd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")
import pandas as pd
df = pd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")
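A minimal way to reproduce the timing comparison in a notebook cell (a sketch assuming IPython's built-in %time magic; mdf and pdf are hypothetical names used to avoid reusing df for both results):

import modin.pandas as mpd
import pandas as pd

path = r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv"
%time mdf = mpd.read_csv(path)  # Modin timing
%time pdf = pd.read_csv(path)   # pandas timing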
Here's the link to the CSV file. I'm using modin version 0.8.3 and pandas version 1.1.5.
Output screenshot-
System information-
Edit: I tried using a 500 MB CSV file and the result only slightly improved. The execution times for modin and pandas are almost the same now. Is this usual?
It appears that Modin does some initialisation the first time it runs, which would explain why your Modin time was slower than your Pandas time for the 5MB CSV file.
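If you want the first timed read to exclude that start-up cost, one option is to initialize Modin's execution engine yourself before timing. This is a sketch assuming the Ray engine, which is Modin's default; ray.init() is Ray's standard entry point:

import ray
ray.init()  # one-off engine start-up happens here, not inside read_csv

import modin.pandas as mpd

df = mpd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")  # measured without initialisation cost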
I investigated how long it took to load various sizes of CSV file on a system with four cores, with both Pandas and Modin. Here is the graph of the results for CSV files from 5MB to 100MB:
And for files up to 2GB:
The results show that, on the system tested:
- Pandas was faster only on the very first read (the 5MB file), where Modin's one-off start-up cost dominated.
- At every other file size, Modin read the file faster than Pandas.
- Modin's advantage grew with file size, reaching roughly 2x at 2GB (15.71 versus 30.58 seconds).
This is the code used to generate the results.
from pathlib import Path
from timeit import timeit

import modin.pandas as mpd
import pandas as pd


def create_input_file(filename, content, repetitions):
    # Build a test file by repeating the 5MB survey content; skip files
    # that already exist so repeated runs don't append duplicate data.
    path = Path(filename)
    if not path.exists():
        with path.open("a", encoding="utf-8") as f:
            for _ in range(repetitions):
                f.write(content)


def create_input_files(min_size, max_size, increment):
    content = Path("survey.csv").read_text(encoding="utf-8")
    for size in range(min_size, max_size + 1, increment):
        create_input_file(
            filename="survey{}MB.csv".format(size),
            content=content,
            repetitions=size // 5,  # survey.csv is 5MB, so repeat size/5 times
        )


def time_csv_read(module, filename, description):
    # Time a single read_csv call from whichever module is passed in.
    print(
        "{}: {:.2f} seconds".format(
            description,
            timeit(lambda: getattr(module, "read_csv")(filename), number=1),
        )
    )


def time_csv_reads(min_size, max_size, increment):
    for size in range(min_size, max_size + 1, increment):
        time_csv_read(pd, "survey{}MB.csv".format(size), "Pandas {}MB".format(size))
        time_csv_read(mpd, "survey{}MB.csv".format(size), "Modin {}MB".format(size))


def main():
    # 5MB to 95MB in 5MB steps, then 100MB to 2GB in 100MB steps.
    min_size1 = 5
    max_size1 = 95
    increment1 = 5
    min_size2 = 100
    max_size2 = 2000
    increment2 = 100
    create_input_files(min_size1, max_size1, increment1)
    create_input_files(min_size2, max_size2, increment2)
    time_csv_reads(min_size1, max_size1, increment1)
    time_csv_reads(min_size2, max_size2, increment2)


if __name__ == "__main__":
    main()
And here is the raw output (with warning messages removed):
Pandas 5MB: 0.12 seconds
Modin 5MB: 0.23 seconds
Pandas 10MB: 0.13 seconds
Modin 10MB: 0.12 seconds
Pandas 15MB: 0.19 seconds
Modin 15MB: 0.16 seconds
Pandas 20MB: 0.24 seconds
Modin 20MB: 0.20 seconds
Pandas 25MB: 0.31 seconds
Modin 25MB: 0.25 seconds
Pandas 30MB: 0.37 seconds
Modin 30MB: 0.29 seconds
Pandas 35MB: 0.40 seconds
Modin 35MB: 0.34 seconds
Pandas 40MB: 0.45 seconds
Modin 40MB: 0.37 seconds
Pandas 45MB: 0.51 seconds
Modin 45MB: 0.42 seconds
Pandas 50MB: 0.55 seconds
Modin 50MB: 0.46 seconds
Pandas 55MB: 0.62 seconds
Modin 55MB: 0.50 seconds
Pandas 60MB: 0.67 seconds
Modin 60MB: 0.53 seconds
Pandas 65MB: 0.74 seconds
Modin 65MB: 0.57 seconds
Pandas 70MB: 0.76 seconds
Modin 70MB: 0.61 seconds
Pandas 75MB: 0.87 seconds
Modin 75MB: 0.65 seconds
Pandas 80MB: 0.90 seconds
Modin 80MB: 0.67 seconds
Pandas 85MB: 0.93 seconds
Modin 85MB: 0.73 seconds
Pandas 90MB: 0.97 seconds
Modin 90MB: 0.74 seconds
Pandas 95MB: 1.34 seconds
Modin 95MB: 0.80 seconds
Pandas 100MB: 1.11 seconds
Modin 100MB: 0.83 seconds
Pandas 200MB: 2.21 seconds
Modin 200MB: 1.62 seconds
Pandas 300MB: 3.28 seconds
Modin 300MB: 2.40 seconds
Pandas 400MB: 5.48 seconds
Modin 400MB: 3.25 seconds
Pandas 500MB: 8.61 seconds
Modin 500MB: 3.92 seconds
Pandas 600MB: 8.11 seconds
Modin 600MB: 4.64 seconds
Pandas 700MB: 9.48 seconds
Modin 700MB: 5.70 seconds
Pandas 800MB: 11.40 seconds
Modin 800MB: 6.35 seconds
Pandas 900MB: 12.63 seconds
Modin 900MB: 7.17 seconds
Pandas 1000MB: 13.59 seconds
Modin 1000MB: 7.91 seconds
Pandas 1100MB: 14.84 seconds
Modin 1100MB: 8.63 seconds
Pandas 1200MB: 17.27 seconds
Modin 1200MB: 9.42 seconds
Pandas 1300MB: 17.77 seconds
Modin 1300MB: 10.22 seconds
Pandas 1400MB: 19.38 seconds
Modin 1400MB: 11.15 seconds
Pandas 1500MB: 21.77 seconds
Modin 1500MB: 11.98 seconds
Pandas 1600MB: 26.79 seconds
Modin 1600MB: 12.55 seconds
Pandas 1700MB: 23.55 seconds
Modin 1700MB: 13.66 seconds
Pandas 1800MB: 26.41 seconds
Modin 1800MB: 13.89 seconds
Pandas 1900MB: 28.44 seconds
Modin 1900MB: 15.15 seconds
Pandas 2000MB: 30.58 seconds
Modin 2000MB: 15.71 seconds
The fact that Modin processed the 10MB file faster than the 5MB file suggested to me that Modin does some initialisation work the first time it runs, so I tested this theory by reading the same 5MB file multiple times. The first time took 0.28 seconds, and subsequent times all took 0.08 seconds. You should see a similar difference in performance if you run Modin multiple times in the same Python process.
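Here is a sketch of that repeated-read test, using the survey5MB.csv file generated by the script above:

from timeit import timeit

import modin.pandas as mpd

# Read the same 5MB file several times within one Python process; the
# first call includes Modin's one-off start-up work, the later calls don't.
for run in range(1, 6):
    seconds = timeit(lambda: mpd.read_csv("survey5MB.csv"), number=1)
    print("Run {}: {:.2f} seconds".format(run, seconds))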
This initialisation work is different from the type of overhead I was talking about in my comment on your question. I was thinking of code that splits the work into chunks, sends it to each processor, and pieces the results back together when the processors are finished with each chunk. That kind of overhead occurs every time Modin reads a CSV file; the extra work that Modin does the first time it runs must be something else.

So once Modin has done its initialisation, it is worth using even for files as small as 5MB. For files smaller than that, the per-read overhead I was talking about will likely become a factor, but it will take more investigation to know how small the files need to be for it to make a difference.
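If you want to probe that threshold on your own system, here is a sketch in the style of the benchmark above; the sub-5MB sizes are arbitrary choices, and truncating mid-row may leave a short final line, which is harmless for timing purposes:

from pathlib import Path
from timeit import timeit

import modin.pandas as mpd
import pandas as pd

# Create sub-5MB files by truncating the survey content, then time both
# libraries on each; run in a warmed-up process so Modin's one-off
# initialisation doesn't skew the small-file numbers.
content = Path("survey.csv").read_text(encoding="utf-8")
for kilobytes in (64, 256, 1024, 4096):
    filename = "survey{}KB.csv".format(kilobytes)
    Path(filename).write_text(content[: kilobytes * 1024], encoding="utf-8")
    for module, name in ((pd, "Pandas"), (mpd, "Modin")):
        seconds = timeit(lambda: module.read_csv(filename), number=1)
        print("{} {}KB: {:.2f} seconds".format(name, kilobytes, seconds))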