python, pandas, parallel-processing, jupyter-notebook, modin

Modin is taking more time than pandas for reading CSV


I'm using modin.pandas to scale pandas for large datasets. However, when I use read_csv to load a 5 MB CSV file in a Jupyter notebook to compare the performance of modin.pandas and pandas, the execution times are not what I expected.

modin.pandas is taking more time than pandas. Why?

Code:

import modin.pandas as mpd
df = mpd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")

import pandas as pd
df = pd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")

Here's the link to the CSV file. I'm using modin version 0.8.3 and pandas version 1.1.5.

Output screenshot:

[Jupyter notebook output]

System information:

[System information]

Edit: I tried a 500 MB CSV file and the result improved only slightly; the execution times for Modin and pandas are now almost the same. Is this usual?


Solution

  • It appears that Modin does some initialisation work the first time it runs, which would explain why your Modin time was slower than your Pandas time for the 5MB CSV file.
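
    If initialisation is indeed the cause, the startup cost can be paid before any timing starts. Below is a minimal sketch, assuming Modin is running on its Ray engine (an assumption on my part; if your install uses Dask instead, starting a dask.distributed.Client serves the same purpose):

    import ray

    # Start the Ray workers once, up front, so the one-off startup cost
    # is not charged to the first read_csv call. (Assumes the Ray engine.)
    ray.init()

    import modin.pandas as mpd

    df = mpd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")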

    I investigated how long it took both Pandas and Modin to load CSV files of various sizes on a system with four cores. Here is the graph of the results for CSV files from 5MB to 100MB:

    [Graph: Pandas/Modin CSV read times for files from 5MB to 100MB]

    And for files up to 2GB:

    [Graph: Pandas/Modin CSV read times for files up to 2GB]

    The results show that, on the system tested, Pandas was faster only on the very first (5MB) read; from 10MB upwards Modin was consistently faster, and the gap widened to roughly 2x for the largest files.

    Here is the code used to generate the results:

    from pathlib import Path
    from timeit import timeit

    import modin.pandas as mpd
    import pandas as pd

    def create_input_file(filename, content, repetitions):
        # Build a test file by repeating the base content, skipping files
        # that already exist from a previous run.
        path = Path(filename)
        if not path.exists():
            with path.open("a", encoding="utf-8") as f:
                for _ in range(repetitions):
                    f.write(content)

    def create_input_files(min_size, max_size, increment):
        # Generate CSV files of each target size by duplicating the
        # original survey file.
        content = Path("survey.csv").read_text(encoding="utf-8")
        for size in range(min_size, max_size + 1, increment):
            create_input_file(
                filename="survey{}MB.csv".format(size),
                content=content,
                repetitions=size // 5,  # the base file is 5MB
            )

    def time_csv_read(module, filename, description):
        # Time a single read_csv call with the given module
        # (pandas or modin.pandas).
        print(
            "{}: {:.2f} seconds".format(
                description,
                timeit(lambda: module.read_csv(filename), number=1)
            )
        )

    def time_csv_reads(min_size, max_size, increment):
        for size in range(min_size, max_size + 1, increment):
            time_csv_read(pd, "survey{}MB.csv".format(size), "Pandas {}MB".format(size))
            time_csv_read(mpd, "survey{}MB.csv".format(size), "Modin {}MB".format(size))

    def main():
        # 5MB to 95MB in 5MB steps, then 100MB to 2000MB in 100MB steps.
        min_size1 = 5
        max_size1 = 95
        increment1 = 5
        min_size2 = 100
        max_size2 = 2000
        increment2 = 100
        create_input_files(min_size1, max_size1, increment1)
        create_input_files(min_size2, max_size2, increment2)
        time_csv_reads(min_size1, max_size1, increment1)
        time_csv_reads(min_size2, max_size2, increment2)

    if __name__ == "__main__":
        main()
    

    And here is the raw output (with warning messages removed):

    Pandas 5MB: 0.12 seconds
    Modin 5MB: 0.23 seconds
    Pandas 10MB: 0.13 seconds
    Modin 10MB: 0.12 seconds
    Pandas 15MB: 0.19 seconds
    Modin 15MB: 0.16 seconds
    Pandas 20MB: 0.24 seconds
    Modin 20MB: 0.20 seconds
    Pandas 25MB: 0.31 seconds
    Modin 25MB: 0.25 seconds
    Pandas 30MB: 0.37 seconds
    Modin 30MB: 0.29 seconds
    Pandas 35MB: 0.40 seconds
    Modin 35MB: 0.34 seconds
    Pandas 40MB: 0.45 seconds
    Modin 40MB: 0.37 seconds
    Pandas 45MB: 0.51 seconds
    Modin 45MB: 0.42 seconds
    Pandas 50MB: 0.55 seconds
    Modin 50MB: 0.46 seconds
    Pandas 55MB: 0.62 seconds
    Modin 55MB: 0.50 seconds
    Pandas 60MB: 0.67 seconds
    Modin 60MB: 0.53 seconds
    Pandas 65MB: 0.74 seconds
    Modin 65MB: 0.57 seconds
    Pandas 70MB: 0.76 seconds
    Modin 70MB: 0.61 seconds
    Pandas 75MB: 0.87 seconds
    Modin 75MB: 0.65 seconds
    Pandas 80MB: 0.90 seconds
    Modin 80MB: 0.67 seconds
    Pandas 85MB: 0.93 seconds
    Modin 85MB: 0.73 seconds
    Pandas 90MB: 0.97 seconds
    Modin 90MB: 0.74 seconds
    Pandas 95MB: 1.34 seconds
    Modin 95MB: 0.80 seconds
    Pandas 100MB: 1.11 seconds
    Modin 100MB: 0.83 seconds
    Pandas 200MB: 2.21 seconds
    Modin 200MB: 1.62 seconds
    Pandas 300MB: 3.28 seconds
    Modin 300MB: 2.40 seconds
    Pandas 400MB: 5.48 seconds
    Modin 400MB: 3.25 seconds
    Pandas 500MB: 8.61 seconds
    Modin 500MB: 3.92 seconds
    Pandas 600MB: 8.11 seconds
    Modin 600MB: 4.64 seconds
    Pandas 700MB: 9.48 seconds
    Modin 700MB: 5.70 seconds
    Pandas 800MB: 11.40 seconds
    Modin 800MB: 6.35 seconds
    Pandas 900MB: 12.63 seconds
    Modin 900MB: 7.17 seconds
    Pandas 1000MB: 13.59 seconds
    Modin 1000MB: 7.91 seconds
    Pandas 1100MB: 14.84 seconds
    Modin 1100MB: 8.63 seconds
    Pandas 1200MB: 17.27 seconds
    Modin 1200MB: 9.42 seconds
    Pandas 1300MB: 17.77 seconds
    Modin 1300MB: 10.22 seconds
    Pandas 1400MB: 19.38 seconds
    Modin 1400MB: 11.15 seconds
    Pandas 1500MB: 21.77 seconds
    Modin 1500MB: 11.98 seconds
    Pandas 1600MB: 26.79 seconds
    Modin 1600MB: 12.55 seconds
    Pandas 1700MB: 23.55 seconds
    Modin 1700MB: 13.66 seconds
    Pandas 1800MB: 26.41 seconds
    Modin 1800MB: 13.89 seconds
    Pandas 1900MB: 28.44 seconds
    Modin 1900MB: 15.15 seconds
    Pandas 2000MB: 30.58 seconds
    Modin 2000MB: 15.71 seconds
    

    The fact that Modin processed the 10MB file faster than the 5MB file suggested to me that Modin does some initialisation work the first time it runs, so I tested this theory by reading the same 5MB file multiple times. The first time took 0.28 seconds, and subsequent times all took 0.08 seconds. You should see a similar difference in performance if you run Modin multiple times in the same Python process.
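
    As a quick check, a sketch like this (reusing the 5MB file generated by the script above) reproduces that pattern within a single process:

    from timeit import timeit

    import modin.pandas as mpd

    # Read the same 5MB file several times in one process: the first run
    # includes Modin's one-off initialisation, the later runs do not.
    for run in range(1, 6):
        seconds = timeit(lambda: mpd.read_csv("survey5MB.csv"), number=1)
        print("Run {}: {:.2f} seconds".format(run, seconds))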

    This initialisation work is different from the type of overhead I was talking about in my comment on your question. There I meant the code that splits the work into chunks, sends the chunks to each processor, and pieces the results back together when the processors are finished. That kind of overhead occurs every time Modin reads a CSV file; the extra work Modin does the first time it runs must be something else.

    So once Modin has done its initialisation, it is worth using even for files as small as 5MB. For files smaller than that, the per-read overhead will likely become a factor, but it would take more investigation to find out how small a file has to be before that overhead outweighs the parallel speed-up; a sketch of such an investigation follows.
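
    Here is that sketch. The row counts are hypothetical, and it assumes the original 5MB survey.csv is in the working directory; it times warmed-up reads of progressively smaller files to see where the per-read overhead starts to dominate:

    from pathlib import Path
    from timeit import timeit

    import modin.pandas as mpd
    import pandas as pd

    # Build tiny CSV files by repeating one data row from the 5MB base file.
    lines = Path("survey.csv").read_text(encoding="utf-8").splitlines(keepends=True)
    header, row = lines[0], lines[1]

    # Warm-up read, so Modin's initialisation is excluded from the timings.
    mpd.read_csv("survey.csv")

    for rows in (100, 1000, 10000, 100000):
        path = Path("tiny{}rows.csv".format(rows))
        if not path.exists():
            path.write_text(header + row * rows, encoding="utf-8")
        for name, module in (("Pandas", pd), ("Modin", mpd)):
            seconds = timeit(lambda: module.read_csv(str(path)), number=1)
            print("{} {} rows: {:.3f} seconds".format(name, rows, seconds))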