pythonpandascsvparallel-processingjoblib

Read multiple csv files in a pandas dataframe, in parallel


This question develops on the preceding one:

Read multiple csv files in a pandas dataframe

Basically, I have a set of files like:

file 1:

<empty line>
#-----------------------------------------
#    foo  bar  baz  
#-----------------------------------------
    0.0120932       1.10166       1.08745 
    0.0127890       1.10105       1.08773 
    0.0142051       1.09941       1.08760 
    0.0162801       1.09662       1.08548 
    0.0197376       1.09170       1.08015 

file 2:

<empty line>
#-----------------------------------------
#    foo  bar  baz  
#-----------------------------------------
    0.888085      0.768590      0.747961
    0.893782      0.781607      0.760417
    0.899830      0.797021      0.771219
    0.899266      0.799260      0.765859
    0.891489      0.781255      0.728892

etc. Each file is identified by an ID, and there's a ID to file mapping:

files = {'A': 'A.csv', 'B': 'B.csv'}

Thanks to the other answer, I can read the files serially:

columns = ['foo', 'bar', 'baz']
skip = 4
df = (pd.concat({k: pd.read_csv(v, skiprows=skip, sep=r'\s+', names=names)
                 for k,v in files.items()},
                names=['ID'])
        .reset_index('ID')
        .reset_index(drop=True)
    )

However, I would like to read them in parallel, to take advantage of my multicore machine. A naive attempt doesn't work:

from joblib import Parallel, delayed
from multiprocessing import cpu_count

n_jobs = cpu_count()


def read_file(res_dict: dict,
                          skiprows: int,
                          columns: list[str],
                          id: str,
                          file: Path
                          ) -> None:
    res_dict[id] = pd.read_csv(file, skiprows=skiprows, sep=r'\s+', names=columns)


temp = {}
temp = Parallel(n_jobs)(delayed(read_file)(temp, skip_rows, columns, id, file) for id, file in master2file.items())
df = (pd.concat(temp,
                names=['ID'])
      .reset_index('ID')
      .reset_index(drop=True)
      )

I get the error

Traceback (most recent call last):
  File "/home/...py", line 54, in <module>
    df = (pd.concat(temp,
  File "/home/../.venv/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 372, in concat
    op = _Concatenator(
  File "/home/../.venv/lib/python3.10/site-packages/pandas/core/reshape/concat.py", line 452, in __init__
    raise ValueError("All objects passed were None")
ValueError: All objects passed were None

Process finished with exit code 1

What am I doing wrong? Can you help me?


Solution

  • Here is a functional version of your code. I didn't use a dictionary but returned the dataframes directly (since you already have the IDs as keys of master2file). I passed the keys as parameter to concat instead of using a dictionary.

    from joblib import Parallel, delayed
    from multiprocessing import cpu_count
    
    n_jobs = cpu_count()
    
    master2file = {'A': 'A.csv', 'B': 'B.csv'}
    skip_rows = 4
    columns = ['foo', 'bar', 'baz']
    
    def read_file(skiprows: int,
                  columns: list[str],
                  id: str,
                  file: Path
                  ) -> None:
        return pd.read_csv(file, skiprows=skiprows, sep=r'\s+', names=columns)
    
    
    temp = Parallel(n_jobs)(delayed(read_file)(skip_rows, columns, id, file)
                                               for id, file in master2file.items())
    df = (pd.concat(temp,
                    keys=list(master2file),
                    names=['ID'])
          .reset_index('ID')
          .reset_index(drop=True)
          )
    

    Output:

      ID       foo       bar       baz
    0  A  0.012093  1.101660  1.087450
    1  A  0.012789  1.101050  1.087730
    2  A  0.014205  1.099410  1.087600
    3  A  0.016280  1.096620  1.085480
    4  A  0.019738  1.091700  1.080150
    5  B  0.888085  0.768590  0.747961
    6  B  0.893782  0.781607  0.760417
    7  B  0.899830  0.797021  0.771219
    8  B  0.899266  0.799260  0.765859
    9  B  0.891489  0.781255  0.728892