Tags: python, dask, fsspec

Dask (fsspec) reading and concatenating csv files of the same name from multiple zip files


I have multiple zip files, each of which contains csv files of the same name, e.g. foo1.zip and foo2.zip each containing a.csv, b.csv and c.csv.

I am looking to read each csv file of a given name from each zip file and concatenate them, so all the a files would be concatenated, all the b files concatenated, and so on. (a, b and c will later be merged - I am using Dask as the final file won't fit in memory.)

I have tried the approach from the answer here, but swapped the wildcard from the file name to the zip file name:

import dask.dataframe as dd

dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False)
result = dfa.compute()
print(result)

However, this only loads the a.csv file from the first zip. I have also tried:

dfa = dd.read_csv("foo*.zip::a.csv",delimiter=";", header=0,index_col=False)

but that seems to read every csv, regardless of the name.

Can somebody please tell me what I'm doing wrong here? Thanks

Expecting dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False) to open and concatenate all 'a.csv' files from all zip files matching foo*.zip.

Result: Only the first a.csv file was returned.


Solution

  • fsspec only supports performing the glob on the innermost filesystem of a chained URL (inside a single ZIP, in this case), rather than across the outer filesystems, so a wildcard cannot expand to multiple zip archives.

    Furthermore, a single call to read_csv always produces just one filesystem instance, so both of your paths are being interpreted in the context of a single zip file, as you have noticed. This was a design decision in early Dask, so that tasks could be tokenised on the basis of the one filesystem in play.
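
    You can see the distinction with fsspec itself. A minimal sketch, assuming a local foo1.zip as in the example above (fsspec.open_files and zip:// URL chaining are documented fsspec features):

    import fsspec

    # Globbing *inside* a single archive works: this lists every CSV in foo1.zip.
    inner = fsspec.open_files("zip://*.csv::foo1.zip")
    print([f.path for f in inner])

    # The outer archive name, by contrast, must be concrete: a wildcard there
    # is resolved against a single filesystem instance and will not fan out
    # across several zips.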

    To do the task, you will need to go manual:

    dfs = [dd.read_csv(fn, ...) for fn in ("zip://a.csv::foo1.zip", "zip://a.csv::foo2.zip")]
    ddf = dd.concat(dfs)
    result = ddf.compute()
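
    If the set of archives is not fixed in advance, the list of paths can be built first with an ordinary glob on the outer (local) filesystem. A sketch under that assumption, reusing the question's read_csv arguments:

    from glob import glob

    import dask.dataframe as dd

    # Expand the outer wildcard ourselves, then build one lazy dataframe
    # per archive and concatenate them before computing.
    zip_paths = sorted(glob("foo*.zip"))
    dfs = [
        dd.read_csv(f"zip://a.csv::{path}", delimiter=";", header=0, index_col=False)
        for path in zip_paths
    ]
    ddf = dd.concat(dfs)
    result = ddf.compute()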