Tags: python, dask, fsspec

Dask (fsspec) reading and concatenating csv files of the same name from multiple zip files


I have multiple zip files, each of which contains csv files of the same name, e.g. foo1.zip and foo2.zip each containing a.csv, b.csv and c.csv.

I am looking to read each csv file of a given name from each zip file and concatenate them, so all the a files would be concatenated, all the b files concatenated, and so on. (a, b and c will later be merged - I am using Dask as the final file won't fit in memory.)

I have tried the approach from the answer here, but swapped the wildcard from the file name to the zip file name:

import dask.dataframe as dd

dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False)
result = dfa.compute()
print(result)

However, this only loads the a.csv file from the first zip. I have also tried:

dfa = dd.read_csv("foo*.zip::a.csv",delimiter=";", header=0,index_col=False)

but that seems to read every csv, regardless of the name.

Can somebody please tell me what I'm doing wrong here? Thanks

Expecting dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False) to open and concatenate all 'a.csv' files from all zip files matching foo*.zip.

Result: Only the first a.csv file was returned.


Solution

  • fsspec only supports performing the glob on the innermost filesystem of a chained URL (inside a single ZIP, in this case), rather than across the outer filesystems, so a wildcard cannot expand to multiple zip archives.

    Furthermore, a single call to read_csv always produces just one filesystem instance, so both of your paths are being interpreted in the context of a single zip file, as you have noticed. This was a design decision in early Dask, so that tasks could be tokenised on the basis of the one filesystem in play.
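
    You can see the distinction with fsspec itself. A minimal sketch, assuming a local foo1.zip as in the example above (fsspec.open_files and zip:// URL chaining are documented fsspec features):

    import fsspec

    # Globbing *inside* a single archive works: this lists every CSV in foo1.zip.
    inner = fsspec.open_files("zip://*.csv::foo1.zip")
    print([f.path for f in inner])

    # The outer archive name, by contrast, must be concrete: a wildcard there
    # is resolved against a single filesystem instance and will not fan out
    # across several zips.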

    To do the task, you will need to go manual:

    dfs = [dd.read_csv(fn, ...) for fn in ("zip://a.csv::foo1.zip", "zip://a.csv::foo2.zip")]
    ddf = dd.concat(dfs)
    result = ddf.compute()
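
    If the set of archives is not fixed in advance, the list of paths can be built first with an ordinary glob on the outer (local) filesystem. A sketch under that assumption, reusing the question's read_csv arguments:

    from glob import glob

    import dask.dataframe as dd

    # Expand the outer wildcard ourselves, then build one lazy dataframe
    # per archive and concatenate them before computing.
    zip_paths = sorted(glob("foo*.zip"))
    dfs = [
        dd.read_csv(f"zip://a.csv::{path}", delimiter=";", header=0, index_col=False)
        for path in zip_paths
    ]
    ddf = dd.concat(dfs)
    result = ddf.compute()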