I have multiple zip files, each of which contains CSV files with the same names, e.g. foo1.zip and foo2.zip each contain a.csv, b.csv and c.csv.
I am looking to read each CSV file of a given name from every zip file and concatenate them, so all the a.csv files would be concatenated, all the b.csv files concatenated, and so on. (a, b and c will later be merged - I am using Dask as the final file won't fit in memory.)
I have tried using the answer here, but swapping the wildcard from the file name to the zip file name:
import dask.dataframe as dd

dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False)
result = dfa.compute()
print(result)
However, this only loads a.csv from the first zip. I have also tried:
dfa = dd.read_csv("foo*.zip::a.csv", delimiter=";", header=0, index_col=False)
but that seems to read every CSV, regardless of the name.
Can somebody please tell me what I'm doing wrong here? Thanks
I expected
dfa = dd.read_csv('zip://a.csv::foo*.zip', delimiter=";", header=0, index_col=False)
to open and concatenate all a.csv files from all zip files matching foo*.zip.
Result: only the a.csv from the first zip was returned.
fsspec only supports performing the glob on the innermost filesystem (within a ZIP, in this case), rather than across multiple possible filesystems.
Furthermore, a single call to read_csv will always produce just one filesystem, so both of your paths are being interpreted in the context of a single zip file, as you have noticed. This was a design decision in early Dask, so that tasks could be tokenised on the basis of the one filesystem in play.
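You can check what fsspec actually does with the chained URL by expanding it directly (a quick illustration; it assumes foo1.zip and foo2.zip are in the working directory):

import fsspec

# The glob is not fanned out across archives, so this yields a single
# OpenFile (matching the behaviour you saw: only the first zip's a.csv).
files = fsspec.open_files("zip://a.csv::foo*.zip")
print([f.path for f in files])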
To do the task, you will need to go manual:
import dask.dataframe as dd

dfs = [dd.read_csv(fn, delimiter=";", header=0, index_col=False)
       for fn in ("zip://a.csv::foo1.zip", "zip://a.csv::foo2.zip")]
ddf = dd.concat(dfs)
result = ddf.compute()
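If the zip file names aren't known up front, you can glob for them on the local filesystem yourself and build the chained URLs (a sketch; the foo*.zip pattern and the read_csv keywords are taken from the question):

import glob
import dask.dataframe as dd

# Expand the wildcard manually, since fsspec won't do it across archives,
# then read the same-named member from each zip and concatenate lazily.
paths = [f"zip://a.csv::{fn}" for fn in sorted(glob.glob("foo*.zip"))]
dfa = dd.concat([dd.read_csv(p, delimiter=";", header=0) for p in paths])

The same loop works for b.csv and c.csv, and everything stays lazy until you call compute().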