Tags: python, python-polars, pydrive, fsspec, pydrive2

How to use a glob pattern to read many CSVs into one Polars data frame with the pydrive2 fsspec file system?


If I have two .csv files stored locally, data/file_1.csv and data/file_2.csv, which both have the same schema, it is easy to read both of them with Polars into one concatenated data frame like so:

pl.read_csv('data/file_*.csv')

But if I store these same two files in Google Drive (not a GCS bucket) and use GDriveFileSystem from pydrive2.fs as my fsspec file system, I cannot find a way to use the glob pattern and have to read the files separately, e.g.

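import polars as pl
from pydrive2.fs import GDriveFileSystem

fs = GDriveFileSystem(ROOT_FOLDER_ID, client_id=CLIENT_ID, client_secret=CLIENT_SECRET)

dfs = []
for i in range(1, 3):
    with fs.open(f'{ROOT_FOLDER_ID}/data/file_{i}.csv', 'rb') as f:
        # append the DataFrame itself; `dfs += ...` would extend the list with its columns
        dfs.append(pl.read_csv(f))
df = pl.concat(dfs)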

Not only does this mean I need to know and specify the number of files and their exact paths in advance, but the code also feels a lot less clean than before.

Is there any way I can still read these multiple files with a glob path but using the fsspec file system?


Solution

  • Although pydrive2 has an fsspec interface, it doesn't seem to declare a protocol or register itself with fsspec, so calls like fsspec.open("gdrive://...") are not automatically recognised. Opening by protocol is the intended usage for fsspec file systems, so I suggest raising an issue with the PyDrive2 maintainers to get this implemented. In the meantime, you could call fsspec.register_implementation manually to assign a protocol to the PyDrive2 fsspec class.

    The older, less complete and unreleased gdrivefs does support this usage, because it is listed in fsspec.registry.known_implementations.
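
A minimal sketch of the manual registration, plus a way to glob on the file system object directly. It rests on assumptions that are not confirmed above: the protocol name "gdrive" is an arbitrary choice, and GDriveFileSystem is assumed to inherit a working glob method from fsspec's AbstractFileSystem. The helper names register_gdrive_protocol, iter_matching_files and read_csv_glob are hypothetical.

```python
def register_gdrive_protocol(name="gdrive"):
    """Assign a protocol name to PyDrive2's fsspec class (the name is an assumption)."""
    import fsspec
    from pydrive2.fs import GDriveFileSystem

    # clobber=True replaces any class already registered under this name
    fsspec.register_implementation(name, GDriveFileSystem, clobber=True)


def iter_matching_files(fs, pattern, mode="rb"):
    """Yield (path, open file) for every path on `fs` that matches `pattern`.

    `fs` can be any object with fsspec-style `glob` and `open` methods.
    Each file is closed when the generator moves on to the next match.
    """
    for path in sorted(fs.glob(pattern)):
        with fs.open(path, mode) as f:
            yield path, f


def read_csv_glob(fs, pattern):
    """Read every CSV matching `pattern` and concatenate into one DataFrame."""
    import polars as pl  # imported lazily so the helper above stays dependency-free

    frames = [pl.read_csv(f) for _, f in iter_matching_files(fs, pattern)]
    return pl.concat(frames)
```

With the fs object from the question, this might be called as read_csv_glob(fs, f'{ROOT_FOLDER_ID}/data/file_*.csv'), which removes the need to know the number of files in advance. Whether pl.read_csv accepts a registered "gdrive://" glob path directly is not something this sketch relies on.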