If I have 2 .csv files stored locally, data/file_1.csv and data/file_2.csv, which both have the same schema, it is easy to read both of them with polars into one concatenated data frame like so:
pl.read_csv('data/file_*.csv')
But if I am storing these same 2 files within Google Drive (not a GCS bucket), and I am using GDriveFileSystem from pydrive2.fs as my fsspec file system, I cannot find a way to make use of the glob pattern and have to read them in separately, e.g.
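import polars as pl
from pydrive2.fs import GDriveFileSystem

fs = GDriveFileSystem(ROOT_FOLDER_ID, client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
dfs = []
for i in range(1, 3):
    # Open each file through the fsspec file system and parse it with polars
    with fs.open(f'{ROOT_FOLDER_ID}/data/file_{i}.csv', 'rb') as f:
        dfs.append(pl.read_csv(f))
df = pl.concat(dfs)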
Not only does this mean I need to know and specify the number of files and their exact file paths in advance, but the code also feels a lot less clean than before.
Is there any way I can still read these multiple files with a glob path but using the fsspec file system?
Although pydrive2 has an fsspec interface, it doesn't seem to declare a protocol or register itself with fsspec, so calls like fsspec.open("gdrive://...") are not automatically recognised. URL-style access is the intended way to use fsspec backends, so I suggest raising an issue with the pydrive2 maintainers to make sure this gets implemented. In the meantime, you could call fsspec.register_implementation manually to assign a protocol to the PyDrive2 fsspec class.
The older, less complete and unreleased gdrivefs does support this usage, because it is listed in fsspec.registry.known_implementations.
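Independently of protocol registration, every fsspec file system inherits a glob method, so a possible workaround is to expand the pattern on the file system object yourself and read each match. This is only a sketch: read_glob is a hypothetical helper, and it assumes the PyDrive2 implementation lists folders well enough for glob to work.

```python
def read_glob(fs, pattern, read_one):
    """Open every path on `fs` matching `pattern` and parse it with `read_one`.

    `fs` is any fsspec-style file system (needs .glob and .open);
    `read_one` is a callable taking an open binary file, e.g. pl.read_csv.
    """
    paths = sorted(fs.glob(pattern))  # sorted for a deterministic order
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = []
    for path in paths:
        with fs.open(path, "rb") as f:
            frames.append(read_one(f))
    return frames


# Intended usage with the objects from the question (not executed here):
#   dfs = read_glob(fs, f"{ROOT_FOLDER_ID}/data/file_*.csv", pl.read_csv)
#   df = pl.concat(dfs)
```

This removes the need to know the number of files or their exact names in advance, at the cost of one extra helper.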