I’m trying to stream data from parquet files stored in Dropbox (though they could live elsewhere: S3, Google Drive, etc.) and read them with pandas, while caching the data. For that I’m trying to use fsspec for Python.
Following these instructions, this is what I’m trying right now:
from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.implementations.cached import CachingFileSystem
import pandas as pd
cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
cfs_arrow = ArrowFSWrapper(cfs)
url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"
f = cfs_arrow.open(url, "rb")
df = pd.read_parquet(f)
but this raises the following error at cfs_arrow.open(url, "rb"):
AttributeError: type object 'HTTPFileSystem' has no attribute 'open_input_stream'
I’ve used fsspec’s CachingFileSystem before to stream hdf5 data from S3, so I presumed it would work out of the box, but I’m probably doing something wrong.
Can someone help me with that? Or suggest another way to accomplish the goal of streaming my tabular data while keeping a cache for fast later access in the same session?
The convenient way to open and pass a file-like object using fsspec alone would be:
import fsspec
import pandas as pd

with fsspec.open(
    "blockcache::https://www.dropbox.com/s/…./myfile.parquet?dl=0",
    blockcache={"cache_storage": "cache_fs"},
) as f:
    df = pd.read_parquet(f)
Of course, instantiating your own filesystem instance is fine too. You may be interested to know that there is a Dropbox backend for fsspec as well, useful for finding and manipulating files. Finally, there is an fsspec.parquet module for optimising parquet access when you need only some of the row-groups or columns of the target file.