Tags: python, parquet, pyarrow, data-stream, fsspec

Streaming and caching tabular data with fsspec, Parquet and PyArrow


I’m trying to stream data from parquet files stored in Dropbox (though they could live elsewhere: S3, Google Drive, etc.), read it into Pandas, and cache it along the way. For that I’m trying to use fsspec for Python.

Following these instructions, this is what I’m trying right now:

from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.implementations.cached import CachingFileSystem
import pandas as pd

# cache HTTP reads in the local "cache_fs" directory
cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
cfs_arrow = ArrowFSWrapper(cfs)

url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"
f = cfs_arrow.open(url, "rb")  # raises the AttributeError below
df = pd.read_parquet(f)

but this raises the following error at cfs_arrow.open(url, "rb"):

AttributeError: type object 'HTTPFileSystem' has no attribute 'open_input_stream'

I’ve used fsspec’s CachingFileSystem before to stream HDF5 data from S3, so I presumed it would work out of the box, but I’m probably doing something wrong.

Can someone help me with that? Or other suggestions on how to accomplish the goal of streaming my tabular data while keeping a cache for fast later access in the same session?


Solution

  • ArrowFSWrapper is meant for the opposite direction: it wraps a pyarrow.fs filesystem so that it exposes the fsspec API, which is why the fsspec HTTPFileSystem underneath has no open_input_stream for Arrow to call. The convenient way to open and pass a file-like object using fsspec alone would be

    import fsspec
    import pandas as pd

    with fsspec.open(
        "blockcache::https://www.dropbox.com/s/…./myfile.parquet?dl=0",
        blockcache={"cache_storage": "cache_fs"},
    ) as f:
        df = pd.read_parquet(f)
    
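    Of course, instantiating your own filesystem instance is fine too. A minimal sketch of the equivalent setup (CachingFileSystem is the class behind the "blockcache" protocol; the URL is the same placeholder as above):

    import pandas as pd
    from fsspec.implementations.cached import CachingFileSystem
    from fsspec.implementations.http import HTTPFileSystem

    # cache blocks of the HTTP reads in the local "cache_fs" directory
    fs = CachingFileSystem(fs=HTTPFileSystem(), cache_storage="cache_fs")

    with fs.open("https://www.dropbox.com/s/…./myfile.parquet?dl=0", "rb") as f:
        df = pd.read_parquet(f)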

    You may be interested to know that there is a Dropbox backend for fsspec too, useful for finding and manipulating files. Also, there is an fsspec.parquet module for optimising parquet access when you need only some of the row groups or columns of the target.
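    For the latter, a minimal sketch (the column names here are hypothetical; open_parquet_file fetches only the byte ranges needed for the requested columns and row groups):

    import pandas as pd
    import fsspec.parquet

    url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"

    # transfer only the footer plus the byte ranges for the selected columns
    with fsspec.parquet.open_parquet_file(url, columns=["col1", "col2"]) as f:
        df = pd.read_parquet(f, columns=["col1", "col2"])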