python, python-polars, read-csv

Contradictory error when using Polars read_csv() with multiple files for csv.gz


I'm trying to read multiple csv.gz files into a dataframe but it's not working as I expect.

When I use this globbing pattern:

pl.read_csv('folder_1\*.csv.gz')

It returns this error:

ComputeError: cannot scan compressed csv; use read_csv for compressed data

This error occurred with the following context stack: >[1] 'csv scan' failed [2] 'select' input failed to resolve

Which is strange considering I'm using the very function they suggest. However, passing this globbing pattern for csv works completely fine:

pl.read_csv('folder_1\*.csv')

How can I get around this? I'm currently just using glob.glob() and iterating through the list, but I thought it would look neater without it.


Solution

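  • There are two separate questions here: does pl.read_csv accept a glob pattern, and does it read compressed files? Taken one at a time, the answer to each is yes.

    When a glob string such as blah/blah/blah/*.csv.gz is passed to pl.read_csv, it hands the string off to pl.scan_csv because it is a glob. See polars.io.csv.functions, line 514 et seq. in version 1.1.0.

    But put the two questions together and the combination fails: pl.scan_csv does not support compressed files at all, which is why the error mentions a scan and, confusingly, suggests read_csv. This is an open issue.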

    If you want a one liner for reading your CSVs, you will have to fall back on something like a list comprehension with eager execution:

    from glob import glob
    import polars as pl
    # Read each compressed file eagerly, one DataFrame per file
    dfs = [pl.read_csv(path) for path in glob('*.csv.gz')]
    

    Then do what you will with the resulting list of DataFrames (e.g. combine them with pl.concat).