pythonpython-polarspolars

How to specify relevant columns with read_excel


As far as I can tell, the following MRE conforms to the relevant documentation:

import polars


df = polars.read_excel(
    "/Volumes/Spare/foo.xlsx",
    engine="calamine",
    sheet_name="natsav",
    read_options={"header_row": 2},
    columns=(1,2,4,5,6,7), # columns 0 and 3 are not needed
)

print(df.head())

The issue here is that the documentation states that for the columns parameter:

Columns to read from the sheet; if not specified, all columns are read. Can be given as a sequence of column names or indices.

Clearly, a tuple is a sequence. However, running this code results in an exception as follows:

_fastexcel.InvalidParametersError: invalid parameters: `use_columns` callable could not be called (TypeError: 'tuple' object is not callable)

Further research reveals that the required callable should return bool. So:

def colspec(c):
    print(type(c))
    return True

I then change the read_excel call to include columns=colspec.

The program now runs without exception and reveals that the parameter passed is a class of type builtins.ColumnInfoNoDtype.

Unfortunately, I can't find any documentation for that type.

Is the documentation wrong? How is one supposed to used polars.read_excel to load only certain specific columns?


Solution

  • When you use Calamine, which is the default engine and which you've specified explicitly, the docs say (emphasis mine):

    this engine can be used for reading all major types of Excel Workbook (.xlsx, .xlsb, .xls) and is dramatically faster than the other options, using the fastexcel module to bind the Rust-based Calamine parser.

    This corresponds with what the error message tells you:

    _fastexcel.InvalidParametersError: invalid parameters: `use_columns` callable could not be called (TypeError: 'tuple' object is not callable)
    ^^^^^^^^^^                                              ^^^^^^^^^^^
    

    This isn't an error from read_excel itself; the columns argument is being passed through to fastexcel. Looking at fastexcel's docs, we can see the use_columns parameter defined as:

        use_columns: Union[list[str], list[int], str, Callable[[ColumnInfoNoDtype], bool], NoneType] = None,
    

    (This is also where the ColumnInfoNoDtype type is defined, it's again not part of polars itself.)

    and described as:

    Specifies the columns to use. Can either be:

    • None to select all columns
    • A list of strings and ints, the column names and/or indices (starting at 0)
    • A string, a comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”, which would result in A,B,C,D,E and A,C,E,F)
    • A callable, a function that takes a column and returns a boolean

    Here we can see that, as well as the Callable[[ColumnInfoNoDtype], bool] type your input is apparently being interpreted as, there are list[int] and list[str] options - not Sequence, which would include tuple.

    Presumably the issue in the polars docs is that how exactly these things are going to be interpreted depends on which engine they're going to - they're trying to describe the general inputs, but in this case it seems that the engine is looking for a list specifically.