As far as I can tell, the following MRE conforms to the relevant documentation:
import polars
df = polars.read_excel(
"/Volumes/Spare/foo.xlsx",
engine="calamine",
sheet_name="natsav",
read_options={"header_row": 2},
columns=(1,2,4,5,6,7), # columns 0 and 3 are not needed
)
print(df.head())
The issue here is that the documentation states that for the columns
parameter:
Columns to read from the sheet; if not specified, all columns are read. Can be given as a sequence of column names or indices.
Clearly, a tuple is a sequence. However, running this code results in an exception as follows:
_fastexcel.InvalidParametersError: invalid parameters: `use_columns` callable could not be called (TypeError: 'tuple' object is not callable)
Further research reveals that the required callable should return bool
. So:
def colspec(c):
print(type(c))
return True
I then change the read_excel
call to include columns=colspec
.
The program now runs without exception and reveals that the parameter passed is a class of type builtins.ColumnInfoNoDtype
.
Unfortunately, I can't find any documentation for that type.
Is the documentation wrong? How is one supposed to used polars.read_excel
to load only certain specific columns?
When you use Calamine, which is the default engine and which you've specified explicitly, the docs say (emphasis mine):
this engine can be used for reading all major types of Excel Workbook (
.xlsx
,.xlsb
,.xls
) and is dramatically faster than the other options, using thefastexcel
module to bind the Rust-based Calamine parser.
This corresponds with what the error message tells you:
_fastexcel.InvalidParametersError: invalid parameters: `use_columns` callable could not be called (TypeError: 'tuple' object is not callable)
^^^^^^^^^^ ^^^^^^^^^^^
This isn't an error from read_excel
itself; the columns
argument is being passed through to fastexcel
. Looking at fastexcel
's docs, we can see the use_columns
parameter defined as:
use_columns: Union[list[str], list[int], str, Callable[[ColumnInfoNoDtype], bool], NoneType] = None,
(This is also where the ColumnInfoNoDtype
type is defined, it's again not part of polars
itself.)
and described as:
Specifies the columns to use. Can either be:
None
to select all columns- A list of strings and ints, the column names and/or indices (starting at 0)
- A string, a comma separated list of Excel column letters and column ranges (e.g.
“A:E”
or“A,C,E:F”
, which would result inA,B,C,D,E
andA,C,E,F
)- A callable, a function that takes a column and returns a boolean
Here we can see that, as well as the Callable[[ColumnInfoNoDtype], bool]
type your input is apparently being interpreted as, there are list[int]
and list[str]
options - not Sequence
, which would include tuple
.
Presumably the issue in the polars
docs is that how exactly these things are going to be interpreted depends on which engine they're going to - they're trying to describe the general inputs, but in this case it seems that the engine is looking for a list specifically.