In my application, I want to generate previews of user-uploaded spreadsheet files in a memory-efficient manner. I'm testing the following script with pytest-memray:
import io

import pandas as pd
import polars as pl
import pytest

@pytest.fixture
def path() -> str:
    return "files/rows.csv"

@pytest.fixture
def path_xlsx() -> str:
    return "files/bankdataset.xlsx"

def test_load_pandas(path: str):
    pd.read_csv(path)

def test_load_polars(path: str):
    pl.scan_csv(path, low_memory=True).collect()

def test_load_pandas_xlsx(path_xlsx: str):
    pd.read_excel(path_xlsx, sheet_name=None)

def test_load_polars_xlsx(path_xlsx: str):
    pl.read_excel(path_xlsx, sheet_name=None)

def test_load_pandas_partial(path: str):
    df = pd.read_csv(path, nrows=20)
    assert len(df) == 20

def test_load_polars_partial(path: str):
    df = pl.scan_csv(path).head(n=20).collect()
    assert len(df) == 20

def test_load_pandas_partial_xlsx(path_xlsx: str):
    df = pd.read_excel(path_xlsx, nrows=20)
    assert len(df) == 20

def test_load_polars_partial_xlsx(path_xlsx: str):
    df = pl.read_excel(path_xlsx).head(n=20)
    assert len(df) == 20

def test_load_polars_partial_buffer(path: str):
    with io.BytesIO() as buffer:
        pl.scan_csv(path).limit(20).sink_csv(buffer)
        df = pl.read_csv(buffer.getvalue())
        assert len(df) == 20
files/rows.csv has around 30k rows, and files/bankdataset.xlsx has 1 million rows.
This outputs:
================================================================================================================ MEMRAY REPORT ================================================================================================================
Allocation results for tests/test_load.py::test_load_polars_xlsx at the high watermark
📦 Total memory allocated: 473.5MiB
📏 Total allocations: 21
📊 Histogram of allocation sizes: |▁█▆ |
🥇 Biggest allocating functions:
- load_sheet_eager:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:424 -> 473.4MiB
- _call_with_frames_removed:<frozen importlib._bootstrap>:488 -> 20.4KiB
- read_excel:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:514 -> 13.9KiB
- _compile_bytecode:<frozen importlib._bootstrap_external>:784 -> 9.6KiB
- inner:/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/typing.py:429 -> 9.0KiB
Allocation results for tests/test_load.py::test_load_polars_partial_xlsx at the high watermark
📦 Total memory allocated: 473.4MiB
📏 Total allocations: 7
📊 Histogram of allocation sizes: |█ |
🥇 Biggest allocating functions:
- load_sheet_eager:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:424 -> 473.4MiB
- read_excel:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:514 -> 13.9KiB
- _read_spreadsheet:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/spreadsheet/functions.py:684 -> 536.0B
Allocation results for tests/test_load.py::test_load_pandas_xlsx at the high watermark
📦 Total memory allocated: 426.2MiB
📏 Total allocations: 871
📊 Histogram of allocation sizes: | █▂ |
🥇 Biggest allocating functions:
- feed:/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/xml/etree/ElementTree.py:1291 -> 245.1MiB
- parse_cell:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/openpyxl/worksheet/_reader.py:244 -> 63.0MiB
- maybe_infer_to_datetimelike:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/core/dtypes/cast.py:1198 -> 39.3MiB
- _rows_to_cols:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/python_parser.py:1066 -> 38.3MiB
- _infer_types:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/base_parser.py:720 -> 15.3MiB
Allocation results for tests/test_load.py::test_load_polars at the high watermark
📦 Total memory allocated: 172.5MiB
📏 Total allocations: 92
📊 Histogram of allocation sizes: |█ |
🥇 Biggest allocating functions:
- collect:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/lazyframe/frame.py:2332 -> 96.0B
Allocation results for tests/test_load.py::test_load_pandas at the high watermark
📦 Total memory allocated: 24.6MiB
📏 Total allocations: 25
📊 Histogram of allocation sizes: |█ ▁|
🥇 Biggest allocating functions:
- read:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/c_parser_wrapper.py:234 -> 18.6MiB
- __init__:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/c_parser_wrapper.py:93 -> 6.0MiB
- get_handle:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/common.py:873 -> 4.0KiB
- _clean_options:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/readers.py:1688 -> 1.5KiB
- read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/readers.py:1009 -> 1.5KiB
Allocation results for tests/test_load.py::test_load_polars_partial_buffer at the high watermark
📦 Total memory allocated: 31.5MiB
📏 Total allocations: 16
📊 Histogram of allocation sizes: |▁█ |
🥇 Biggest allocating functions:
- _check_empty:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/_utils.py:282 -> 1.3KiB
- read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/csv/functions.py:572 -> 768.0B
- read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/csv/functions.py:549 -> 728.0B
It appears that pandas beats polars on memory efficiency in every case. Am I doing something wrong? I'm trying to take advantage of LazyFrames with pl.scan_csv, but even that doesn't seem to help.
If the goal is just to preview the first N rows: the stdlib csv module and using openpyxl directly used the least memory for me.
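For the csv side, a minimal sketch of that approach (the fixture and the 20-row preview size are just carried over from your tests):

import csv
from itertools import islice

def test_load_stdlib_csv_partial(path: str):
    # csv.reader streams the file, so only the first 20 rows are ever parsed
    with open(path, newline="") as f:
        rows = list(islice(csv.reader(f), 20))
    assert len(rows) == 20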
With regards to the Excel comparisons: pandas read_excel uses openpyxl by default, and it breaks out of the read loop early when nrows is set:
if file_rows_needed is not None and len(data) >= file_rows_needed:
    break
Polars read_excel uses fastexcel as its default engine. fastexcel does have its own n_rows option, which you can set with pl.read_excel(..., read_options={"n_rows": 20}), but it did not change memory usage in my testing.
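For reference, the variant I tried looks like this (the test name is mine, not from your suite):

def test_load_polars_partial_xlsx_n_rows(path_xlsx: str):
    # n_rows is forwarded to fastexcel, so only 20 rows come back,
    # but the memray high watermark stayed essentially the same
    df = pl.read_excel(path_xlsx, read_options={"n_rows": 20})
    assert len(df) == 20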
Polars also supports engine="openpyxl", but from what I can see it always reads all of the data.
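For completeness, that call is simply:

def test_load_polars_openpyxl_xlsx(path_xlsx: str):
    # swapping the engine changes the reader, not how much is materialised
    pl.read_excel(path_xlsx, engine="openpyxl")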
openpyxl's iter_rows() does have a max_row= argument, which in my testing used slightly less memory than pandas' early break:
from openpyxl import load_workbook

def test_openpyxl_partial_max_row(path_xlsx: str):
    # read_only mode streams rows instead of building the full in-memory model
    default_kwargs = {"read_only": True, "data_only": True, "keep_links": False}
    wb = load_workbook(path_xlsx, **default_kwargs)
    for name in wb.sheetnames:
        ws = wb[name]
        # max_row=21 covers the header row plus a 20-row preview
        for row in ws.iter_rows(max_row=21):
            text = [cell.value for cell in row]