python python-polars

Polars is less memory efficient than pandas


In my application, I want to generate a preview of spreadsheet files that users upload in a memory-efficient manner.

I'm testing the following script using pytest-memray:

import io

import pytest
import pandas as pd
import polars as pl


@pytest.fixture
def path() -> str:
    return "files/rows.csv"


@pytest.fixture
def path_xlsx() -> str:
    return "files/bankdataset.xlsx"


def test_load_pandas(path: str):
    pd.read_csv(path)


def test_load_polars(path: str):
    pl.scan_csv(path, low_memory=True).collect()


def test_load_pandas_xlsx(path_xlsx: str):
    pd.read_excel(path_xlsx, sheet_name=None)


def test_load_polars_xlsx(path_xlsx: str):
    pl.read_excel(path_xlsx, sheet_name=None)


def test_load_pandas_partial(path: str):
    df = pd.read_csv(path, nrows=20)
    assert len(df) == 20


def test_load_polars_partial(path: str):
    df = pl.scan_csv(path).head(n=20).collect()
    assert len(df) == 20


def test_load_pandas_partial_xlsx(path_xlsx: str):
    df = pd.read_excel(path_xlsx, nrows=20)
    assert len(df) == 20


def test_load_polars_partial_xlsx(path_xlsx: str):
    df = pl.read_excel(path_xlsx).head(n=20)
    assert len(df) == 20


def test_load_polars_partial_buffer(path: str):
    with io.BytesIO() as buffer:
        pl.scan_csv(path).limit(20).sink_csv(buffer)
        df = pl.read_csv(buffer.getvalue())
        assert len(df) == 20

files/rows.csv is around 30k rows, and files/bankdataset.xlsx is 1 million rows.

This outputs:

================================================================================================================ MEMRAY REPORT ================================================================================================================
Allocation results for tests/test_load.py::test_load_polars_xlsx at the high watermark

         📦 Total memory allocated: 473.5MiB
         📏 Total allocations: 21
         📊 Histogram of allocation sizes: |▁█▆  |
         🥇 Biggest allocating functions:
                - load_sheet_eager:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:424 -> 473.4MiB
                - _call_with_frames_removed:<frozen importlib._bootstrap>:488 -> 20.4KiB
                - read_excel:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:514 -> 13.9KiB
                - _compile_bytecode:<frozen importlib._bootstrap_external>:784 -> 9.6KiB
                - inner:/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/typing.py:429 -> 9.0KiB


Allocation results for tests/test_load.py::test_load_polars_partial_xlsx at the high watermark

         📦 Total memory allocated: 473.4MiB
         📏 Total allocations: 7
         📊 Histogram of allocation sizes: |█  |
         🥇 Biggest allocating functions:
                - load_sheet_eager:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:424 -> 473.4MiB
                - read_excel:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/fastexcel/__init__.py:514 -> 13.9KiB
                - _read_spreadsheet:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/spreadsheet/functions.py:684 -> 536.0B


Allocation results for tests/test_load.py::test_load_pandas_xlsx at the high watermark

         📦 Total memory allocated: 426.2MiB
         📏 Total allocations: 871
         📊 Histogram of allocation sizes: | █▂  |
         🥇 Biggest allocating functions:
                - feed:/home/linuxbrew/.linuxbrew/opt/python@3.13/lib/python3.13/xml/etree/ElementTree.py:1291 -> 245.1MiB
                - parse_cell:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/openpyxl/worksheet/_reader.py:244 -> 63.0MiB
                - maybe_infer_to_datetimelike:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/core/dtypes/cast.py:1198 -> 39.3MiB
                - _rows_to_cols:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/python_parser.py:1066 -> 38.3MiB
                - _infer_types:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/base_parser.py:720 -> 15.3MiB


Allocation results for tests/test_load.py::test_load_polars at the high watermark

         📦 Total memory allocated: 172.5MiB
         📏 Total allocations: 92
         📊 Histogram of allocation sizes: |█ |
         🥇 Biggest allocating functions:
                - collect:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/lazyframe/frame.py:2332 -> 96.0B


Allocation results for tests/test_load.py::test_load_pandas at the high watermark

         📦 Total memory allocated: 24.6MiB
         📏 Total allocations: 25
         📊 Histogram of allocation sizes: |█   ▁|
         🥇 Biggest allocating functions:
                - read:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/c_parser_wrapper.py:234 -> 18.6MiB
                - __init__:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/c_parser_wrapper.py:93 -> 6.0MiB
                - get_handle:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/common.py:873 -> 4.0KiB
                - _clean_options:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/readers.py:1688 -> 1.5KiB
                - read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/pandas/io/parsers/readers.py:1009 -> 1.5KiB


Allocation results for tests/test_load.py::test_load_polars_partial_buffer at the high watermark

         📦 Total memory allocated: 31.5MiB
         📏 Total allocations: 16
         📊 Histogram of allocation sizes: |▁█   |
         🥇 Biggest allocating functions:
                - _check_empty:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/_utils.py:282 -> 1.3KiB
                - read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/csv/functions.py:572 -> 768.0B
                - read_csv:/home/monopoly/workspace/toys/sheetz/.venv/lib/python3.13/site-packages/polars/io/csv/functions.py:549 -> 728.0B

It appears that pandas beats polars on memory efficiency in every case. Is there something I'm doing wrong? I'm trying to take advantage of LazyFrames with pl.scan_csv, but even that doesn't seem to help.


Solution

  • If the goal is just to preview the "first N rows": the stdlib csv module and openpyxl used directly resulted in the least memory for me.
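
A minimal stdlib-only CSV preview along those lines might look like this (a sketch; `preview_csv` is an illustrative name, not part of any library):

```python
import csv
from itertools import islice


def preview_csv(path: str, n: int = 20) -> list[list[str]]:
    # csv.reader streams the file line by line, so only the first
    # n rows are ever materialized in memory.
    with open(path, newline="") as f:
        return list(islice(csv.reader(f), n))
```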

    Regarding the Excel comparisons:

    Pandas read_excel uses openpyxl by default, and it breaks out of the read loop early when nrows is set:

    if file_rows_needed is not None and len(data) >= file_rows_needed:
        break
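
That early-break pattern is easy to replicate if you roll your own reader: stop consuming the row iterator as soon as enough rows have been collected (a generic sketch; `take_rows` is an illustrative name):

```python
def take_rows(rows, n):
    # Collect at most n rows, then stop pulling from the iterator so
    # the rest of the file is never parsed.
    data = []
    for row in rows:
        data.append(row)
        if len(data) >= n:  # mirrors pandas' file_rows_needed check
            break
    return data
```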
    

    Polars read_excel uses fastexcel as its default engine.

    fastexcel does have its own n_rows option, which you can set via pl.read_excel(..., read_options={"n_rows": 20}), but it did not change memory usage in my testing.

    Polars also supports engine="openpyxl", but from what I can see it always reads all of the data.

    openpyxl's iter_rows() does have a max_row= argument, which used slightly less memory in my testing than pandas' early break:

    from openpyxl import load_workbook


    def test_openpyxl_partial_max_row(path_xlsx: str):
        default_kwargs = {"read_only": True, "data_only": True, "keep_links": False}

        wb = load_workbook(path_xlsx, **default_kwargs)

        for name in wb.sheetnames:
            ws = wb[name]
            for row in ws.iter_rows(max_row=21):  # header + 20 data rows
                text = [cell.value for cell in row]