
How can I remove nulls in the process of melting a polars dataframe?


I have a large polars dataframe that I need to melt. At least half of its values are null, and I want to drop those nulls while melting. I already tried melting the dataframe first and then filtering it with drop_nulls() (and similar approaches), but this is too memory-intensive, even on a machine with about 1 TB of RAM.

Is there any way to filter out the nulls already during the melt itself?

Any help is appreciated!

Sample data.

import polars as pl

# in reality, this dataset has about 160k rows and columns
# (square matrix), and is about 100GB
df = pl.DataFrame({
    "A": [None, 2, 3],
    "B": [None, None, 2],
    "C": [None, None, None],
    "names": ["A", "B", "C"],
})

df.melt(id_vars="names", variable_name="names_2", value_name="distance")

Output.

shape: (9, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ ---   ┆ ---     ┆ ---      │
│ str   ┆ str     ┆ i64      │
╞═══════╪═════════╪══════════╡
│ A     ┆ A       ┆ null     │
│ B     ┆ A       ┆ null     │
│ C     ┆ A       ┆ null     │
│ A     ┆ B       ┆ 2        │
│ B     ┆ B       ┆ null     │
│ C     ┆ B       ┆ null     │
│ A     ┆ C       ┆ 3        │
│ B     ┆ C       ┆ 2        │
│ C     ┆ C       ┆ null     │
└───────┴─────────┴──────────┘

This could then be filtered (e.g. using df = df.drop_nulls()), but I would like to get this desired result directly from the melt.
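
For completeness, the eager two-step approach described above (which works fine on this toy example, but is too memory-hungry at the real data size) looks like this:

df.melt(
    id_vars="names", variable_name="names_2", value_name="distance"
).drop_nulls()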

Expected output.

shape: (3, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ ---   ┆ ---     ┆ ---      │
│ str   ┆ str     ┆ i64      │
╞═══════╪═════════╪══════════╡
│ A     ┆ B       ┆ 2        │
│ A     ┆ C       ┆ 3        │
│ B     ┆ C       ┆ 2        │
└───────┴─────────┴──────────┘

Solution

  • You can most likely make this operation much more efficient, in both runtime and memory consumption, by using pl.LazyFrame and polars' streaming engine.

    By using pl.LazyFrame, the melt / unpivot and the filter / drop_nulls are not executed eagerly; instead, they are first collected into a query plan. When the lazy frame is collected (i.e. materialised into a pl.DataFrame), the query plan can be optimised as a whole, taking subsequent operations into account (you can inspect the optimised plan, as sketched after the code below).

    Streaming means the processing is not done all at once, but in batches, ensuring that the batches being processed do not grow larger than memory.

    (
        df
        # convert to pl.LazyFrame
        .lazy()
        # create query plan
        .melt(
            id_vars="names",
            variable_name="names_2",
            value_name="distance"
        )
        .drop_nulls()
        # collect pl.LazyFrame while using streaming engine
        .collect(streaming=True)
    )
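
    To see what the optimiser does with this query plan before actually running it, you can print it with LazyFrame.explain(). A minimal sketch (the exact rendering of the plan depends on your polars version):

    print(
        df
        .lazy()
        .melt(
            id_vars="names",
            variable_name="names_2",
            value_name="distance"
        )
        .drop_nulls()
        # explain() returns the optimised query plan as a string
        .explain()
    )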
    

    Note. Tentative tests on my machine show large improvements in both runtime and memory consumption.
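
  • If you are on a recent polars release, be aware that the names have changed: melt was renamed to unpivot (with index replacing id_vars) in polars 1.0, and collect(streaming=True) has since been superseded by selecting the streaming engine explicitly. A rough equivalent, assuming a current 1.x release:

    (
        df
        .lazy()
        # unpivot replaces melt; index replaces id_vars
        .unpivot(
            index="names",
            variable_name="names_2",
            value_name="distance"
        )
        .drop_nulls()
        # newer releases select the streaming engine via collect()
        .collect(engine="streaming")
    )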