I have a large polars dataframe that I need to unpivot. This dataframe contains lots of null values (at least half). I want to drop the nulls while unpivoting the dataframe. I already tried unpivoting the dataframe first and then filtering it with drop_nulls() or similar approaches. However, this is too memory-intensive, even on a machine with about 1TB of RAM.
Is there any way to filter the dataset already during the unpivot itself?
Any help is appreciated!
Sample data.
import polars as pl

# in reality, this dataset has about 160k rows and columns
# (square matrix), and is about 100GB
data = {
    "A": [None, 2, 3],
    "B": [None, None, 2],
    "C": [None, None, None],
    "names": ["A", "B", "C"],
}
df = pl.DataFrame(data)
df.unpivot(index="names", variable_name="names_2", value_name="distance")
Output.
shape: (9, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════╪═════════╪══════════╡
│ A ┆ A ┆ null │
│ B ┆ A ┆ null │
│ C ┆ A ┆ null │
│ A ┆ B ┆ 2 │
│ B ┆ B ┆ null │
│ C ┆ B ┆ null │
│ A ┆ C ┆ 3 │
│ B ┆ C ┆ 2 │
│ C ┆ C ┆ null │
└───────┴─────────┴──────────┘
This could then be filtered (e.g. using df = df.drop_nulls()), but I would like to get the desired result directly from the unpivot.
Expected output.
shape: (3, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════╪═════════╪══════════╡
│ A ┆ B ┆ 2 │
│ A ┆ C ┆ 3 │
│ B ┆ C ┆ 2 │
└───────┴─────────┴──────────┘
Very likely you can make the operation more efficient, in both runtime and memory consumption, by using pl.LazyFrame and polars' streaming engine.
By using pl.LazyFrame, the melt / unpivot and filter / drop_nulls won't be executed eagerly, but are first aggregated into a query plan. When collecting the lazy frame (i.e. materialising it into a pl.DataFrame), the query plan can be optimised, taking subsequent operations into account.
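You can inspect the optimised plan before running anything. A minimal sketch using LazyFrame.explain() (the exact plan text differs between polars versions):

# build the lazy query plan and print the optimised version
# that polars will actually execute
lazy_plan = (
    df.lazy()
    .unpivot(index="names", variable_name="names_2", value_name="distance")
    .drop_nulls()
)
print(lazy_plan.explain())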
Streaming enables the processing to be done not all at once, but in batches, ensuring that the batches being processed don't grow larger than memory.
(
    df
    # convert to pl.LazyFrame
    .lazy()
    # create query plan
    .unpivot(
        index="names",
        variable_name="names_2",
        value_name="distance",
    )
    .drop_nulls()
    # collect pl.LazyFrame while using streaming engine
    .collect(streaming=True)
)
Note. Tentative tests on my machine show large improvements in both runtime and memory consumption.
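As a possible variation (not tested on your data), if even the filtered result is too large to collect into memory, you could stream the output straight to disk instead. A sketch using LazyFrame.sink_parquet, with "distances.parquet" as a placeholder path:

# stream the unpivoted, null-filtered result to a parquet file
# without materialising it in memory; the path is a placeholder
(
    df.lazy()
    .unpivot(index="names", variable_name="names_2", value_name="distance")
    .drop_nulls()
    .sink_parquet("distances.parquet")
)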