Consider the following polars dataframes:
>>> import polars as pl
>>> left = pl.DataFrame(pl.Series('a', [1,5,3,2]))
>>> left
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 5   │
│ 3   │
│ 2   │
└─────┘
>>> right = pl.DataFrame([pl.Series('a', [0,1,2,3]), pl.Series('b', [4,5,6,7])])
>>> right
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 4   │
│ 1   ┆ 5   │
│ 2   ┆ 6   │
│ 3   ┆ 7   │
└─────┴─────┘
I would like to join the two in such a way that the order of the a values from the left dataframe is preserved. A left join seems to do this:
>>> left.join(right, on='a', how='left')
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 5    │
│ 5   ┆ null │
│ 3   ┆ 7    │
│ 2   ┆ 6    │
└─────┴──────┘
My question is: is this behaviour guaranteed? If not, what would be the safe way to do this? I could use with_row_index and then do a final sort, but that seems rather cumbersome. In pandas this can be done concisely with the reindex method.
Update (November 2024):
Polars has decided to no longer guarantee preserving row order in left joins in the future. More background on this decision can be found in this GitHub issue. The bottom line: this guarantee may be expensive.
A new maintain_order parameter will be added in the future which allows users to control this behavior. By default, order will not be preserved.
In order to future-proof your code now, you can use the workaround mentioned below of adding a row index and sorting on that.
Original post:
A left join guarantees preserving the order of the left dataframe, at least in the regular engine. For the streaming engine, this might not be guaranteed.
If you want to be 'safe', you already have the right workaround in mind: add a row index to the left dataframe before the join and sort on it afterwards.