pythondataframejoinmergepython-polars

Pandas merge functionality in Polars


I want to merge two Polars DataFrames (possibly on multiple columns) with the columns used for merging not being duplicated.

Current behavior:

import polars as pl

a = pl.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
b = pl.DataFrame({'a': [3, 4, 5], 'b': [3, 4, 5]})

a.join(b, how='full', on='a')
shape: (5, 4)
┌──────┬──────┬─────────┬─────────┐
│ a    ┆ b    ┆ a_right ┆ b_right │
│ ---  ┆ ---  ┆ ---     ┆ ---     │
│ i64  ┆ i64  ┆ i64     ┆ i64     │
╞══════╪══════╪═════════╪═════════╡
│ 3    ┆ 3    ┆ 3       ┆ 3       │
│ null ┆ null ┆ 4       ┆ 4       │
│ null ┆ null ┆ 5       ┆ 5       │
│ 2    ┆ 2    ┆ null    ┆ null    │
│ 1    ┆ 1    ┆ null    ┆ null    │
└──────┴──────┴─────────┴─────────┘

What I want:

shape: (5, 3)
┌─────┬──────┬──────┐
│ a   ┆ b_x  ┆ b_y  │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ 1   ┆ 1    ┆ null │
│ 2   ┆ 2    ┆ null │
│ 3   ┆ 3    ┆ 3    │
│ 4   ┆ null ┆ 4    │
│ 5   ┆ null ┆ 5    │
└─────┴──────┴──────┘

I want this behavior as I have large DataFrames which I want to merge on multiple columns. One way that might work is to just merge the resulting key columns somehow and then remove the duplicate columns but that seems cumbersome.


Solution

  • Setting the coalesce parameter of pl.DataFrame.join to True gives the expected dataframe.

    a.join(b, on="a", how="full", coalesce=True).sort("a")
    
    shape: (5, 3)
    ┌─────┬──────┬─────────┐
    │ a   ┆ b    ┆ b_right │
    │ --- ┆ ---  ┆ ---     │
    │ i64 ┆ i64  ┆ i64     │
    ╞═════╪══════╪═════════╡
    │ 1   ┆ 1    ┆ null    │
    │ 2   ┆ 2    ┆ null    │
    │ 3   ┆ 3    ┆ 3       │
    │ 4   ┆ null ┆ 4       │
    │ 5   ┆ null ┆ 5       │
    └─────┴──────┴─────────┘
    

    Note. The final sort("a") is only there to produce the expected ordering.