pythonpython-polars

Difference between 2 Polars dataframes


What is the best way to find the differences between 2 Polars dataframes? The equals method tells me if there is a difference, I want to find where is the difference.

Example:

import polars as pl

df1 = pl.DataFrame([
    {'id': 1,'col1': ['a',None],'col2': ['x']},
    {'id': 2,'col1': ['b'],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘


df2 = pl.DataFrame([
    {'id': 1,'col1': ['a'],'col2': ['x']},
    {'id': 2,'col1': ['b', None],'col2': ['y', None]},
    {'id': 3,'col1': [None],'col2': ['z']}]
)

┌─────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        │
│ --- ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╡
│ 1   ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┘

The difference in the example is for id = 1 and id = 2.

I can join the dataframes:

df1.join(df2, on='id', suffix='_df2')
┌─────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id  ┆ col1        ┆ col2        ┆ col1_df2    ┆ col2_df2    │
│ --- ┆ ---         ┆ ---         ┆ ---         ┆ ---         │
│ i64 ┆ list[str]   ┆ list[str]   ┆ list[str]   ┆ list[str]   │
╞═════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1   ┆ ["a", null] ┆ ["x"]       ┆ ["a"]       ┆ ["x"]       │
│ 2   ┆ ["b"]       ┆ ["y", null] ┆ ["b", null] ┆ ["y", null] │
│ 3   ┆ [null]      ┆ ["z"]       ┆ [null]      ┆ ["z"]       │
└─────┴─────────────┴─────────────┴─────────────┴─────────────┘

Expected result

I would like to either:

The example has only 2 columns, but there are more columns in the dataframe.


Solution

  • Here's the filter approach

    (
        df1.join(df2, on='id', suffix='_df2') 
        .filter(pl.any_horizontal(
            pl.col(x).ne_missing(pl.col(f"{x}_df2"))
                for x in df1.columns if x!='id'
            ))
    )
    

    If you wanted the bool column then you just change the filter to with_columns and add an alias.

    (
        df1.join(df2, on='id', suffix='_df2')
        .with_columns(
            has_diff = pl.any_horizontal(
                pl.col(x).ne_missing(pl.col(f"{x}_df2"))
                for x in df1.columns if x!='id'
                )
            )
    )
    

    This assumes that each df has all the same columns other than 'id'.