I am using pl.testing.assert_frame_equal to compare two pl.DataFrames. The assertion fails. The traceback indicates that there are exact value mismatches in a certain column.
The column in question is of type bool. It also contains null values. This column has more than 20,000 rows and I need to figure out where exactly the difference is.
What I did was create a mask that shows a true value wherever the actual dataframe differs from the expectation dataframe.
mask = actual != expectation
What I then noticed is that the mask only contains false and null values in every column. mask.sum().sum_horizontal() gives 0.
So this is apparently not a good way to identify the rows with differences.
In my large dataframe I expect a situation like the following:
import polars as pl
from polars.testing import assert_frame_equal
df1 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B"],
        "value": [True, False, None, False, None],
    }
)
df2 = pl.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B"],
        "value": [True, False, False, False, None],
    }
)
Performing assert_frame_equal(df1, df2) will correctly result in an AssertionError.
AssertionError: DataFrames are different (value mismatch for column 'value')
[left]: [True, False, None, False, None]
[right]: [True, False, False, False, None]
The inequality test doesn't help to identify where the differences are, as there are no true values: comparing null with anything yields null rather than true.
df1 != df2
shape: (5, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
│ false ┆ false │
│ false ┆ null │
│ false ┆ false │
│ false ┆ null │
└───────┴───────┘
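If you look at the implementation of assert_frame_equal, it uses .ne_missing() to compare the values. Unlike !=, .ne_missing() treats null as a regular value: null vs. null counts as equal, and null vs. a non-null value counts as different.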
df1.select(df1[col].ne_missing(df2[col]) for col in df1.columns)
shape: (5, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ bool  ┆ bool  │
╞═══════╪═══════╡
│ false ┆ false │
│ false ┆ false │
│ false ┆ true │
│ false ┆ false │
│ false ┆ false │
└───────┴───────┘
(As well as the schema / dtype validation, etc.)