pandas great-expectations data-quality

Create a custom expectation in Great Expectations to validate multiple unique observations based on a given key in a DataFrame


In Great Expectations, I want to create a custom expectation that validates whether there are multiple unique observations of id_client for a given id_product key in a DataFrame.

After setting up my Great Expectations project, I'm having trouble figuring out how to define and implement a custom expectation for this specific validation.

Here is a data sample:

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'id_product': [1, 1, 2, 2, 2, 3, 3],
    'id_client': [101, 102, 201, 202, 203, 301, 301]
})

This is the validation I can do in pandas but not in Great Expectations:

def count_unique_rows(df, id_column, other_column):
    unique_rows = df.groupby([id_column, other_column]).size().reset_index()
    count = unique_rows.groupby(id_column).size().reset_index(name='count')
    return count


# True if at least one id_product maps to more than one distinct id_client
assert any(count_unique_rows(df, 'id_product', 'id_client')['count'] > 1)

Basically, I want to detect data inconsistencies by setting up a condition like this one.


Solution

  • You could add a custom expectation like this one:

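    import pandas as pd
    import great_expectations as gx
    # Legacy Dataset API: this needs an older Great Expectations release
    # that still ships the great_expectations.dataset module.
    from great_expectations.dataset import (
        PandasDataset,
        MetaPandasDataset,
    )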
    
    class MyCustomPandasDataset(PandasDataset):
    
        _data_asset_type = "MyCustomPandasDataset"
    
        @MetaPandasDataset.column_map_expectation
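        def expect_unique_pair(self, column):
            # Flag each (id_product, id_client) pair that occurs exactly once.
            is_pair_unique_df = (self.groupby(['id_product', 'id_client']).size().to_frame('size') == 1).reset_index()
            # Map the per-pair flag back onto the rows: one boolean per row.
            return pd.merge(self, is_pair_unique_df, on=['id_product', 'id_client'], how="left")["size"]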
    
    my_validated_df = gx.from_pandas(df, dataset_class=MyCustomPandasDataset)
    print(my_validated_df.expect_unique_pair('id_client'))
    

    The expect_unique_pair method checks the given MyCustomPandasDataset for uniqueness of the key [id_product, id_client]. It returns a boolean Series indicating, for each row, whether its pair is unique.
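
For the condition described in the question (more than one distinct id_client under the same id_product), the check can also be written in plain pandas before wrapping it in an expectation. The following is only a minimal sketch: the helper names distinct_values_per_key and keys_with_multiple_values are hypothetical, not part of pandas or Great Expectations, and the sketch assumes "inconsistent" means a key with more than one distinct value.

```python
import pandas as pd


def distinct_values_per_key(df, key_column, value_column):
    """Count the distinct value_column entries for each key_column value."""
    return df.groupby(key_column)[value_column].nunique()


def keys_with_multiple_values(df, key_column, value_column):
    """List the keys that map to more than one distinct value."""
    counts = distinct_values_per_key(df, key_column, value_column)
    return counts[counts > 1].index.tolist()


# Sample data from the question
df = pd.DataFrame({
    "id_product": [1, 1, 2, 2, 2, 3, 3],
    "id_client": [101, 102, 201, 202, 203, 301, 301],
})

counts = distinct_values_per_key(df, "id_product", "id_client")
flagged = keys_with_multiple_values(df, "id_product", "id_client")

# Row-level boolean Series: True where the row's id_product has exactly one
# distinct id_client (assumed here to be the "consistent" case).
row_ok = df.groupby("id_product")["id_client"].transform("nunique") == 1

print(counts.to_dict())  # {1: 2, 2: 3, 3: 1}
print(flagged)           # [1, 2]
print(row_ok.tolist())   # [False, False, False, False, False, True, True]
```

A row-level boolean Series such as row_ok has the same shape as the value returned by expect_unique_pair above (one boolean per row), so the same groupby logic could be moved into the custom expectation body if this is the intended rule.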