
Using Great Expectations with index of pandas data frame


If I have a data frame

import pandas as pd

df = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]})

I can use Great Expectations to check the names and dtypes of the columns like so:

import great_expectations as ge

df_asset = ge.from_pandas(df)

# List of expectations
df_asset.expect_column_to_exist('A')
df_asset.expect_column_to_exist('B')
df_asset.expect_column_values_to_be_of_type('A', 'float')
df_asset.expect_column_values_to_be_of_type('B', 'float')

if df_asset.validate()["success"]:
    print("Validation passed")
else:
    print("Validation failed")

But how can I do a similar thing to check the index of the data frame? For example, if the data frame were instead

df = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]}).set_index('A')

I am looking for something like

df_asset.expect_index_to_exist('idx')
df_asset.expect_index_values_to_be_of_type('idx', 'float')

to use in the list of expectations.


Solution

  • One quick workaround is to call .reset_index() to convert the index into a regular column, which the usual column expectations can then check:

    import great_expectations as ge
    
    df_asset = ge.from_pandas(df.reset_index())
    
    # List of expectations
    df_asset.expect_column_to_exist('A')
    df_asset.expect_column_to_exist('B')
    df_asset.expect_column_values_to_be_of_type('A', 'float')
    df_asset.expect_column_values_to_be_of_type('B', 'float')
    
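    # index-related expectations; an unnamed default index becomes the column
    # 'index', while a named index keeps its own name (e.g. 'A' after set_index('A'))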
    df_asset.expect_column_to_exist('index')
    df_asset.expect_column_values_to_be_of_type('index', 'int')
    
    if df_asset.validate()["success"]:
        print("Validation passed")
    else:
        print("Validation failed")
    
    

    Note that an unnamed index becomes a column called 'index' (a named index keeps its own name), but you can also choose the column name with the names keyword argument of reset_index (this requires pandas>=1.5.0). Here is an example:

    df_asset = ge.from_pandas(df.reset_index(names='custom_index_name'))
    

    This is useful when you want to avoid clashes with existing column names. The same approach works for a MultiIndex: pass a list or tuple with one custom name per index level.
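
To see which column names the expectations should refer to after the reset, here is a minimal pandas-only sketch (it makes no Great Expectations calls). It assumes pandas>=1.5.0 for the names argument, and the names idx_num and idx_str are made up for illustration.

```python
import pandas as pd

# Same data as in the question, with 'A' promoted to the index.
df = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]}).set_index('A')

# A named index keeps its name when it is turned back into a column.
flat = df.reset_index()
print(list(flat.columns))
print(flat['A'].dtype)

# An unnamed index becomes a column called 'index'.
unnamed = pd.DataFrame({'B': [4.4, 5.5, 6.6]}).reset_index()
print(list(unnamed.columns))

# For a MultiIndex, names= takes one name per level
# (idx_num and idx_str are hypothetical names chosen for this example).
multi = pd.DataFrame(
    {'B': [4.4, 5.5, 6.6]},
    index=pd.MultiIndex.from_tuples([(1, 'x'), (2, 'y'), (3, 'z')]),
)
renamed = multi.reset_index(names=['idx_num', 'idx_str'])
print(list(renamed.columns))
```

So for the data frame in the question, whose index is named 'A', the index-related expectations would most likely need to target the column 'A' with type 'float' rather than 'index' with type 'int'.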