Tags: apache-spark, pyspark, databricks, azure-databricks, great-expectations

How to Save Great Expectations results to File From Apache Spark - With Data Docs


I have successfully created a Great Expectations result and I would like to output the results of the expectation to an HTML file.

There are a few links explaining how to show the results in a human-readable form using what is called 'Data Docs': https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started/set_up_data_docs.html#tutorials-getting-started-set-up-data-docs

But to be quite honest, the documentation is extremely hard to follow.

My expectation simply verifies that the passenger counts in my dataset fall between 1 and 6. I would like help outputting the results to a folder using 'Data Docs', or however else it is possible to output the data to a folder:

import great_expectations as ge
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
from great_expectations.data_asset import DataAsset

from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, FilesystemStoreBackendDefaults
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.resource_identifiers import ValidationResultIdentifier
from datetime import datetime


df_taxi = spark.read.csv('abfss://root@adlspretbiukadlsdev.dfs.core.windows.net/RAW/LANDING/yellow_trip_data_sample_2019-01.csv', inferSchema=True, header=True)

taxi_rides = SparkDFDataset(df_taxi)

taxi_rides.expect_column_values_to_be_between(column='passenger_count', min_value=1, max_value=6)

taxi_rides.save_expectation_suite()

The code is run from Apache Spark.

If someone could just point me in the right direction, I will be able to figure it out.


Solution

  • You can visualize Data Docs on Databricks: you just need to use the correct renderer* combined with DefaultJinjaPageView, which renders the result into HTML that can then be shown with displayHTML. First, we need to import the necessary classes/functions:

    import great_expectations as ge
    from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler
    from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
    from great_expectations.render.renderer import (
        ProfilingResultsPageRenderer,
        ValidationResultsPageRenderer,
        ExpectationSuitePageRenderer,
    )
    from great_expectations.render.view import DefaultJinjaPageView
    

    To see the result of profiling, we need to use ProfilingResultsPageRenderer:

    expectation_suite, validation_result = BasicDatasetProfiler.profile(SparkDFDataset(df))
    document_model = ProfilingResultsPageRenderer().render(validation_result)
    displayHTML(DefaultJinjaPageView().render(document_model))
    

    It will show something like this:

    (screenshot of the rendered profiling results page)

    We can visualize the results of validation with ValidationResultsPageRenderer:

    gdf = SparkDFDataset(df)
    gdf.expect_column_values_to_be_of_type("county", "StringType")
    gdf.expect_column_values_to_be_between("cases", 0, 1000)
    validation_result = gdf.validate()
    document_model = ValidationResultsPageRenderer().render(validation_result)
    displayHTML(DefaultJinjaPageView().render(document_model))
    

    It will show something like this:

    (screenshot of the rendered validation results page)

    Or we can render the expectation suite itself with ExpectationSuitePageRenderer:

    gdf = SparkDFDataset(df)
    gdf.expect_column_values_to_be_of_type("county", "StringType")
    document_model = ExpectationSuitePageRenderer().render(gdf.get_expectation_suite())
    displayHTML(DefaultJinjaPageView().render(document_model))
    

    It will show something like this:

    (screenshot of the rendered expectation suite page)

    If you're not using Databricks, you can still render the data into HTML and store it as a file somewhere.

    * The renderer documentation referenced above is technically part of the "Legacy" docs now but is still valid; the new docs site lacks this detail at the time of writing.