apache-spark pyspark great-expectations

How to get Great_Expectations to work with Spark Dataframes in Apache Spark ValueError: Unrecognized spark type: string


I have an Apache Spark dataframe which has a 'string' type field. However, Great_Expectations doesn't recognize the field type. I have imported the modules that I think are necessary, but I'm not sure why Great_Expectations doesn't recognize the field.

import great_expectations as ge
import great_expectations.dataset.sparkdf_dataset
from great_expectations.dataset.sparkdf_dataset import SparkDFDataset
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType

The following code reads in the CSV as a dataframe:

test = spark.read.csv('abfss://root@adlspretbiukadlsdev.dfs.core.windows.net/RAW/LANDING/customers.csv', inferSchema=True, header=True)

The following shows the schema:

test.printSchema()
root
 |-- first_name: string (nullable = true)

I believe the following line of code creates a Great_Expectations dataset from the above Spark dataframe:

test2 = ge.dataset.SparkDFDataset(test)

I then run the following expectation:

test2.expect_column_values_to_be_of_type(column='first_name', type_='string')

However, I get the following error:

ValueError: Unrecognized spark type: string
Traceback (most recent call last):

  File "/home/trusted-service-user/cluster-env/env/lib/python3.6/site-packages/great_expectations/data_asset/util.py", line 80, in f
    return self.mthd(obj, *args, **kwargs)

I'm not sure why Great_Expectations cannot recognize the Spark type.


Solution

  • You need to pass the Spark type class name (e.g. `StringType`), not the simple type name shown by `printSchema()`:

    INPUT:

    test2.expect_column_values_to_be_of_type(column='first_name', type_='StringType')
    

    OUTPUT (if the column is indeed `StringType`):

    {
      "success": true,
      "meta": {},
      "result": {
        "observed_value": "StringType"
      },
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    }
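Why the error occurs: `SparkDFDataset` resolves the `type_` argument by looking the string up among the class names in `pyspark.sql.types` (e.g. `StringType`, `IntegerType`), so the lowercase name `'string'` that `printSchema()` prints is not found. The following is a simplified sketch of that lookup, not the actual library code; the set of names here is illustrative only:

```python
# Illustrative subset of class names from pyspark.sql.types.
# In the real library the lookup is done against the module itself.
SPARK_TYPE_NAMES = {
    "StringType", "IntegerType", "LongType",
    "DoubleType", "BooleanType", "TimestampType",
}

def resolve_spark_type(type_name):
    """Return the name if it matches a Spark type class name, else raise."""
    if type_name not in SPARK_TYPE_NAMES:
        raise ValueError(f"Unrecognized spark type: {type_name}")
    return type_name

resolve_spark_type("StringType")   # succeeds
# resolve_spark_type("string")     # raises ValueError: Unrecognized spark type: string
```

In this sketch, `resolve_spark_type('string')` raises exactly the kind of `ValueError` seen in the question, while `'StringType'` passes, which is why switching the argument fixes the expectation.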