python, apache-spark, great-expectations

Adding jars to the great_expectations Spark session


Setup:

I am specifically wondering how to add jars to the config of the Spark session that great_expectations uses when running the auto-profiler to create a test suite.

The process fails because the Spark session needs the org.apache.hadoop:hadoop-azure:3.3.1 jar in order for the Spark job to access and profile the data on ADLS.

Any help on how to do this in the context of the great_expectations package is appreciated.

The error message:


Great Expectations will create a notebook, containing code cells that select from 
available columns in your dataset and generate expectations about them to demonstrate 
some examples of assertions you can make about your data.

When you run this notebook, Great Expectations will store these 
expectations in a new Expectation Suite "adls_test_suite_tmp" here:

  file://C:\Coding\...\great_expectations\expectations/adls_suite_tmp.json

Would you like to proceed? [Y/n]: Y

WARN FileStreamSink: Assume no metadata directory. 
    Error while looking for metadata directory in the path: 
    wasbs://<adls-container>@<adls-account>.blob.core.windows.net/test/myfile.csv

java.lang.RuntimeException: java.lang.ClassNotFoundException: 
    Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found


Solution

  • I semi-solved it by adding the jars to the spark-defaults.conf file, but I'm unhappy with this dirty workaround: every Spark job started on the system will now pull in these jar packages. If anyone has a better solution, please share.

    spark.jars.packages                 com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1
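
A less invasive alternative, if your version of great_expectations supports it, is to pass the packages to the Spark session that great_expectations itself creates, via the `spark_config` mapping on the execution engine in `great_expectations.yml`. This is a sketch assuming a v3-style datasource configuration; the datasource name `my_spark_datasource` is made up:

```yaml
datasources:
  my_spark_datasource:
    class_name: Datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
      # Passed through to the SparkSession great_expectations builds
      spark_config:
        spark.jars.packages: com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1
```

This scopes the jars to the great_expectations session rather than every Spark job on the machine. Note that `spark.jars.packages` only takes effect when the SparkSession is created, so any already-running session must be stopped first.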