Setup: I'm using the great_expectations package to test my data quality, with an InferredAssetAzureDataConnector data_connector to create my data source (this part works: I can see my files on ADLS during creation). I am specifically wondering how to add jars to the config of the Spark session that great_expectations uses when running the auto-profiler to create a test suite.
The process fails because I need to add the org.apache.hadoop:hadoop-azure:3.3.1
jar to the Spark session so that the Spark job can access and profile the data on ADLS.
Any help on how to do this in the context of the great_expectations package is appreciated.
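One thing worth trying before touching any system-wide config: in the V3 (Batch Request) API, the SparkDFExecutionEngine accepts a spark_config mapping that it applies when building its Spark session, so the jars can be declared per-datasource. A sketch only; the datasource and connector names are placeholders, and I haven't verified this key against every GE version:

```yaml
name: adls_spark_datasource          # placeholder name
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
  spark_config:
    spark.jars.packages: "com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1"
data_connectors:
  default_inferred_data_connector:
    class_name: InferredAssetAzureDataConnector
    # container / azure_options / asset-pattern settings as in your
    # existing, working data-connector config
```

If the engine honors spark_config here, the packages are scoped to the sessions Great Expectations itself creates rather than to every Spark job on the machine.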
The error message:
Great Expectations will create a notebook, containing code cells that select from
available columns in your dataset and generate expectations about them to demonstrate
some examples of assertions you can make about your data.
When you run this notebook, Great Expectations will store these
expectations in a new Expectation Suite "adls_test_suite_tmp" here:
file://C:\Coding\...\great_expectations\expectations/adls_suite_tmp.json
Would you like to proceed? [Y/n]: Y
WARN FileStreamSink: Assume no metadata directory.
Error while looking for metadata directory in the path:
wasbs://<adls-container>@<adls-account>.blob.core.windows.net/test/myfile.csv
java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
I semi-solved it by adding the jars to the spark-defaults.conf
file, but I'm unhappy with this dirty workaround: every Spark job started on the system will now pull in these jar packages. If anyone has a better solution, please share.
spark.jars.packages com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1
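A more narrowly scoped alternative, assuming great_expectations launches Spark through pyspark: set the PYSPARK_SUBMIT_ARGS environment variable, which pyspark reads when it starts the JVM (the value must end with pyspark-shell):

```shell
# Limits the extra packages to this shell session only, instead of
# affecting every Spark job on the machine via spark-defaults.conf.
export PYSPARK_SUBMIT_ARGS='--packages com.microsoft.azure:azure-storage:8.6.6,org.apache.hadoop:hadoop-azure:3.3.1 pyspark-shell'
```

Run the great_expectations CLI from the same shell so the profiler's Spark session picks the flags up.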