I am using ADLS Gen2 from a Databricks notebook and trying to process a file through an 'abfss' path. I can read Parquet files just fine, but when I try to load XML files I get an error saying the configuration is not found: Configuration property xxx.dfs.core.windows.net not found.
I haven't tried mounting the storage, but I'd like to understand whether this is a known limitation with XML files, given that Parquet files read without problems.
Here is my XML library configuration: com.databricks:spark-xml_2.11:0.9.0
I tried a couple of things suggested in other articles but still get the same error.
# abfsssourcename is the abfss path of the source file
# The parentheses let the chained calls span several lines
df = (
    spark.read.format("xml")
    .option("rootTag", "BookArticle")
    .option("inferSchema", "true")
    .option("error_bad_lines", True)
    .option("mode", "DROPMALFORMED")
    .load(abfsssourcename)
)
Exception Details: Py4JJavaError: An error occurred while calling o1113.load.
Configuration property xxxx.dfs.core.windows.net not found.
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1281)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1269)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:820)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1269)
    at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:43)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
To summarize the solution:
The package com.databricks:spark-xml appears to use the RDD API to read XML files (the stack trace goes through SparkContext.newAPIHadoopFile). When the RDD API is used to access Azure Data Lake Storage Gen2, Hadoop configuration options set with spark.conf.set(...) are not accessible, which is presumably why the Parquet read works while the XML read fails. Set the account key on the Hadoop configuration directly instead: spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, see the Azure Databricks documentation on Azure Data Lake Storage Gen2.
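As a rough illustration of the fix, the sketch below wraps the Hadoop configuration call in two small helpers. The helper names (account_key_property, set_account_key) are hypothetical and not part of any library, and the account name and key are placeholders, as in the question.

```python
def account_key_property(storage_account):
    """Build the Hadoop property name that holds the storage account key."""
    return "fs.azure.account.key.{}.dfs.core.windows.net".format(storage_account)


def set_account_key(spark, storage_account, account_key):
    """Set the key on the Hadoop configuration, which RDD-based readers consult.

    `spark` is the SparkSession that a Databricks notebook already provides.
    """
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set(account_key_property(storage_account), account_key)
    return hadoop_conf


# Usage in a notebook (placeholder values, not real credentials):
# set_account_key(spark, "xxxxx", "xxxx==")
# df = spark.read.format("xml").load(abfsssourcename)
```

In practice the key would normally come from a secret scope rather than being pasted into the notebook.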
Alternatively, you can mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks.
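A minimal sketch of the mount approach follows, assuming authentication with a service principal over OAuth, which the original answer does not specify. The helper names and all argument values are hypothetical; dbutils is the utility object available in Databricks notebooks.

```python
def oauth_mount_configs(client_id, client_secret, tenant_id):
    """Build the extra_configs dict for an OAuth (service principal) mount."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/{}/oauth2/token".format(tenant_id),
    }


def mount_adls_gen2(dbutils, container, storage_account, mount_point, configs):
    """Mount an ADLS Gen2 container; returns the abfss source URI that was mounted."""
    source = "abfss://{}@{}.dfs.core.windows.net/".format(container, storage_account)
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return source


# Usage in a notebook (all values are placeholders):
# configs = oauth_mount_configs("<client-id>", "<client-secret>", "<tenant-id>")
# mount_adls_gen2(dbutils, "<container>", "<account>", "/mnt/<name>", configs)
# df = spark.read.format("xml").load("/mnt/<name>/<path-to-file>")
```

Once mounted, the XML file is read through the /mnt path, so no per-session account key configuration is needed.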