I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to load them using CREATE TABLE, with the expectation of one record for each file.
The Experiment
The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.
<OBSERVATION>
<HEADER>...</HEADER>
<RESULT>
<ID>...</ID>
<VALUES>...</VALUES>
</RESULT>
</OBSERVATION>
For parsing/deserialization, I am using spark-xml by Databricks. At the moment I am expecting records with two columns, HEADER and RESULT, which is what I am getting.
CREATE TABLE Observations
USING XML
OPTIONS (
path "/mnt/blobstorage/records/*.xml",
rowTag "RESULT",
rootTag "OBSERVATION",
excludeAttribute True
)
The Problem
The CREATE TABLE statement runs for 5.5 hours (a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).
I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of that time. At first I thought it was a scaling problem in the storage connector, but I can run the same command on ~500K JSON files of similar size in ~25 s (a problem with XML vs. JSON?).
I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To eliminate this possibility, I tried to:
- provide the schema upfront instead of letting it be inferred (sketched below), and
- ingest the files as plain text without parsing (using the TEXT provider).
The same problem persists in both cases. The same statement runs within 20 s for 10K records and in 30 min for 200K records. With linear scaling (which is obviously not happening), 1 million records would have been done in ~33 minutes.
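For illustration, providing the schema upfront could look roughly like this (a sketch only; the field names follow the <RESULT> structure above, and the string types are assumptions):
from pyspark.sql.types import StructType, StructField, StringType

# Assumed schema for the <RESULT> rows shown above (string types are guesses)
result_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("VALUES", StringType(), True),
])

# Passing an explicit schema lets spark-xml skip its schema-inference pass
df = (spark.read.format("xml")
      .schema(result_schema)
      .option("rowTag", "RESULT")
      .option("excludeAttribute", "true")
      .load("/mnt/blobstorage/records/*.xml"))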
My Databricks cluster has 1 driver node and 3 worker nodes, each having 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue consistently in multiple runs over 4 days.
The Question
What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?
My guess is that you are running into a small-file problem, as you are processing only about 15 GB. I would merge the small files into bigger files of roughly 250 MB each. Since your dataset is still small, you could do this on the driver. The following code shows such a merge on the driver node (without considering optimal file size):
# copy files from mounted storage to driver local storage
dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)
unzipdir = 'temp/records/'
gzipdir = 'temp/gzip/'
# generate a shell script (create the target dir, concatenate all XML files, gzip the result)
# and write it into the local filesystem
script = "mkdir -p " + gzipdir + "\ncat " + unzipdir + "*.xml > " + gzipdir + """all.xml
gzip """ + gzipdir + "all.xml"
dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)
%sh
sudo sh ./scripts/makeone.sh
dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/mnt/blobstorage/recordsopt/", recurse=True)
Another important point is that the spark-xml library does a two-step approach: first it parses the data to infer the schema (if the samplingRatio option is lowered, it only evaluates a subset of the data for this step), and then it reads the data again to build the records.
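For example, lowering the sampling ratio for the inference pass could look like this (a sketch; the value 0.001 is only an illustration, and the path again points at the merged files):
# Sketch: infer the schema from only a small sample of the rows (0.001 is an arbitrary value)
df = (spark.read.format("xml")
      .option("rowTag", "RESULT")
      .option("samplingRatio", "0.001")
      .option("excludeAttribute", "true")
      .load("/mnt/blobstorage/recordsopt/"))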
Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run on a column-based format rather than directly on the XML files, and using the spark-xml library only for this preprocessing step.
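A minimal sketch of that preprocessing step (the Parquet output path is a placeholder):
# Sketch: parse the XML once with spark-xml, then persist the result as Parquet
observations = (spark.read.format("xml")
                .option("rowTag", "RESULT")
                .option("excludeAttribute", "true")
                .load("/mnt/blobstorage/recordsopt/"))

# Placeholder output location; later queries read this columnar copy instead of the XML
observations.write.mode("overwrite").parquet("/mnt/blobstorage/observations_parquet/")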