apache-spark, apache-spark-sql, databricks, azure-databricks, apache-spark-xml

Databricks Spark CREATE TABLE takes forever for 1 million small XML files


I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to use CREATE TABLE, with the expectation of one record per file.

The Experiment

The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.

<OBSERVATION>
  <HEADER>...</HEADER>
  <RESULT>
    <ID>...</ID>
    <VALUES>...</VALUES>
  </RESULT>
</OBSERVATION>

For parsing/deserialization, I am using spark-xml by Databricks. At the moment I am expecting records with two columns, HEADER and RESULT, which is what I am getting.

CREATE TABLE Observations
USING XML
OPTIONS (
  path "/mnt/blobstorage/records/*.xml",
  rowTag "RESULT",
  rootTag "OBSERVATION",
  excludeAttribute True
)

The Problem

The CREATE TABLE statement runs for 5.5 hours (a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), out of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).

I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of the time. At first I thought it was a scaling problem in the storage connector. However, I can run the same command on ~500K JSON files of similar size in ~25 s (a problem with XML vs. JSON?).

I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To get a feel for the scaling behaviour, I also ran the statement on smaller subsets of the files:

The same statement runs within 20s for 10K records, and in 30 mins for 200K records. With linear scaling (which is obviously not happening), 1 million records would have been done in ~33 minutes.

My Databricks cluster has 1 driver node and 3 worker nodes, each with 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue consistently in multiple runs over 4 days.

The Question

What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?


Solution

  • My guess is that you are running into a small-file problem, since you are processing only about 15 GB in total. I would merge the small files into bigger files of ca. 250 MB each. As your dataset is still small, you could do this on the driver. The following code shows such a merge on the driver node (without considering the optimal file size):

    1. Copy the files from Blob to local file-system and generate a script for file merge:

    # copy files from mounted storage to driver local storage
    dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)

    unzipdir = 'temp/records/'
    gzipdir = 'temp/gzip/'

    # generate a shell script (create the target dir, concatenate all XML files, gzip the result)
    # and write it into the local filesystem
    script = ("mkdir -p " + gzipdir + "\n" +
              "cat " + unzipdir + "*.xml > " + gzipdir + "all.xml\n" +
              "gzip " + gzipdir + "all.xml")
    dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)
    

    2. Run the shell script

    %sh
    sudo sh ./scripts/makeone.sh
    

    3. Copy the files back to the mounted storage

    dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/blobstorage/recordsopt/", recurse=True)
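
    Optionally, you can list the target folder to confirm that the merged, gzipped file arrived; the path is the same one used in the mv above:

    # list the destination folder on the mounted storage
    display(dbutils.fs.ls("dbfs:/mnt/blobstorage/recordsopt/"))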
    

    Another important point is that the spark-xml library uses a two-step approach:

    1. It parses the data to infer the schema. If the parameter samplingRatio is not changed, it does this for the whole dataset. Often it is enough to do this on a smaller sample, or you can predefine the schema (use the parameter schema for this), in which case you don't need this step at all (see the sketch after this list).
    2. It reads the data.
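
    To illustrate both options, here is a minimal PySpark sketch. The schema below is an assumption based on the <RESULT> element shown in the question (ID and VALUES read as strings), so adjust the field names and types to your real documents:

    from pyspark.sql.types import StructType, StructField, StringType

    # Assumed schema for the RESULT row tag -- adapt to the actual XML structure.
    result_schema = StructType([
        StructField("ID", StringType(), True),
        StructField("VALUES", StringType(), True),
    ])

    # Option 1: infer the schema from a small sample of the rows only.
    df_sampled = (spark.read.format("xml")
                  .option("rowTag", "RESULT")
                  .option("samplingRatio", 0.001)
                  .load("/mnt/blobstorage/records/*.xml"))

    # Option 2: skip schema inference entirely by supplying the schema up front.
    df_typed = (spark.read.format("xml")
                .option("rowTag", "RESULT")
                .schema(result_schema)
                .load("/mnt/blobstorage/records/*.xml"))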

    Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run on a column-based format rather than directly on the XMLs, and using the spark-xml library only for this preprocessing step.
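
    As a rough sketch of that preprocessing step, assuming the merged, gzipped file from step 3 ends up at /mnt/blobstorage/recordsopt/all.xml.gz and that the Parquet output path below is just an illustrative name (you can also pass the predefined schema from the sketch above via .schema(...)):

    # Read the merged XML; a single .gz file is decompressed via the Hadoop codecs
    # but is not splittable, so it is read as one partition.
    observations = (spark.read.format("xml")
                    .option("rowTag", "RESULT")
                    .load("/mnt/blobstorage/recordsopt/all.xml.gz"))

    # Persist as Parquet for the downstream analytical queries.
    observations.write.mode("overwrite").parquet("/mnt/blobstorage/observations_parquet/")

    # Later queries can run against the Parquet copy:
    spark.read.parquet("/mnt/blobstorage/observations_parquet/").createOrReplaceTempView("Observations")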