apache-sparkazure-synapseexternal-tables

Synapse Analytics sql on-demand sync with spark pool is very slow to query


I have files loaded into an azure storage account gen2, and am using Azure Synapse Analytics to query them. Following the documentation here: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-storage-files-spark-tables, I should be able to create a spark sql table to query the partitioned data, and thus subsequently use the metadata from spark sql in my sql on demand query to given the line in the doc: When a table is partitioned in Spark, files in storage are organized by folders. Serverless SQL pool will use partition metadata and only target relevant folders and files for your query

My data is partitioned in ADLS gen2 as:

enter image description here

Running the query in a spark notebook in Synapse Analytics returns in just over 4 seconds, as it should given the partitioning: enter image description here

However, now running the same query in the sql on demand sql side script never completes: enter image description here

This result and extreme reduction in performance compared to spark pool is completely counter to what the documentation notes. Is there something I am missing in the query to make sql-on demand use the partitions?


Solution

  • Filepath() and filename() functions can be used in the WHERE clause to filter the files to be read. Which that you can achieve the pruning you have been looking for.