sql-serverazureazure-synapseazure-synapse-analytics

Is there a way to include partition paths of the file path as a column in Azure Synapse SQL?


I have written parquet files in a path like this on the blob storage container: https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/businesspartner=P1, https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/businesspartner=P2, etc.

When I read the parquet files in Synapse SQL using OPENROWSET, like below --

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/**',
        FORMAT = 'PARQUET'
    ) AS [result]

I realized that the output table does NOT have businesspartner column. When I was using Spark/PySpark, I was able to do something like below to (source) see businesspartner as a column:

val df = sparkSession.read
  .option("basePath", path)
  .parquet(path + "/businesspartner=*/*.parquet")

I looked around and found these resources (1, 2 and 3). The closest approaches I can see, which are not quite natural for a SQL person like myself, are this approach, which suggests using WHERE filepath() = 'P1'

or as this approach suggests, to explicitly list individual blob paths for each partition to filter them like below:

SELECT
    TOP 100 *
FROM
    OPENROWSET(
        BULK (
'https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/businesspartner=P1',
'https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/businesspartner=P2'
),
        FORMAT = 'PARQUET'
    ) AS [result]

Is there a parameter in OPENROWSET or some other approach that would concisely allow me to read the whole suite of partitions as a column in the result table, from which I can use WHERE to filter the businesspartners if I wish to?

Thank you in advance for your answers/suggestions!


Solution

  • Assuming you are using SQL Serverless then you can pass a parameter to filepath as below (as documented here)

    SELECT
        TOP 100 *,
                [result].filepath() AS WholePath,
                [result].filepath(1) AS Wildcard1,
                [result].filepath(2) AS Wildcard2
    FROM
        OPENROWSET(
            BULK 'https://<my_blob_storage_accnt>.dfs.core.windows.net/current-snapshot/*=*/.parquet',
            FORMAT = 'PARQUET'
        ) AS [result]
    WHERE [result].filepath(2) IN ('P1', 'P2')