azure-databricks · azure-synapse · delta-lake

Azure Synapse serverless pool can't read an external Delta table properly


I'm new to Databricks and the Azure platform.

My company has some Delta tables stored in ADLS Gen2, and we use Databricks to update those tables (upsert or append) every day.

We also created external views in the Synapse serverless pool that point to those Delta tables, using a query like this:

    SELECT *
    FROM OPENROWSET(BULK '<path of delta table>', DATA_SOURCE = '<container>', FORMAT = 'delta') AS r
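For reference, a minimal sketch of how such a view might be defined around that query; the view name and schema are placeholders, not details from the actual setup:

    CREATE VIEW dbo.vw_delta_transactions
    AS
    SELECT *
    FROM OPENROWSET(BULK '<path of delta table>', DATA_SOURCE = '<container>', FORMAT = 'delta') AS r;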

But sometimes, when we query the view in Synapse, not all of the data is returned correctly.

For example, we have a Delta table in ADLS Gen2 storing the last 30 days of transactions.

If I read the Delta table in Databricks, everything works fine.

But when we query the external view pointing to that Delta table in Synapse, sometimes it returns all the data, sometimes some records are duplicated, and sometimes not all of the data is shown (maybe only the last 5 or 10 days).

I did some research but still can't figure out the solution.

Thank you!

I've checked the view's connection, and the path in the OPENROWSET function is correct.

I tried pulling the full data set and overwriting the table in Databricks, and the data in the Synapse view went back to normal (a sketch of that workaround follows below). It seems the problem only appears when I use append or upsert.
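For context, a minimal sketch of that overwrite workaround as Databricks SQL, assuming the Delta table is registered in the metastore as transactions_last_30d and the rebuilt data set is exposed as full_transactions_source (both names are placeholders):

    -- Rewrites the whole Delta table in a single commit instead of appending/merging.
    -- Both table names are placeholders for the real objects.
    INSERT OVERWRITE TABLE transactions_last_30d
    SELECT * FROM full_transactions_source;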

I also tried running OPTIMIZE in Databricks, but it didn't help.


Solution

  • The OPENROWSET function in Synapse serverless SQL pools has limitations when reading the Delta format. Consider creating an external table instead of a view (a fuller sketch of the supporting objects follows the snippet below):

    CREATE EXTERNAL TABLE {ExternalTable} (
        -- Schema: column definitions matching the Delta table
    )
    WITH (
        DATA_SOURCE = {your-data-source},      -- name of an existing external data source (not a string literal)
        LOCATION    = '{path-to-delta-table}',
        FILE_FORMAT = {your-delta-file-format} -- an external file format created with FORMAT_TYPE = DELTA
    );
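
    For completeness, a hedged end-to-end sketch of the supporting objects and a quick test query. Every object name below (TransactionsDeltaSource, DeltaFileFormat, dbo.TransactionsExternal) and the column list are placeholders, not details from the original setup:

    -- External data source pointing at the ADLS Gen2 container (add a CREDENTIAL clause if needed).
    CREATE EXTERNAL DATA SOURCE TransactionsDeltaSource
    WITH (LOCATION = 'https://<storage-account>.dfs.core.windows.net/<container>');

    -- External file format telling the serverless pool to read the Delta format.
    CREATE EXTERNAL FILE FORMAT DeltaFileFormat
    WITH (FORMAT_TYPE = DELTA);

    CREATE EXTERNAL TABLE dbo.TransactionsExternal (
        TransactionId   BIGINT,
        TransactionDate DATE,
        Amount          DECIMAL(18, 2)
        -- ...remaining columns of the Delta table
    )
    WITH (
        DATA_SOURCE = TransactionsDeltaSource,
        LOCATION    = '<path of delta table>',
        FILE_FORMAT = DeltaFileFormat
    );

    -- Quick sanity check that the external table returns the expected rows.
    SELECT COUNT(*) FROM dbo.TransactionsExternal;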