I'm using an Azure Synapse Serverless SQL database.
I have a first copy activity that uses Azure SQL tables as sources and an Azure storage gen2, where I store .parquet files as sink. From these .parquet files, I use CETAS to create External tables in my Serverless SQL DB (for the context: I'm working with several Azure SQL databases, so these external tables will allow me to write cross db queries). In other words, this Serverless SQL DB is my ODS database.
I have then a second copy activity that identifies incremental changes in the source tables (using the SYS_CHANGE_VERSION of the corresponding CHANGETABLE of the source DB). This second copy activity also outputs .parquet files.
In the end, I have 2 parquet files: 1 with the full content of the source table + 1 with the content to insert or update. External tables in a Serverless SQL DB are only metadata, so no way to do DML operations on them, so my question would be: is there a way to "merge" my 2 parquet files into 1 single one (without duplicates of course) that I could use to recreate an updated external table?
Alternatively, I see that I can choose in the copy activity sink the copy method "Upsert" and provide a KeyColumn (PK of my table), but it doesn't work, saying that: "Message=INSERT operation is not allowed for this type of table." (which looks normal as the associated sink dataset is pointing on my external table, that is read only)
copy activity:
Any idea on how to solve this? thanks!
Since external tables in Azure Synapse Serverless SQL database are read-only, you cannot use the Upsert
copy method to update the external table directly.
**
in place of filename.Sample external table script:
CREATE EXTERNAL TABLE [dbo].[external_table] (
[PK] nvarchar(4000),
[name] nvarchar(4000),
[ingestion_time] nvarchar(4000)
)
WITH (
LOCATION = '<folder-name>/**',
DATA_SOURCE = <datasource-name>,
FILE_FORMAT = <fileformatname>
)
This script will make sure to combine all the data under that folder.
Example, File1 data:
PK,name,ingestion_time
1,Karikala,2023-05-01
4,Kalyani,2023-05-01
7,Sindhu,2023-05-01
File2 data:
PK,name,ingestion_time
1,Aadhi,2023-06-01
Here, File2 data has the updated record for PK=1. When the query select * from external_table
is executed, all four records will be displayed. Thus, Create a view on top of this external table to display only the latest record for each primary key.
Sample Query:
with cte as(
SELECT *,RANK() over (partition by PK order by [ingestion_time] desc) as Rank FROM [dbo].[external_table])
select PK,name,[ingestion_time] from cte where rank=1
PK | name | ingestion_time |
---|---|---|
1 | Aadhi | 2023-06-01 |
2 | Kalyani | 2023-05-01 |
3 | Sindhu | 2023-05-01 |
This query returns the rows with the highest ingestion_time
value for each PK
value in the external_table
.