sqlazureblobpolybase

using Polybase and Stored procedure for updating dbo.table from several external tables


I need some help in this.. I have 3 external tables:

create external table ext.titanic
(
PassengerId INT,
Pclass INT,
Pname VARCHAR(100),
Gender VARCHAR(20),
Ticket VARCHAR(30),
Cabin VARCHAR(30)
)
WITH (LOCATION='/titanic.csv',
    DATA_SOURCE = blob1,  
    FILE_FORMAT = TextFileFormat1,
);

create external table ext.titanic2
(
Pclass INT,
Pname VARCHAR(100)
)
WITH (LOCATION='/titanic2.csv',
    DATA_SOURCE = blob1,  
    FILE_FORMAT = TextFileFormat1,
);

create external table ext.titanic3
(
PassengerId INT,
Pname VARCHAR(100),
)
WITH (LOCATION='/titanic3.csv',
    DATA_SOURCE = blob1,  
    FILE_FORMAT = TextFileFormat1,
);

and i have dbo table created:

CREATE TABLE dbo.titanic
WITH
(
    DISTRIBUTION = ROUND_ROBIN
)
AS

SELECT 

titanic.PassengerId,
titanic.Pclass,
titanic.Pname,
titanic.Gender,
titanic.Ticket,
titanic.Cabin,
titanic3.PassengerId as T3_PassengerId,
titanic3.Pname as T3_Pname,
titanic2.Pclass as T2_Pclass,
titanic2.Pname as T2_Pname

FROM ext.titanic
FULL JOIN ext.titanic2 ON ext.titanic2.PassengerId=ext.titanic.PassengerId
FULL JOIN ext.titanic3 ON ext.titanic3.Pclass=ext.titanic.Pclass;

I have to join them and update the dbo.titanic with a stored procedure

do i need additional ext.table to join them there and after that to merge it with dbo.titanic?

or there is a easy and simple way to do that?

also i need more help for the dbo.titanic and joins..

there are more unique PassengerIds in titanic3 than in titanic, but i need all PassengerIds from the 2 tables to be in one column.. same for Pclass from both tables... that is bugging me

just for reference - titanic table has around 100000 rows(800 unique PassengerIDs) and titanic2 and titanic3 have 5000 unique (total)rows for PassengerId and Pclass.

The final table must look like dbo.titanic but without T3_PassengerId and T2_Pclass as they must be merged somehow in the PassengerId and Pclass.

I lost a lot of time looking for something like that, but didn't find anything close enough.

This is the best I could find:

https://www.sqlservercentral.com/articles/access-external-data-from-azure-synapse-analytics-using-polybase

and I want to thank the guy that wrote this, but to use it, I have 3 main issues :

  1. there are no 3 external tables with different columns that need to be joined
  2. there is no update so this can be used after the creation of the tables.(as I understand update cant be used with external tables)
  3. there is no stored procedure used for this update.

Can I use something like this

INSERT INTO table1(column1, column2,...) SELECT column1, column2,... FROM table2 WHERE condition( compare value in table1 <> value in table 2)

thanks in advance


Solution

  • I got this...

    may be not the most elegant way but it works.. using left join, with additional stg.titanic table (same as dbo.titanic) that combines the 3 external tables.. then merge stg. and dbo. tables..

    MERGE dbo.titanic AS [Target]
    USING (SELECT 
     column1,2,3
      UpdateTime
     from stg.titanic) AS [Source] 
     ON [Target].PassengerId = [Source].PassengerId 
     and [Target].Pclass = [Source].Pclass 
     and [Target].Pname = [Source].Pname --- specifies the condition
    WHEN MATCHED THEN
     UPDATE SET [Target].UpdateTime = GetDate()  
    WHEN NOT MATCHED THEN
    INSERT (column1,2,3 --- when one of the 3 conditions is not met then insert new row
     UpdateTime)
    VALUES (
      [Source].column1,2,3
      [Source].UpdateTime
    );
    

    if someone knows a better way it will be good to share with us

    Thanks.