In a Synapse pipeline, I'm trying to use a CETAS script activity from a parquet file that I generate before (from an Azure SQL database). The source database has some tables containing LOB columns (xml type), so the default varchar(8000) type isn't sufficient for some values. I tried to define the columns and types:
CREATE EXTERNAL TABLE myTable ([column1] int, [column2] varchar(MAX))
WITH (
LOCATION = 'myLocation',
DATA_SOURCE = mySource,
FILE_FORMAT = Parquet
)
AS
SELECT * FROM OPENROWSET(BULK 'myParquetFile.parquet',
FORMAT='PARQUET') AS f;
My actual table has 24 columns (not just 2 as in the above script), but I get this message when running the script:
{
"errorCode": "2011",
"message": "Different number of columns in CREATE TABLE or CREATE EXTERNAL TABLE and SELECT query.",
"failureType": "UserError",
"target": "myScriptName",
"details": []
}
When checking with a parquet file viewer, it does contain 24 columns, exactly as the column definition I pass in the CETAS script. Is there a kind of "hidden" column in a parquet file? Should I replace the "SELECT *" by the exact list of the 24 columns?
thanks for your help!
I finally tweaked the source parquet file (step before the external table creation). The columns with LOB were containing xml code, that I did not necessarily require. I truncated them to remain within the 8000 chars limit (and use the CETAS statement without having to define the columns schemas) In the copy activity, I used the below statement within the SELECT of the source: LEFT(CONVERT(NVARCHAR(MAX),[myColumnToTruncate]),8000) AS [myColumnToTruncate] This is not perfect as the data is not kept entirely, but this fitted my requirement. Thanks to @RakeshGovindula and @Bhavani for taking the time to answer!