azure-data-factoryparquetazure-synapseexternal-tables

Synapse CETAS from parquet file with columns definition is failing


In a Synapse pipeline, I'm trying to use a CETAS script activity from a parquet file that I generate before (from an Azure SQL database). The source database has some tables containing LOB columns (xml type), so the default varchar(8000) type isn't sufficient for some values. I tried to define the columns and types:

CREATE EXTERNAL TABLE myTable ([column1] int, [column2] varchar(MAX))
WITH (
        LOCATION = 'myLocation',
        DATA_SOURCE = mySource,
        FILE_FORMAT = Parquet
) 
AS
SELECT * FROM OPENROWSET(BULK 'myParquetFile.parquet',
    FORMAT='PARQUET') AS f;

My actual table has 24 columns (not just 2 as in the above script), but I get this message when running the script:

{
    "errorCode": "2011",
    "message": "Different number of columns in CREATE TABLE or CREATE EXTERNAL TABLE and SELECT query.",
    "failureType": "UserError",
    "target": "myScriptName",
    "details": []
}

When checking with a parquet file viewer, it does contain 24 columns, exactly as the column definition I pass in the CETAS script. Is there a kind of "hidden" column in a parquet file? Should I replace the "SELECT *" by the exact list of the 24 columns?

thanks for your help!


Solution

  • I finally tweaked the source parquet file (step before the external table creation). The columns with LOB were containing xml code, that I did not necessarily require. I truncated them to remain within the 8000 chars limit (and use the CETAS statement without having to define the columns schemas) In the copy activity, I used the below statement within the SELECT of the source: LEFT(CONVERT(NVARCHAR(MAX),[myColumnToTruncate]),8000) AS [myColumnToTruncate] This is not perfect as the data is not kept entirely, but this fitted my requirement. Thanks to @RakeshGovindula and @Bhavani for taking the time to answer!