csvazure-synapseexternal-tables

How to deal with columns containing commas while using comma as field terminator


I am creating an external table in Synapse serverless pool using data stored in a CSV file. I am using file format as DELIMITEDTEXT with Field Terminator as ",". Now one of my columns has values that have comma within themselves as well, so how do I deal with this ?

To specify, the column is called Description and contains message body from emails so there are commas present within the value.

I wasn't able to understand how to do this, as solutions recommend either enclosing the values within double quotes or using a different delimiter in the file itself. But I cannot change the source file.

Is it possible to create a data flow to replace the commas in a column value with spaces ?


Solution

  • When you have quote character in your csv file, you can define them while creating external file format.

     CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat1]
     WITH (
      FORMAT_TYPE = DELIMITEDTEXT , 
      FORMAT_OPTIONS ( FIELD_TERMINATOR = ',',
      STRING_DELIMITER='"',
      USE_TYPE_DEFAULT = FALSE ))
    

    String_delimiter is used for defining the quote character.

    When you don't have Quote character in your file, then you need to use dataflow to remove the , from description column. Below is the dataflow approach.

    Sample data used in this dataflow:

    id,stockplace,stock,id_rep,description
    1,A-100,40,1,asw,asw
    1,B-100,10,1,aa
    1,C-100,5,1,aa
    2,A-300,90,2,s
    

    In this input data, we need to remove the comma in description.

    1. Take the source transformation and read the input data with different delimiter other than comma , . So that all columns are treated as single column.

    1. Take the derive transformation and extract the first few columns (except description) from the input data and output them as a new column called left_col.

    Expression for left_Col: substringIndex({id,stockplace,stock,id_rep,description},',' , 4)

    1. Take another derive transformation and replace commas with spaces. Expression: concat(left_col,',',replace(dropLeft({id,stockplace,stock,id_rep,description},length(left_col)+1),',',' '))

    This will remove the comma and replaces with space.

    1. Take the select transformation to select only the required columns. Remove all the columns that are computed and used temporarily. In this case, except column left_col all are selected. As left_col is created as temporary column to remove all the extra commas in the original column

    1. Take the parse transformation and parse the output of the select transformation as a delimited file with a comma as the column delimiter. Then take the sink transformation and output the data to a file.

    enter image description here