azure azure-data-factory azure-databricks parquet azure-data-lake-gen2

Data Factory Parquet Incorrectly Ingesting Decimals


I am working on an Azure Data Factory pipeline and noticed that when I use a Parquet sink to ADLS Gen2, certain decimals are truncated and no longer match the data source. From ADLS, the data is ingested into Databricks for analytics, which is where the bug was first noticed.

Example:

  • Original data source: 861.099901397946075
  • In Parquet: 86199901397946075
  • In Databricks: 86.199901397946075

The data type in the data source is Decimal(35,15). When the data is saved as Parquet, the leading "0" of the fractional part appears to be removed, which shifts the remaining digits and offsets the decimal point.

I have also noticed that this does not happen to every decimal value I ingest, only to those whose fractional part starts with a zero.
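To make the shift concrete, here is a small Python sketch that reproduces the numbers from the example above. It only imitates the observed symptom; the "drop the leading zeros of the fractional part" step is an assumption about what happens, not a description of what ADF does internally, and the variable names are purely illustrative.

```python
from decimal import Decimal

SCALE = 15  # scale of the source column, Decimal(35,15)

source = Decimal("861.099901397946075")

# Split the value into its integer and fractional digit strings.
int_part, frac_part = str(source).split(".")

# A correct unscaled integer keeps all 15 fractional digits.
expected_unscaled = int(int_part + frac_part)

# Assumed failure mode (hypothetical): the fractional part loses its
# leading zeros before the digits are joined.
observed_unscaled = int(int_part + frac_part.lstrip("0"))

# Reading the shortened integer back with scale 15 moves the decimal point.
corrupted = Decimal(observed_unscaled).scaleb(-SCALE)

print(expected_unscaled)
print(observed_unscaled)
print(corrupted)
```

The shortened integer matches the value seen in Parquet, and applying the column's scale of 15 to it gives the value seen in Databricks. A value whose fractional part does not start with zero is unchanged by this step, which fits the observation that only some rows are affected.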

Has anyone experienced this and found a fix? Thanks.

What I tried: loading the data via a Parquet sink and then loading that data into Databricks, expecting results consistent with the data source. I have also tried Parquet with no compression and with different kinds of compression, with no success.

I have used a CSV sink instead of Parquet, and the data populated correctly. I prefer Parquet for my use case, though.


Solution

  • The cause was the decimal type, Decimal(35,15). Parquet in ADF only supports up to Decimal(28, ), so the data type had to be altered for the data to ingest properly.
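Before narrowing the type, it may be worth checking that the existing values still fit. The sketch below is one way to do that in Python; `fits_decimal` is a hypothetical helper, and the scale of 15 used with precision 28 is an assumption carried over from the source column, since the answer does not state the scale.

```python
from decimal import Decimal


def fits_decimal(value: Decimal, precision: int, scale: int) -> bool:
    """Return True if a finite value fits Decimal(precision, scale) without rounding."""
    _sign, digits, exponent = value.as_tuple()
    frac_digits = max(-exponent, 0)               # digits after the decimal point
    int_digits = max(len(digits) + exponent, 0)   # digits before the decimal point
    return frac_digits <= scale and int_digits <= precision - scale


# Value from the question: 3 integer digits and 15 fractional digits.
sample = Decimal("861.099901397946075")
print(fits_decimal(sample, 28, 15))  # assumed target type, scale not confirmed

# A made-up value with 20 integer digits, for contrast.
wide = Decimal("12345678901234567890.5")
print(fits_decimal(wide, 35, 15))
print(fits_decimal(wide, 28, 15))
```

Under these assumptions the sample value from the question fits the narrower type, while a value with more than 13 integer digits would not and would need a different scale or further handling.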