databricksdelta-lakedata-lakedata-lakehouse

How to handle CSV files in the Bronze layer without the extra layer


If my raw data is in CSV format and I would like to store it in the Bronze layer as Delta tables then I would end up with four layers like Raw+Bronze+Silver+Gold. Which approach should I consider?


Solution

  • A bit of an open question, however with respect to retaining the "raw" data in CSV I would normally recommend this as storage of these data is usually cheap relative to the utility of being able to re-process if there are problems or for purpose of data audit/traceability.

    I would normally take the approach of compressing the raw files after processing and perhaps tar-balling the files. In addition moving these files to colder/cheaper storage.