I've recently been exploring the data lake world, and I'm planning to set up a data lake with Azure Data Lake (ADL). One of the things I'm not sure about is how a data lake is supposed to track changes over time and handle different versions from a source.
I've come across sites that claim a data lake serves data as is; others state that data should be timestamped, or that the folder structure should reflect a timestamp.
Anyway, any best practices?
Cheers!
Often there are different zones in a data lake. Here is a good explanation of common zones. In the Raw zone, data is typically unchanged from the source. It might be incremental loads containing only the records that changed since the last load, or it could be a full copy of the source entity. That is typically where you will see timestamped folders for each entity. As an example, you might have the following folder structure.
Raw Data
    Organizational Unit
        Subject Area
            Original Data Source
                Object
                    Date Loaded
                        File(s)
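As a minimal sketch of that convention (the zone, folder, and entity names here are made-up examples, not part of any ADL API), you could build the date-stamped folder path for a load in Python. Using ISO dates for the Date Loaded folder means the folders sort chronologically:

```python
from datetime import date
from pathlib import PurePosixPath

def raw_load_path(org_unit: str, subject_area: str, source: str,
                  obj: str, load_date: date) -> PurePosixPath:
    """Build the Raw-zone folder path for one load of one entity."""
    return PurePosixPath(
        "raw",                  # zone
        org_unit,               # e.g. a business division
        subject_area,
        source,                 # the original data source
        obj,                    # the entity/object name
        load_date.isoformat(),  # e.g. 2023-05-01; sorts chronologically
    )

print(raw_load_path("sales", "orders", "erp", "order_lines", date(2023, 5, 1)))
# raw/sales/orders/erp/order_lines/2023-05-01
```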
Users typically don't query the Raw zone; it acts as a historical archive of the data.
Users will often query the Curated zone instead. This zone usually contains a subset of data from Raw that has been transformed to meet user needs. Often it contains a copy of what the entity currently looks like, omitting older versions, either because that is what analysts and data scientists want to see, or because that is what needs to feed another application that sources data from the lake. You can find a good explanation of Raw and Curated zones here.
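One common way to derive that "current" view from the timestamped Raw loads is to keep only the latest version of each business key. A minimal pandas sketch (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical union of incremental loads from Raw; each row carries its load date.
loads = pd.DataFrame({
    "customer_id": [1, 2, 1],
    "name": ["Ann", "Bob", "Ann Lee"],
    "loaded_at": pd.to_datetime(["2023-05-01", "2023-05-01", "2023-06-01"]),
})

# Curated "current" view: the latest record per key wins.
current = (loads.sort_values("loaded_at")
                .drop_duplicates("customer_id", keep="last"))
print(current)  # customer 1 now shows "Ann Lee"
```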
So it's possible that you would have both timestamped data that tracks changes and a current snapshot. What you have probably read is that a data lake should allow you to recreate what an entity looked like at a specific point in time, and that can be accomplished in Raw. Other zones then cater to your organization's data needs, whether that is the current state, all history, or snapshots as of specific dates.
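To make that point-in-time idea concrete: with ISO-dated Date Loaded folders as above, recreating an entity as of a date is just folder selection. A sketch, assuming local paths for simplicity: for full-copy loads you take the single latest folder on or before the date, while for incremental loads you replay every folder up to it, in order.

```python
from datetime import date
from pathlib import Path

def loads_up_to(object_dir: Path, as_of: date) -> list[Path]:
    """All Date Loaded folders needed to replay incremental loads as of a date."""
    return sorted(
        d for d in object_dir.iterdir()
        if d.is_dir() and date.fromisoformat(d.name) <= as_of
    )

def full_copy_as_of(object_dir: Path, as_of: date) -> Path:
    """For full-copy loads: the single latest load on or before the date."""
    return loads_up_to(object_dir, as_of)[-1]
```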