We have multiple source systems sending data. Ideally we should capture the raw data coming from the sources and keep it in the data lake, then process the raw data into a structured format. Users can then update this data via a front-end application.
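For context, the raw-to-structured step could look roughly like the PySpark sketch below. This is only illustrative; the paths, file format, and column names are placeholders, not our actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-structured").getOrCreate()

# Land the raw source files as-is in the lake (path is hypothetical).
raw = spark.read.json("/lake/raw/source_a/2024-01-01/")

# Process the raw feed into a typed, structured layout for downstream use.
structured = (
    raw
    .select(
        F.col("id").cast("long").alias("record_id"),
        F.col("name"),
        F.to_timestamp("updated_at").alias("updated_at"),
    )
    .dropDuplicates(["record_id"])
)

# Write the structured copy back to a processed zone in the lake.
structured.write.mode("overwrite").parquet("/lake/processed/source_a/")
```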
I am thinking of putting an RDBMS on top of the processed data, then pulling the audit trails from the RDBMS back into the data lake and merging them with the processed data to create the final view for reporting (see the sketch below). Alternatively, the RDBMS could also be used for analytics.
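A minimal sketch of that merge step, assuming the audit trail has been exported from the RDBMS to the lake as one row per changed column (record_id, column_name, new_value, changed_at). The paths, the audit schema, and the single "name" column being overridden are all assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("reporting-view").getOrCreate()

# Processed data already in the lake, plus audit trails exported from the RDBMS.
processed = spark.read.parquet("/lake/processed/source_a/")
audit = spark.read.parquet("/lake/audit/source_a/")

# Keep only the latest audit entry per record and column.
latest = Window.partitionBy("record_id", "column_name").orderBy(F.col("changed_at").desc())
latest_audit = (
    audit
    .withColumn("rn", F.row_number().over(latest))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Take the user edits for one column and apply them on top of the processed data.
name_edits = (
    latest_audit
    .filter(F.col("column_name") == "name")
    .select("record_id", F.col("new_value").alias("edited_name"))
)

reporting_view = (
    processed
    .join(name_edits, on="record_id", how="left")
    .withColumn("name", F.coalesce("edited_name", "name"))
    .drop("edited_name")
)

reporting_view.write.mode("overwrite").parquet("/lake/reporting/source_a/")
```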
Or we could land all the data in the RDBMS first, apply the changes there, and then pull the data from the RDBMS into the data lake. But in that case bringing in a data lake doesn't make much sense.
Kindly suggest.
Thanks,
ADLA is NOT consumer-oriented, meaning you would not connect a front-end system to it. If the question is "what should we do?", I'm not sure anyone can answer that for you, but it sounds like you are on the right track.
What I can do is tell you what we do:
Pulling the data into Data Lake first and RDBMS second makes sense to me.
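As a rough sketch of what lake-first, RDBMS-second could look like: curate the data in the lake, then push the curated set into the database that backs the front end. The JDBC URL, table name, and credentials below are placeholders, not a recommendation for any particular database.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-to-rdbms").getOrCreate()

# Curated data lives in the lake first...
curated = spark.read.parquet("/lake/processed/source_a/")

# ...and is pushed to the RDBMS second, where the front-end application
# and its audit trail live (connection details are placeholders).
(
    curated.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/appdb")
    .option("dbtable", "app.source_a")
    .option("user", "app_user")
    .option("password", "***")
    .mode("overwrite")
    .save()
)
```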