Tags: database, reporting-services, data-warehouse

Strategies for populating a Reporting/Data Warehouse database


For our reporting application, we have a process that aggregates several databases into a single 'reporting' database on a nightly basis. The schema of the reporting database is quite different from that of the separate 'production' databases we are aggregating, so a good amount of business logic goes into how the data is aggregated.

Right now this process is implemented by several stored procedures that run nightly. As we add more detail to the reporting database, the logic in the stored procedures grows more fragile and unmanageable.

What are some other strategies that could be used to populate this reporting database?

I'm looking for any new ideas or additional thoughts on the above. Thanks!


Solution

  • Our general process is:

    1. Copy data from source table(s) into tables with exactly the same structure in a loading database
    2. Transform data into staging tables, which have the same structure as the final fact/dimension tables
    3. Copy data from the staging tables to the fact/dimension tables

    SSIS is good for step 1, which is more or less a 1:1 copy process, with some basic data type mappings and string transformations.
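    If a full SSIS package feels like overkill for a given table, the same 1:1 copy can also be done in plain T-SQL over a linked server. A minimal sketch, with made-up server, database and table names:

        -- Hypothetical step 1 copy: truncate-and-reload a loading table that
        -- mirrors the source table's structure, pulling rows across a linked
        -- server via its four-part name.
        TRUNCATE TABLE LoadingDB.dbo.Orders;

        INSERT INTO LoadingDB.dbo.Orders (OrderID, CustomerID, OrderDate, Amount)
        SELECT OrderID, CustomerID, OrderDate, Amount
        FROM   PRODSERVER.ProductionDB.dbo.Orders;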

    For step 2, we use a mix of stored procs, .NET and Python. Most of the logic is in procedures, with things like heavy parsing in external code. The major benefit of pure TSQL is that very often transformations depend on other data in the loading database: for example, using mapping tables in a SQL JOIN is much faster than doing a row-by-row lookup in an external script, even with caching. Admittedly, that's just my experience, and procedural processing might be better for your data set. A sketch of the mapping-table pattern follows below.
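    To make the mapping-table point concrete, here is a minimal sketch of a step 2 transform in T-SQL. All table and column names are invented for illustration; the idea is simply one set-based JOIN against a mapping table instead of per-row lookups.

        -- Hypothetical step 2 transform: fill a staging table shaped like the
        -- final dimension, resolving region codes through a mapping table
        -- with a single set-based JOIN rather than a row-by-row lookup loop.
        TRUNCATE TABLE LoadingDB.dbo.stg_DimCustomer;

        INSERT INTO LoadingDB.dbo.stg_DimCustomer (CustomerKey, CustomerName, RegionName)
        SELECT c.CustomerID,
               c.CustomerName,
               COALESCE(m.RegionName, 'Unknown')  -- default for unmapped codes
        FROM   LoadingDB.dbo.Customers AS c
        LEFT JOIN LoadingDB.dbo.map_Region AS m
               ON m.RegionCode = c.RegionCode;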

    In a few cases we do have to do some complex parsing (of DNA sequences) and TSQL is just not a viable solution. So that's where we use external .NET or Python code to do the work. I suppose we could do it all in .NET procedures/functions and keep it in the database, but there are other external connections required, so a separate program makes sense.

    Step 3 is a series of INSERT ... SELECT statements: it's fast.
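    As a concrete (invented) illustration of one such statement, inserting only rows whose keys are not already in the fact table:

        -- Hypothetical step 3 load: move staged rows into the warehouse fact
        -- table, skipping key combinations that are already present.
        INSERT INTO WarehouseDB.dbo.FactSales (DateKey, CustomerKey, ProductKey, Amount)
        SELECT s.DateKey, s.CustomerKey, s.ProductKey, s.Amount
        FROM   LoadingDB.dbo.stg_FactSales AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   WarehouseDB.dbo.FactSales AS f
                           WHERE  f.DateKey     = s.DateKey
                             AND  f.CustomerKey = s.CustomerKey
                             AND  f.ProductKey  = s.ProductKey);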

    So all in all, use the best tool for the job, and don't worry about mixing things up. An SSIS package (or several packages) is a good way to link together stored procedures, executables and whatever else you need to run, so you can design, execute and log the whole load process in one place. If it's a huge process, you can use subpackages.

    I know what you mean about TSQL feeling awkward (actually, I find it more repetitive than anything else), but it is very, very fast for data-driven operations. So my feeling is, do data processing in TSQL and string processing or other complex operations in external code.