I'm working on a project where I need to extract data from an EDI file and load it into a database. I'm using Cozyroc’s “EDI Source” component for this.
The EDI Source produces three data outputs, and I insert each output into a different table. Additionally, after loading the data, I need to run some queries and validations using these tables.
I'm facing two main issues:
Insertion order: When inserting data from the three outputs into three different tables, the inserts happen in parallel. This becomes a problem because the third table has a foreign key that depends on the data already being present in the second table. Since SSIS processes these in parallel within the same Data Flow Task, I can't control the order to ensure the second table is loaded before the third.
Validation order: In the next step, I need to read from one of these tables and perform multiple validations. If I use a single Data Flow Task with a source pulling data from the table and then use a Multicast to perform several validations in parallel, I lose control over the order. Ideally, I'd like to execute the first validation, then the second, and so on, using the same dataset without querying the table multiple times.
My questions:
Is there a way to enforce a specific order of execution for inserts or validations within a single Data Flow Task, so that one flow finishes before the next starts?
Or is the only solution to split this into multiple Data Flow Tasks in the Control Flow and connect them with precedence constraints?
I'd like to avoid extracting or querying the same data multiple times just to control the sequence.
Use the control flow: split the work into multiple Data Flow Tasks and connect them with precedence constraints. SSIS does use internal buffers, but it does not read the whole source data set into memory and then push it to destinations; if it did, large data transfers would consume enormous amounts of memory and/or spill to disk. It's better to think of SSIS data flows as streams of data.
If your source data set is relatively small, you could read the data into an object variable in SSIS itself and then write from that object. But if the data set really is that small, you might as well just read it repeatedly.
Alternatively, if you haven't put a lot of work into transforms within the SSIS data flow, it might be better to treat SSIS as a "dumb pipe", without much logic, which just gets the original data set into a staging table on the destination. Then perform the transform from the staged data into your three final destination tables locally on the destination database, e.g. by writing a stored procedure and calling it from SSIS with an Execute SQL Task. There's a sketch of this below.
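A minimal sketch of what that procedure might look like, assuming a single staging table `dbo.EdiStaging` with a `RecordType` column distinguishing the three EDI outputs; all table and column names here are hypothetical placeholders for your actual schema:

```sql
-- SSIS bulk-loads the raw rows into dbo.EdiStaging, then an
-- Execute SQL Task calls this procedure. Because the inserts run
-- sequentially inside one transaction, the foreign key ordering
-- problem disappears.
CREATE PROCEDURE dbo.LoadFromEdiStaging
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- roll everything back on any error

    BEGIN TRANSACTION;

    INSERT INTO dbo.HeaderTable (DocumentId, SenderId)
    SELECT DocumentId, SenderId
    FROM dbo.EdiStaging
    WHERE RecordType = 1;

    -- Parent rows go in before child rows...
    INSERT INTO dbo.OrderTable (OrderId, DocumentId)
    SELECT OrderId, DocumentId
    FROM dbo.EdiStaging
    WHERE RecordType = 2;

    -- ...so this table's foreign key to OrderTable is always satisfied.
    INSERT INTO dbo.LineTable (LineId, OrderId, Quantity)
    SELECT LineId, OrderId, Quantity
    FROM dbo.EdiStaging
    WHERE RecordType = 3;

    COMMIT TRANSACTION;
END;
```

Your validations can then be run the same way, as additional statements in the procedure (or a second procedure), in whatever order you like, without re-reading the source.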
Another approach people sometimes use in data warehousing scenarios is to disable foreign keys while the data is loaded and then re-enable them afterwards (sketched below). This is not free, of course: when the keys are re-enabled, the database engine has to read the data required to validate them. On the other hand, in a data warehouse scenario foreign keys aren't all that important in the first place, since the source system is responsible for maintaining referential integrity. If the data written to your destination is immutable, you could drop the foreign keys entirely.
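A minimal T-SQL sketch of that pattern, with hypothetical table and constraint names:

```sql
-- Disable the child table's foreign key before the load, so the
-- three data flow inserts can run in parallel in any order:
ALTER TABLE dbo.LineTable NOCHECK CONSTRAINT FK_LineTable_OrderTable;

-- ... run the Data Flow Task / bulk load here ...

-- Re-enable WITH CHECK so the engine re-validates the existing rows
-- and marks the constraint as trusted again (plain CHECK CONSTRAINT
-- would re-enable it without validating, leaving it untrusted):
ALTER TABLE dbo.LineTable WITH CHECK CHECK CONSTRAINT FK_LineTable_OrderTable;
```

Both ALTER statements can be run from Execute SQL Tasks placed before and after the Data Flow Task in the control flow.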