I am trying to improve the performance of a SSIS Package.
One thing I got startted with is to filter the reference table of the Lookups. Until now, I was using a table as a reference table for that lookup.
First improvment was to change the table to a SQL clause that is selecting just the columns I need from that table.
Next, I want to load in this table just the records I know I'll use for sure. If I'm maintaining it in this state, I will get to load 300 000 lines or more (huge lines with binary content of around 500 kb each) and use just around 100 of them.
I would put some filters in the SQL query that sets the reference table of the lookup, BUT, in that filter I need to use ALL the ids of the rows loaded in my OLE DB source.
Is there any way to do this?
I thought of loading each row at a time using a OleDB Command instead of a Lookup, but except of beeing time consuming, I might get to load the same thing 100 times for 100 different rows, when I could just load it once in the lookup and use it 100 times...
Enableing the cache still would be another option that still doesn't sound very good, because it would slow us down - we are already terribly slow.
Any ideeas are greatly appreaciated.
One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow and then use it in your lookup (with a join) in a later data flow (you probably have to defer validation).
In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything on the way there. Then, when all these conversions were successful, then we handle creating new dimensions and then the facts linking to the dimensions.