Thursday, March 29, 2012

ETL Delta: Pulling Huge Data - Right Approach?

Hi all,

In the course of building an ETL tool, we are in a situation where a table has to be loaded on an incremental basis. On the first run, all the records (approximately 100 lakh, i.e. 10 million) have to be loaded. From the next run onwards, only the records that were updated since the last run of the package, or newly added, are to be pulled from the source database. One idea we had was to have two OLE DB Source components: in the first, get the records that were updated or newly added (since we have update columns in the DB, getting them is fairly simple); in the second, load all the records from the destination. Both would then feed a Merge Join, with a Conditional Split further down the pipeline to handle the updates and inserts.
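Roughly, the delta source query we have in mind would be something like the following; the table, column and log names here are only placeholders, since the real audit columns depend on the source schema:

-- placeholder names: SourceTable, LastModifiedDate, ETL_RunLog
DECLARE @LastRunTime DATETIME
-- watermark taken from the previous successful package run
SELECT @LastRunTime = MAX(LastSuccessfulRun) FROM dbo.ETL_RunLog
-- pull only rows inserted or updated since that run
SELECT KeyColumn, Col1, Col2, LastModifiedDate
FROM dbo.SourceTable
WHERE LastModifiedDate > @LastRunTime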

Now the question is: how slow is this going to be? Will there be a case where the source DB returns records quickly, but the Merge Join stalls or fails while waiting for all the records from the destination?

What might be the ideal way to go about this scenario? Please advise...

Thanks in advance.

Or maybe just use a Lookup component to check whether each row already exists, and redirect the error (non-matching) rows as new records to the destination. Sounds good?

|||

MShetty wrote:

Or maybe just use a Lookup component to check whether each row already exists, and redirect the error (non-matching) rows as new records to the destination. Sounds good?

The Lookup component might be the way to go; at least it has been the most popular approach for most scenarios. However, there have been several interesting discussions around this issue.

If you decide to use Lookup transforms, be sure to understand how that component uses memory resources. Also, be aware that the Lookup transform is case sensitive when 'matching' the columns in the pipeline against the lookup table.

Jamie has an interesting article on his blog; Greg has also written about an alternative approach using a Script component:

http://blogs.conchango.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx

Using two OLE DB Source components and a Merge Join would, I think, be slower; but you can test it and gain a better understanding of that.

|||

Rafael Salas wrote:

The Lookup component might be the way to go; at least it has been the most popular approach for most scenarios. However, there have been several interesting discussions around this issue.

If you decide to use Lookup transforms, be sure to understand how that component uses memory resources. Also, be aware that the Lookup transform is case sensitive when 'matching' the columns in the pipeline against the lookup table.

It looks like memory usage is going to be an issue with the Lookup, especially when the record count is this huge.

|||

Maybe it is not an issue at all; my advice on that: test it!

Make sure you provide a query in the Lookup transform with only the columns to be used for the lookup operation, so the result set is as small as possible column-wise.
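For example, instead of pointing the Lookup at the whole destination table, narrow the reference query down to the business key plus whatever columns you compare for changes; the names below are only placeholders:

-- placeholder names; select only the key and comparison columns, never SELECT *
SELECT KeyColumn, LastUpdatedDate
FROM dbo.DestinationTable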

|||

Rafael Salas wrote:

Maybe it is not an issue at all; my advice on that: test it!

Make sure you provide a query in the Lookup transform with only the columns to be used for the lookup operation, so the result set is as small as possible column-wise.

I strongly agree. The Lookup transform performs very well if you use it right.

|||

I did some testing with 100,000+ rows in my forum post Merge Join vs. Lookup vs. Custom Script.

This should give you an overview - but please do your own testing before going live, as the performance may vary across configurations.

|||

Thanks for all the valuable inputs. I think the Lookup is not the fear; it's the components down the pipeline, the OLE DB Command, that I am worried about. If it has to update a lot of rows, it is going to hit the performance big time. Sad

|||

TheViewMaster wrote:

I did some testing with 100,000+ rows in my forum post Merge Join vs. Lookup vs. Custom Script.

This should give you an overview - but please do your own testing before going live, as the performance may vary across configurations.

Amen to that. It's no good asking on here - the only person who can provide a definitive answer is yourself.

-Jamie

|||

MShetty wrote:

Thanks for all the valuable inputs. I think the Lookup is not the fear; it's the components down the pipeline, the OLE DB Command, that I am worried about. If it has to update a lot of rows, it is going to hit the performance big time.

Yup - updating a large number of rows can take forever.
I also came to the conclusion that with a large number of records the OLE DB Command object won't perform well. The solution I use is to load the data that needs to be updated into a staging table, and then use an Execute SQL task to update,
e.g.
UPDATE a1
SET a1.Field = a2.Field
FROM Prod_Table a1
JOIN Staging_Table a2 ON a1.PrimaryKey = a2.PrimaryKey

|||

TheViewMaster wrote:

MShetty wrote:

Thanks for all the valuable inputs. I think the Lookup is not the fear; it's the components down the pipeline, the OLE DB Command, that I am worried about. If it has to update a lot of rows, it is going to hit the performance big time.

Yup - updating a large number of rows can take forever.
I also came to the conclusion that with a large number of records the OLE DB Command object won't perform well. The solution I use is to load the data that needs to be updated into a staging table, and then use an Execute SQL task to update,
e.g.
UPDATE a1
SET a1.Field = a2.Field
FROM Prod_Table a1
JOIN Staging_Table a2 ON a1.PrimaryKey = a2.PrimaryKey

The approach ViewMaster is describing is exactly what I do in such cases. OLE DB Commands are simply slow by nature; the command gets executed once per row, so it is just not adequate for a large number of rows. You can use a Lookup to separate your inserts from updates: the insert pipe can go directly to the destination table, and the update pipe to a stage table that is then used back in the control flow in an Execute SQL task (a single set-based update).
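In the control flow that usually means one Execute SQL task to empty the stage table before the data flow, and another one after it to apply the update in a single statement; the table and column names below are placeholders along the lines of ViewMaster's example:

-- before the data flow: start from an empty stage table
TRUNCATE TABLE Staging_Table

-- after the data flow: one set-based update instead of row-by-row OLE DB Commands
UPDATE a1
SET a1.Field = a2.Field
FROM Prod_Table a1
JOIN Staging_Table a2 ON a1.PrimaryKey = a2.PrimaryKey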

|||

Thanks for the inputs... Will definitely post the test results once I test out the packages. Smile
