Thursday, March 29, 2012

ETL & BINARY_CHECKSUM?

We are currently populating our WH/DM using only the tools available in SQL2K
(i.e. no Syncsort etc.). Some of our modules (controlled via DTS) understandably
take a fair amount of time to process, and I was wondering if there was a trick
I was missing to speed things up. Example:
In a data acquisition step for a Junk dimension (containing various state
flags) I extract CustomerURN and 15 flags into a work table. I then select the
DISTINCT combinations of the 15 flags into a table that will eventually be the
dimension for publication. I now need a cross-reference table so that other
processes can link on CustomerURN and pick up the dimension key. The obvious
SQL method is simply to join the tables on the 15 columns and output the xref
table - which takes a fair amount of time, but it does work.
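In rough T-SQL terms the current approach looks something like this (table and
column names are illustrative, and only three of the 15 flags are shown):

    -- 1. Build the junk dimension from the distinct flag combinations.
    --    JunkKey is assumed to be an IDENTITY surrogate key.
    INSERT INTO dbo.DimJunkFlags (Flag01, Flag02, Flag03 /* ...Flag15 */)
    SELECT DISTINCT Flag01, Flag02, Flag03 /* ...Flag15 */
    FROM dbo.WorkCustomerFlags;

    -- 2. Build the cross-reference table by joining back on all flag
    --    columns to pick up the dimension key.
    INSERT INTO dbo.XrefCustomerJunk (CustomerURN, JunkKey)
    SELECT w.CustomerURN, d.JunkKey
    FROM dbo.WorkCustomerFlags AS w
    JOIN dbo.DimJunkFlags AS d
      ON  w.Flag01 = d.Flag01
      AND w.Flag02 = d.Flag02
      AND w.Flag03 = d.Flag03; -- ...and so on for all 15 flags

(If any of the flag columns can be NULL, the equality join would also need
ISNULL wrapping or explicit NULL handling.)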
Finally to the question: is there a quicker way of doing this? I have looked
at using BINARY_CHECKSUM but believe that this will not guarantee a unique
value that can be used for the match. Anybody got any ideas how to improve
on the method I am using?
Thanks,
Paul

Hi Paul,
1. Checksums etc. - a common problem.
We have released free utilities for Win2000+, and one of them is a utility to
generate deltas given two versions of a file. You give it yesterday's file and
today's file (or just the current version and the previous version) and it
spits out the Insert/Update/Delete records that must have been applied to the
previous version to get the current version. It is fully null aware if you
have nulls in your source data. We also publish the source code for our free
software, and it is on the web site too. (See
http://www.instantbi.com/Default.aspx?tabid=30 and the section on Instant Data
Warehouse utilities. Anyone is welcome to use our free software - the
utilities are things we have developed over many years for our own projects,
so we think they are still useful.)
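The utility itself works on flat files, but if you wanted to do the same delta
comparison purely in T-SQL, a rough sketch (staging tables and column names
purely illustrative, and not how the utility actually works) would be:

    -- Assumes yesterday's and today's extracts are staged in two tables
    -- keyed on CustomerURN.

    -- Inserts: keys present today but not yesterday.
    SELECT t.*
    FROM dbo.Stg_Today AS t
    LEFT JOIN dbo.Stg_Yesterday AS y ON y.CustomerURN = t.CustomerURN
    WHERE y.CustomerURN IS NULL;

    -- Deletes: keys present yesterday but not today.
    SELECT y.*
    FROM dbo.Stg_Yesterday AS y
    LEFT JOIN dbo.Stg_Today AS t ON t.CustomerURN = y.CustomerURN
    WHERE t.CustomerURN IS NULL;

    -- Updates: keys present in both where any attribute differs.
    -- ISNULL with a sentinel (here -1, assuming numeric flags) is a crude
    -- way to make the comparison null aware.
    SELECT t.*
    FROM dbo.Stg_Today AS t
    JOIN dbo.Stg_Yesterday AS y ON y.CustomerURN = t.CustomerURN
    WHERE ISNULL(t.Flag01, -1) <> ISNULL(y.Flag01, -1)
       OR ISNULL(t.Flag02, -1) <> ISNULL(y.Flag02, -1); -- ...and so on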
Checksums (CRCs) are frequently used, but I choose not to use them because
there is still a chance of losing a change. And my feeling is that even though
the odds are 1 in 4 billion, I'd rather be sure. ;-)
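That said, if you do use BINARY_CHECKSUM for the xref join in your example,
one way to stay on the safe side is to use the checksum only to narrow the
join and still compare the actual columns, so a collision can never attach the
wrong key - a rough sketch, again with illustrative names:

    -- Checksum narrows the candidate rows; the flag columns decide the match.
    INSERT INTO dbo.XrefCustomerJunk (CustomerURN, JunkKey)
    SELECT w.CustomerURN, d.JunkKey
    FROM dbo.WorkCustomerFlags AS w
    JOIN dbo.DimJunkFlags AS d
      ON  BINARY_CHECKSUM(w.Flag01, w.Flag02 /* ...Flag15 */)
        = BINARY_CHECKSUM(d.Flag01, d.Flag02 /* ...Flag15 */)
      AND w.Flag01 = d.Flag01
      AND w.Flag02 = d.Flag02; -- ...and the remaining flags

    -- Persisting the checksum in an indexed column on both tables is what
    -- actually buys the speed; computing it inline as above only reduces
    -- comparison work, not I/O.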
2. Processing speeds.
Well, it depends on your tools and how you design your jobs. We have fee-based
software in this area. The biggest speed-up you can get is using files rather
than the database - files vs. database is about 10 to 1. When we need to, we
use files as much as possible for ETL. Life is much better now that machines
are faster, but files are still the way to go if you need the speed. Also, for
the really expensive processing we typically load data into memory and use
binary searches. In our fee-based software we even do things like load data
into memory-mapped IO and have many processes share that memory, which gives
us great speed.
In our free software we do things like reformat the data so it can be loaded
via the database loaders. We also run through the input file and delete any
rows that already exist in the target table, so that the loader can be certain
no constraint clashes will happen when it runs.
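In SQL2K terms, a rough equivalent of those two steps would look something
like the following (our utilities do this against the flat file before the
loader runs; the sketch below stages first and scrubs afterwards, and all
names are illustrative):

    -- 1. Bulk-load the reformatted flat file into an empty staging table.
    BULK INSERT dbo.Stg_Customer
    FROM 'C:\etl\customer.dat'
    WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);

    -- 2. Remove staged rows that already exist in the target, so the final
    --    insert cannot hit a primary key or constraint clash.
    DELETE s
    FROM dbo.Stg_Customer AS s
    JOIN dbo.Customer AS c ON c.CustomerURN = s.CustomerURN;

    -- 3. Load the survivors into the target.
    INSERT INTO dbo.Customer (CustomerURN, Flag01, Flag02 /* ... */)
    SELECT CustomerURN, Flag01, Flag02 /* ... */
    FROM dbo.Stg_Customer;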
These are all things that you have to write into DTS. I'm not sure what is
included in the next version of DTS - I'm not keen to load the 2005 betas onto
my laptop. ;-)
Hope this helps...
Peter Nolan
www.peternolan.com
