
Re: Building fact tables in a data warehouse

From: jaromir nemec <jaromir_at_db-nemec.com>
Date: Sun, 23 Dec 2007 21:49:25 +0100
Message-ID: <02a501c845a5$4e327870$3e02a8c0@ADLA>


Hi Thomas,

> This worked OK for our ETL developers when we were dealing with 10,000 rows
> in the development database.
> Now we're working with a much larger source set and we're talking
> 100,000,000 rows. It doesn't work that well.
> Does anybody have experience or even a theoretical insight into a better way
> to do this?

One important thing in an ETL process is to know where the bottleneck is. From a bird's-eye view it can be:

a) in the data source
b) in the transformation
c) in the data target

Except for some cases, such as inefficient selects in the data source, indexes on the target or contention with other processes, in most cases the problem lies in the transformation. This is due to the row-by-row processing logic of the majority of ETL tools. A typical answer is to use parallelism, in principle starting more ETL sessions that work on partitioned sources and targets. This could indeed help, and with your row count it will probably be required. But there is another possibility: to shift some transformation logic into the data source.
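
For illustration only, the database-side counterpart of such a parallel load could look as follows; the table names and the degree of parallelism are invented placeholders:

ALTER SESSION ENABLE PARALLEL DML;

-- direct-path, parallel insert into the fact table (names are placeholders)
INSERT /*+ APPEND PARALLEL(z, 8) */ INTO z
SELECT /*+ PARALLEL(s, 8) */
       s.*                  -- transformation logic pushed into the SQL
FROM   stage s;

COMMIT;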

> To build our fact table, let's call it Z (as I understand the process) we
> grab the primary key of the row in dimension A plus information which
> we can use to find the associated row in dimension B. We go to dimension B,
> grab the primary key plus information which allows us to
> find the associated row in dimension C, and so on through 8 dimension
> tables.

In this case I'd try to join the fact records with the dimensions in the database (e.g. using a parallel, partition-wise hash join), possibly pre-denormalizing the dimension hierarchy into a single table before the ETL process starts (to avoid the cascading joins).
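
A minimal sketch of that idea; all table and column names are invented, the real lookup logic will of course differ:

-- pre-denormalize the dimension chain (A -> B -> C ...) into one lookup
-- table, so the fact load needs a single join instead of eight cascading ones
CREATE TABLE dim_denorm AS
SELECT a.a_key, b.b_key, c.c_key,
       a.natural_key        -- the attribute the staged fact rows carry
FROM   dim_a a
JOIN   dim_b b ON b.b_id = a.b_lookup
JOIN   dim_c c ON c.c_id = b.c_lookup;

-- resolve all surrogate keys in one set-based hash join
-- (parallel DML enabled as above)
INSERT /*+ APPEND PARALLEL(z, 8) */ INTO z
SELECT /*+ PARALLEL(s, 8) USE_HASH(d) */
       d.a_key, d.b_key, d.c_key, s.measure
FROM   stage s
JOIN   dim_denorm d ON d.natural_key = s.natural_key;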

One starting point when troubleshooting an ETL process is to rewrite the process (possibly simplified) as a CTAS statement. This gives you a quick orientation as to where the problem could be. If the CTAS doesn't perform as expected, it is probably an a) or c) problem and should be solved in the database. Otherwise you get an estimate of the achievable target throughput of the ETL transformation.
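
For example (again with invented names):

-- the (possibly simplified) transformation rewritten as a CTAS
CREATE TABLE z_test NOLOGGING PARALLEL 8 AS
SELECT d.a_key, d.b_key, d.c_key, s.measure
FROM   stage s
JOIN   dim_denorm d ON d.natural_key = s.natural_key;

-- fast CTAS but slow ETL tool -> the bottleneck is b), the transformation
-- slow CTAS as well           -> look at a) or c) in the database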

Regards,

Jaromir D.B. Nemec

--
http://www.freelists.org/webpage/oracle-l
Received on Sun Dec 23 2007 - 14:49:25 CST
