Streamlining Oracle 9i ETL With Pipelined Table Functions
I've recently been reading Bert Scalzo's "Oracle DBAs Guide To Data Warehousing And Star Schemas", and he's got an interesting take on the ETL process that I thought was worth looking at in more detail.
Background
Bert was the DBA behind the 7-Eleven data warehouse, one of the first multi-terabyte dimensional data warehouses built on the Oracle database. The book is a bit different to other Oracle data warehousing books in that it concentrates on the implementation details of building a warehouse, and as it's not by a member of Oracle staff he tends to have forthright opinions on which elements of the technology stack work, and which don't. A particular example of this is where he looks at the various releases of the Oracle database (7, 8i and 9i) and explains in some detail why 7 and 8i don't really cut the mustard as large-scale data warehousing platforms. In another instance, he looks at the evolution of the star transformation, explains why the original implementations (first based on nested loops, and later on Cartesian joins) weren't up to scratch, but then absolutely insists that you use the version that comes with Oracle 9i. As the author's not afraid to offer an opinion it's an entertaining read, and a bit different to your run-of-the-mill 'new features' book.
One chapter that particularly interested me was his approach to loading the data warehouse. Right from the outset, Bert eschews the use of ETL tools, preferring instead to code the load routines manually. In his opinion, the trouble with ETL tools is:
- The code they produce is not optimally efficient, and often uses slow-running Java to host and run the processes
- The cost of the tools is often more than the hours they replace
- They allow the project to be staffed by people with little Oracle or data warehousing experience
This is a theme echoed in an OracleWorld paper by Arup Nanda that I looked at last year, which again rejected the use of ETL tools, but this time because the nature of the ETL (lots of small transformations applied to a large number of nearly identical tables) was better served by a set of Unix shell scripts that in turn generated SQL scripts to a pattern. I'll come back to the ETL vs. hand-coding debate later on in the article.
Anyway, the ETL process that the author wanted to carry out involved the following steps:
- load data from a large flat file
- apply transformations to the data
- update/insert the data into a dimension table
What made this interesting was the fact that the file was very, very big, and the ETL window was small. Therefore, any ETL process had to:
- Minimise inter-process wait states
- Maximise total concurrent CPU usage
To illustrate, take a situation where your ETL process starts with a file of 1000 records and has two stages, each of which takes an hour to run. The least efficient way to process this file is to let the first process handle all 1000 records (taking an hour to do so), then pass the data on to the second process, which will also take an hour to run. Your total run time is two hours. What the author is suggesting with 'minimise inter-process wait states' is that it would be good if your first process could handle, for example, 10 records, then pass these on to the second process so that it could start its work. The first process can then get on with the next 10 records, and pass them on when it's done. Your total run time would then be just a bit over one hour, cutting the time to perform the load dramatically.
The 'maximise total concurrent CPU usage' requirement aims to take advantage of the ability of multi-processor servers to carry out more than one task at once. Oracle is very good at this, and has had this feature (known as parallelism) for many years. However, it only really comes into play if you use set-based DML statements (INSERT, UPDATE and so on) to do your load; if you write your load in PL/SQL using cursors and loops to process your data row by row, you're not going to be able to take advantage of parallel DML. What these two requirements mean, therefore, is that your ideal ETL process is going to use parallel processing, and somehow reduce to a minimum the wait between different ETL stages.
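To make the contrast concrete, here's a minimal sketch (the contracts_stage staging table is invented for the illustration, and the 1.6 adjustment factor is borrowed from the example later in the article) of the same load written row by row, and then as a single parallel, set-based statement:

-- Row-by-row: each record is fetched and inserted individually,
-- so the load runs serially no matter how many CPUs are available
begin
  for r in (select contract_id, descrip, init_val_loc_curr from contracts_stage) loop
    insert into contracts_dim (contract_id, descrip, init_val_loc_curr, init_val_adj_amt)
    values (r.contract_id, r.descrip, r.init_val_loc_curr, r.init_val_loc_curr * 1.6);
  end loop;
  commit;
end;
/

-- Set-based: the same load as one direct-path, parallel insert that Oracle
-- can spread across multiple parallel execution servers
alter session enable parallel dml;

insert /*+ append parallel(contracts_dim,10) */ into contracts_dim
  (contract_id, descrip, init_val_loc_curr, init_val_adj_amt)
select /*+ parallel(contracts_stage,10) */
       contract_id, descrip, init_val_loc_curr, init_val_loc_curr * 1.6
from   contracts_stage;

commit;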
Introducing External Tables
One of the key new features in Oracle 9i for business intelligence and data warehousing was the inclusion of a number of ETL features within the database, the point of which was to remove the requirement to purchase a separate ETL engine such as Informatica, Genio or Datastage. The approach proposed by the author uses these new features to shortcut the warehouse load process.
If you remember from earlier in the article, our load process had to carry out three steps:
- load data from a large flat file
- apply transformations to the data
- update/insert the data into a dimension table
Traditionally, you would accomplish this first step by using SQL*Loader. SQL*Loader allows you to import a flat file into the database, carry out basic transformations on it (change case, reformat, reject invalid rows), and if it helps speed up the process, load it in parallel using direct path. As you can only insert data using SQL*Loader, and the degree to which transformations can take place is limited, you'd generally load the data into a staging table and then process it in a second step.
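For reference, a minimal SQL*Loader control file for this kind of two-step load might look something like the sketch below; the contracts_stage staging table is an assumption, and the file layout matches the contracts file used in the external table example that follows:

OPTIONS (DIRECT=TRUE)
LOAD DATA
INFILE 'contracts_file.csv'
APPEND
INTO TABLE contracts_stage
FIELDS TERMINATED BY ','
TRAILING NULLCOLS
(
  contract_id,
  descrip,
  init_val_loc_curr,
  init_val_adj_amt
)

You'd then run it with something along the lines of sqlldr userid=dw_user/password control=contracts.ctl, and carry out the real transformation work against contracts_stage in a second step.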
Oracle 9i however introduced a new feature called External Tables, which allows you to define a database table over a flat file. Taking as our example a comma-separated contracts file that is used to load data into a contracts dimension, the code to create an external table would be:
create directory inp_dir as '/home/oracle/input_files';

create table contracts_file
(
  contract_id        number,
  descrip            varchar2(50),
  init_val_loc_curr  number,
  init_val_adj_amt   number
)
organization external
(
  type oracle_loader
  default directory inp_dir
  access parameters
  (
    fields terminated by ','
  )
  location ('contracts_file.csv')
)
parallel 10;
The External Table feature allows you to embed the SQL*Loader control file into the table DDL script, and then lets you run SELECT statements against the flat file. You can include the external table in joins, subqueries and so on, but you can't use it to delete or update data in the flat file. External tables came in with Oracle 9i, and they're available as data sources in recent versions of Oracle Warehouse Builder.
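Once the external table is in place, it's worth running a couple of quick sanity checks against it with ordinary SQL before building anything on top of it, for example:

select count(*) from contracts_file;

select contract_id, descrip, init_val_loc_curr
from   contracts_file
where  rownum <= 10;

Both statements read the flat file directly at query time, so problems with the file format tend to show up here rather than halfway through the load.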
(Pipelined) Table Functions
Traditionally, once we'd loaded the flat file data into a staging table via a SQL*Loader script, the next stage would be to transform the data, either through a series of DML commands or by using a PL/SQL package. The transformation would have one or more stages, and the stages would run sequentially, one after the other.
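As an illustration, a traditional staged version of the contracts load might look something like the sketch below (the contracts_stage table is an assumption, and the 1.6 adjustment factor is taken from the pipelined example further down), with each statement running to completion before the next one starts:

-- Step 1: SQL*Loader (or an INSERT ... SELECT) populates contracts_stage

-- Step 2: a separate DML pass applies the transformation to the staged rows
update contracts_stage
set    init_val_adj_amt = init_val_loc_curr * 1.6;

commit;

-- Step 3: the transformed rows are copied into the dimension table
insert into contracts_dim (contract_id, descrip, init_val_loc_curr, init_val_adj_amt)
select contract_id, descrip, init_val_loc_curr, init_val_adj_amt
from   contracts_stage;

commit;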
Oracle 9i now gives us the opportunity to improve on this situation, by allowing us to create our transformations as functions - functions that take our external table as an input, and output the transformed data as rows and columns that can be used to update a table. These functions, known as Table Functions, can also be 'pipelined', which makes Oracle start returning rows from the function before processing has finished. By pipelining table functions, you can string a number of them together and have them 'pass off' rows to the next process as soon as a batch of rows has been transformed, meeting our requirement to minimise inter-process wait states. So how does this work in practice?
-- Object type describing a single transformed contract row
create or replace type rec_contract_type is object
(
  contract_id        number,
  descrip            varchar2(50),
  init_val_loc_curr  number,
  init_val_adj_amt   number
);
/

-- Dimension table that the transformed rows will be merged into
create table contracts_dim
(
  contract_id        number,
  descrip            varchar2(50),
  init_val_loc_curr  number,
  init_val_adj_amt   number
);

-- Collection type returned by the pipelined table function
create or replace type table_contract_type is table of rec_contract_type;
/

-- Package spec: a ref cursor over the external table, and the
-- parallel-enabled, pipelined transformation function
CREATE OR REPLACE PACKAGE transform_pkg
AS
  TYPE contract_cur IS REF CURSOR RETURN contracts_file%ROWTYPE;
  FUNCTION go (contract_arg IN contract_cur)
    RETURN table_contract_type
    PARALLEL_ENABLE (PARTITION contract_arg BY ANY)
    PIPELINED;
END;
/

-- Package body: fetch each input row, apply the transformation
-- and pipe the transformed row straight back to the caller
CREATE OR REPLACE PACKAGE BODY transform_pkg AS
  FUNCTION go (contract_arg IN contract_cur)
    RETURN table_contract_type
    PARALLEL_ENABLE (PARTITION contract_arg BY ANY)
    PIPELINED
  IS
    contract_rec_in  rec_contract_type := rec_contract_type(null,null,null,null);
    contract_rec_out rec_contract_type := rec_contract_type(null,null,null,null);
  BEGIN
    LOOP
      FETCH contract_arg INTO contract_rec_in.contract_id,
                              contract_rec_in.descrip,
                              contract_rec_in.init_val_loc_curr,
                              contract_rec_in.init_val_adj_amt;
      EXIT WHEN contract_arg%NOTFOUND;
      contract_rec_out.contract_id       := contract_rec_in.contract_id;
      contract_rec_out.descrip           := contract_rec_in.descrip;
      contract_rec_out.init_val_loc_curr := contract_rec_in.init_val_loc_curr;
      contract_rec_out.init_val_adj_amt  := contract_rec_in.init_val_loc_curr * 1.6;
      PIPE ROW (contract_rec_out);
    END LOOP;
    RETURN;
  END;
END;
/
As the example shows, the function takes as its argument a ref cursor that points to the data being fed into the transformation, and returns a collection of the table_contract_type type containing the transformed rows. The PIPELINED clause tells the function to return rows as soon as a PIPE ROW() command is used, which happens every time a row from the cursor is processed by the function. The function therefore takes as its input a pointer to the results of a SELECT statement, and spits out at the other end the transformed version of those rows and columns, which the calling statement can query through the TABLE() operator.
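Before wiring the function into the load proper, you can test it on its own by wrapping it in the TABLE() operator and feeding it a CURSOR expression over the external table:

select *
from   table(transform_pkg.go(
         cursor(select * from contracts_file)))
where  rownum <= 10;

Because the function is pipelined, rows start coming back as soon as the first ones have been transformed, rather than after the whole file has been processed.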
The Merge Command
What we want to do now is take the transformed data and use it to update our dimension table. In the past, we'd have accomplished this with a combination of INSERT and UPDATE statements, but with Oracle 9i we've now got the MERGE command, which does this in one step. What's more, we can wrap the whole transformation process up into this one MERGE statement:
alter session enable parallel dml;

MERGE /*+ parallel(d,10) append */
INTO contracts_dim d
USING (select * from table(
         transform_pkg.go(CURSOR(
           select /*+ parallel(contracts_file,10) full(contracts_file) */ *
           from contracts_file)))) f
ON (d.contract_id = f.contract_id)
WHEN MATCHED THEN
  update set descrip           = f.descrip,
             init_val_loc_curr = f.init_val_loc_curr,
             init_val_adj_amt  = f.init_val_adj_amt
WHEN NOT MATCHED THEN
  insert values (f.contract_id,
                 f.descrip,
                 f.init_val_loc_curr,
                 f.init_val_adj_amt)
/
So there we have it - our complex ETL function all contained within a single MERGE statement. No separate SQL*Loader phase, no staging tables, and all piped through and loaded in parallel.
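If you want to check that the statement really did run in parallel, one quick way (a diagnostic sketch, assuming you have access to the V$ views) is to query V$PQ_SESSTAT from the same session immediately after the MERGE:

select statistic, last_query, session_total
from   v$pq_sesstat
where  statistic in ('Queries Parallelized', 'DML Parallelized');

A non-zero LAST_QUERY value against 'DML Parallelized' indicates that the insert/update side of the MERGE was carried out by parallel execution servers.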
Oracle Warehouse Builder
The good news for OWB users is that recent releases of OWB (9.0.4 upwards) support table functions as transformations in a mapping process.
The current version of OWB (10g Release 1) supports table functions that you've written and then imported into the OWB repository, whilst the next release, codenamed 'Paris', will allow you to build the table function graphically within OWB. This support for table functions, together with the ability to fine-tune the load parameters and take advantage of 9i ETL features like MERGE and pipelining, actually invalidates Bert's argument that ETL tools don't produce efficient, optimal code, although to be honest all of this is fairly new functionality and the book is a couple of years old now.
Conclusion
With the new ETL functionality in Oracle 9i and 10g, plus the support for this functionality in OWB 9i and 10g, there's really no need now to split every transformation into a series of staging tables and separate transformations. If you start to use features such as table functions, external tables, parallelism and pipelining, you can start to take advantage of the features Oracle have built into the database engine specifically for loading data warehouses.
If you want to read more about this subject, you can check out Bert Scalzo's excellent book "Oracle DBAs Guide To Data Warehousing And Star Schemas", or take a look at the following resources:
* OTN article on Table Functions
* OTN OWB Forum thread on Table Functions
* "Loading and transform external data into Oracle 9i"
* "CASTing About For a Solution: Using CAST and Table Functions in PL/SQL"
* "Using Pipelined Table Functions (Oracle 9i)"
Mark Rittman is a Certified Oracle Professional DBA and works as a Consulting Manager at Plus Consultancy, specializing in developing BI & DW applications using the Oracle Database, Application Server, Discoverer, Warehouse Builder and Oracle OLAP. Outside of Plus Consultancy, Mark chairs the UKOUG BI & Reporting Tools SIG, and runs a weblog dedicated to Oracle BI & Data Warehousing technology.
Mark recently was awarded an OTN Community Award for contributions to the Oracle developer community, and is a regular speaker at Oracle User events in both the UK and Europe.
- Mark Rittman's blog
Comments
Very informative piece. Thanks, Mark. Pipelined functions seem to have great potential in a DW environment. It'd be nice to find more tutorials on how to use them, with downloadable scripts to play with.
Why, if I cut and paste this code directly into JDeveloper, do I get the error "Invalid PARTITION enabler"? I've desperately Google'd this error, and searched for it on MetaLink; nothing. In fact, I get this error WHENEVER I try to PARALLEL_ENABLE a pipelined function.
An eyeopener!
All the advantages are well articulated.
This article helped me in designing our ETL process since we are using Oracle9i.
kudos mark!
ETL Pipelined Table Functions...
Thank you for this very interesting information! I have been playing with your script on my Oracle 10.2 database (standard installation) on a notebook with an Intel Centrino Duo T5600 (1.83GHz, 987MHz, 0.99 GB RAM). I was trying to tune the script with a 1.1 GB file, which comes to about 192,000 rows in the external table. What I found out is that you can speed up the update/insert enormously (especially when you have a lot of updates) if you first insert the rows into a temporary table using the call to the pipelined table function, and then execute the MERGE command against that table rather than against the pipelined table function. In between you must commit.
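For anyone who wants to try that variation, a rough sketch of it (the temporary table is an assumption; the other names follow the article's example) would be:

create global temporary table contracts_tmp
on commit preserve rows
as select * from contracts_dim where 1 = 0;

insert into contracts_tmp
select * from table(transform_pkg.go(
  cursor(select * from contracts_file)));

commit;

merge into contracts_dim d
using (select * from contracts_tmp) f
on (d.contract_id = f.contract_id)
when matched then
  update set descrip           = f.descrip,
             init_val_loc_curr = f.init_val_loc_curr,
             init_val_adj_amt  = f.init_val_adj_amt
when not matched then
  insert values (f.contract_id, f.descrip, f.init_val_loc_curr, f.init_val_adj_amt);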