Re: need help designing a heavily concurrent reporting engine
Date: Tue, 04 Sep 2012 15:55:11 -0600
Message-ID: <504678BF.4010907_at_evdbt.com>
Maxim,
For ad-hoc activity, your best bet for a data model is dimensional <http://en.wikipedia.org/wiki/Dimensional_modeling> (a.k.a. star schema) with facts and dimensions.
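Purely to illustrate the shape (the table and column names here are invented, not from any real system), a minimal star schema might look like:

CREATE TABLE sales_fact (
    sale_date  DATE    NOT NULL,  -- partitioning key
    cust_id    NUMBER  NOT NULL,  -- foreign key to customer dimension
    prod_id    NUMBER  NOT NULL,  -- foreign key to product dimension
    qty        NUMBER,
    amount     NUMBER
)
PARTITION BY RANGE (sale_date)
(
    PARTITION p20120903 VALUES LESS THAN (TO_DATE('2012-09-04','YYYY-MM-DD')),
    PARTITION p20120904 VALUES LESS THAN (TO_DATE('2012-09-05','YYYY-MM-DD'))
);

CREATE TABLE customer_dim (
    cust_id    NUMBER PRIMARY KEY,
    cust_name  VARCHAR2(100)
);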
Range-partition the fact table(s) by date. Use the EXCHANGE-PARTITION load technique described in this white paper <http://www.evdbt.com/TGorman%20TD2009%20DWScale.doc> and presentation <http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt>; it lets you use INSERT+APPEND (NOLOGGING if necessary) to load data and TRUNCATE to remove it. Mostly direct-path loads for table data and index creation, and very little "conventional" DML -- everything you're wishing for. :-)
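In rough outline (the paper has the real details; again, the names are invented), each load cycle looks something like this:

-- 1) load a standalone "scratch" table with direct-path inserts
CREATE TABLE sales_fact_scratch NOLOGGING
    AS SELECT * FROM sales_fact WHERE 1=0;

INSERT /*+ APPEND */ INTO sales_fact_scratch    -- direct-path, minimal redo
SELECT sale_date, cust_id, prod_id, qty, amount
FROM   sales_stage;
COMMIT;

-- 2) build indexes and gather stats on the scratch table, offline
CREATE INDEX sales_fact_scratch_i1
    ON sales_fact_scratch (cust_id) NOLOGGING;
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES_FACT_SCRATCH')

-- 3) publish it all with a near-instantaneous dictionary operation
--    (assumes sales_fact has a matching local index)
ALTER TABLE sales_fact
    EXCHANGE PARTITION p20120904 WITH TABLE sales_fact_scratch
    INCLUDING INDEXES WITHOUT VALIDATION;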
If you have any slowly-changing dimensions, first load the "type 2" (time-variant) dimension table(s) using the EXCHANGE-PARTITION load technique, then re-load the "type 1" (point-in-time) dimension table(s) using the exchange-partition technique illustrated on slides 25-29 of the presentation <http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt>. Again, no need for expensive "conventional" operations; large-volume data manipulation is faster with INSERT+APPEND in parallel. None of this is needed if the dimension is not slowly-changing, by the way.
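The slides tell it better, but to give the flavor (invented names once more), the "type 1" re-load amounts to something like:

-- rebuild the point-in-time dimension from the current rows of the
-- time-variant dimension, into a scratch table...
CREATE TABLE cust_dim_t1_scratch NOLOGGING AS
SELECT cust_id, cust_name, cust_segment
FROM   cust_dim_t2
WHERE  eff_end_dt IS NULL;    -- i.e. the current version of each customer

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CUST_DIM_T1_SCRATCH')

-- ...then swap it in; the type-1 table is created with a single
-- partition purely so that it can be the target of an exchange
ALTER TABLE cust_dim_t1
    EXCHANGE PARTITION pmax WITH TABLE cust_dim_t1_scratch
    INCLUDING INDEXES WITHOUT VALIDATION;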
That's it in a nutshell. I've been using this stuff for ~14 years and talking to the world about it for over 8 years now. It's not easy to do, but it rocks.
By the way, I'll be doing my 2-day seminar on this topic (a.k.a. "/Scaling To Infinity: Partitioning Data Warehouses on Oracle Database/") for Oracle University in Munich, Germany on 22-23 October <http://education.oracle.com/webreg/ShoppingCart?arg=1&doccode=D70363_1657076&av_id=&org_id=34&territory_code=DE&lang=D&fee=&event_id=3253864&p_return_url=http://education.oracle.com/pls/web_prod-plq-dad/searchResult.searchEngineParser?p_search_keyword=Gorman&p_org_id=1001&p_lang=US&p_country=34,41&p_search_location=&p_search_category_id=&p_search_format=&p_search_start_month=OCT&p_search_start_year=2012&p_search_end_month=NOV&p_search_end_year=2012&p_page_number=1&p_search_from_date=&p_search_to_date=&p_forceExpandedSearch=N> and in Utrecht, Netherlands on 25-26 October <http://education.oracle.com/webreg/ShoppingCart?arg=1&doccode=D70363_1657076&av_id=&org_id=41&territory_code=NL&lang=NL&fee=&event_id=3253866&p_return_url=http://education.oracle.com/pls/web_prod-plq-dad/searchResult.searchEngineParser?p_search_keyword=Gorman&p_org_id=1001&p_lang=US&p_country=34,41&p_search_location=&p_search_category_id=&p_search_format=&p_search_start_month=OCT&p_search_start_year=2012&p_search_end_month=NOV&p_search_end_year=2012&p_page_number=1&p_search_from_date=&p_search_to_date=&p_forceExpandedSearch=N>. I'll also be doing the 60-90 minute version of the 2-day seminar at the Slovenian Oracle Users Group in Ljubljana, Slovenia on 15-16 October <http://www.sioug.si/index.php?option=com_content&view=article&id=201&Itemid=180&lang=en>.
I'd be glad to do one or more of the 60-90 minute presentations over a GoToMeeting webinar as well, if anyone is interested.
Please let me know if you have any questions.
Hope this helps...
--
Tim Gorman
consultant => Evergreen Database Technologies, Inc.
postal     => PO Box 352151, Westminster CO 80035
website    => http://www.EvDBT.com/
email      => Tim_at_EvDBT.com
mobile     => +1-303-885-4526
fax        => +1-303-484-3608
Lost Data? => http://www.ora600.be/ for info about DUDE...

On 9/4/2012 3:10 PM, Pakhutkin, Maxim (Max) wrote:
> Hello,
>
> I'm looking for some advice on designing a process for generating reports. The reports are ad-hoc, i.e. it's safe to assume that reports will be generated continuously and concurrently, with no start or finish point and no quiet time. This is basically a data-warehouse-type process that needs to live in a heavily OLTP-type system, so we hope to reduce datafile and redo I/O overhead as much as possible.
>
> The direction we've been going so far:
> 1) We are hoping to run multiple database sessions each working on its own set of data at a time (this corresponds to one report). This way we can add more processes if we need to speed things up (plenty of CPU and storage on the system). These sessions are called renderers (they render reports) and each will use its own render_key value to access its own data.
> 2) The processes would copy some detail data for each report into their own tables, do various joins of this detail data with other data to generate data for various sections of the report.
> 3) The detail data does not need to be restorable, so we are hoping to use INSERT+APPEND to put it in, TRUNCATE to get rid of it, and NOLOGGING on the tables/tablespaces that store it, to avoid the I/O overhead of using plain INSERT/DELETE.
> 4) We don't want to use GLOBAL temporary tables, if possible, because in addition to the render process accessing the data for its report, the "orchestrator" process will need to access the same data. Also, we want to be able to easily collect optimizer stats on the data rather than set them manually, which would be error-prone and a support overhead.
> 5) After each set of data is inserted, stats would be collected on that data, roughly as sketched below.
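>
> To make the intent concrete, each renderer would do roughly the following per report (table and column names are just placeholders):
>
> CREATE TABLE render_detail_7 (    -- one scratch table per renderer
>     render_key  NUMBER,
>     txn_id      NUMBER,
>     amount      NUMBER
> ) NOLOGGING;
>
> INSERT /*+ APPEND */ INTO render_detail_7    -- direct-path, minimal redo
> SELECT :render_key, d.txn_id, d.amount
> FROM   source_detail d
> WHERE  d.report_id = :report_id;
> COMMIT;    -- required before the session can re-read a direct-path load
>
> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'RENDER_DETAIL_7')
>
> -- ...joins against other data to generate the report sections...
>
> TRUNCATE TABLE render_detail_7;    -- cheap cleanup, no undo generated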
>
> The goal is to get maximum speed out of it, but also hopefully to keep code fairly reasonably maintainable.
>
> At first I thought of partitioned tables, with each process working on its own partition. But I found out that INSERT+APPEND and TRUNCATE PARTITION don't play well with concurrent access to the same table, even when different partitions are being accessed. Collecting stats also takes out latches on the whole table in the library cache, despite us collecting stats only at the partition level. For all of the above we get library cache waits of various kinds, slowing things down significantly.
>
> Then we thought of using separate tables, but joining them into a union-all view (for ease of code maintenance; the view is shown after the table DDL below), that way simulating a partitioned table. Each separate table would still have one column filled with the same value, dedicated to identifying which render process would work on it.
>
> CREATE TABLE X1 (
>     Render_key NUMBER,  -- contains the same value for all rows in this
>                         -- table; the appropriate bill rendering process
>                         -- supplies this value in queries to get only its
>                         -- own data from the union-all view
>     ...
> );
>
> CREATE TABLE X2(
> Render_key number,
> ...
> );
> Etc.
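>
> The union-all view over them would then be simply:
>
> CREATE OR REPLACE VIEW x_all AS
> SELECT * FROM X1
> UNION ALL
> SELECT * FROM X2;    -- one branch per renderer table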
>
> Now we can INSERT+APPEND, but TRUNCATE doesn't work too well: some processes accessing the view get an "object no longer exists" error when other processes truncate their own tables underneath the same view. (This may be partly because Oracle tries to use a very selective index that exists on each of these tables first and then filters by the render_key identifier, instead of first reading the render_key column histograms, realizing that only one table can contain the right value, and never touching the other tables, as I hoped it would. But I may be wrong, and it might behave that way even without the index.)
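>
> One thing we have not tried yet is the old "partition view" trick of putting a CHECK constraint on each table, so the optimizer can prove which branches of the view cannot match a literal predicate and skip them entirely:
>
> ALTER TABLE X1 ADD CONSTRAINT x1_rk_chk CHECK (render_key = 1);
> ALTER TABLE X2 ADD CONSTRAINT x2_rk_chk CHECK (render_key = 2);
>
> -- with a literal predicate, contradicted branches are pruned:
> SELECT * FROM x_all WHERE render_key = 1;    -- only X1 is scanned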
>
> I can't think of any other approaches that would get us what we want (speed + ease of code maintenance) and am ready to recommend to the developers that each process work in its own schema with its own tables (so code and tables would be duplicated everywhere, making maintenance harder).
>
> Can anyone think of other architectural alternatives that could give us what we want?
>
> Thanks in advance.
>
> --
> Max Pakhutkin