Re: need help designing a heavily concurrent reporting engine
Date: Wed, 5 Sep 2012 07:39:24 -0400
Message-ID: <CAGurbTPDPVaSR=rbzg1hB8nMb4NvE5FfCzR2WiX7r1ThfaVAdg_at_mail.gmail.com>
Here is an idea:
- Let your N renderer processes truncate, load, and transform their own pieces of data into renderer-specific tables (one per renderer; no one else needs to access them).
- Once a renderer has data ready for the orchestrator process, let it insert+append into a partitioned table (partitioned any which way you want), commit, and be done with it.
- Your orchestrator process accesses only this one table (from the previous step). It can collect stats on the necessary partitions, generate the reports, delete old data, and exit.
This means truncating the renderer-specific tables will not impact any other process: each renderer works in its own space, so stats collection on its tables affects only itself. The orchestrator process accesses a central table that everyone else only inserts into, so as long as the renderer processes keep doing that and the orchestrator eventually cleans up the data in this central table, they should all remain independent.
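Very roughly, in SQL (all object names below are made up for illustration; the real partitioning scheme and columns are up to you):

CREATE TABLE stage_r1 (render_key NUMBER, payload VARCHAR2(100)) NOLOGGING;

CREATE TABLE report_central (render_key NUMBER, payload VARCHAR2(100))
PARTITION BY LIST (render_key)
  (PARTITION p1 VALUES (1),
   PARTITION p2 VALUES (2));

-- renderer 1: work in its private space, then publish and move on
TRUNCATE TABLE stage_r1;
INSERT /*+ APPEND */ INTO stage_r1 SELECT 1, 'detail row' FROM dual;
COMMIT;
INSERT /*+ APPEND */ INTO report_central SELECT * FROM stage_r1;
COMMIT;

-- orchestrator: read its one table, report, clean up old data
SELECT COUNT(*) FROM report_central PARTITION (p1);
ALTER TABLE report_central TRUNCATE PARTITION p1;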
Not sure if this makes a whole lot of sense; I am still catching up on my first cup of coffee and thought of this quickly as a way to avoid dependencies. GTTs might work with the above method as well, but if you have continuously running processes, any DDL on the GTTs (when necessary) would be difficult.
Raj
On Tue, Sep 4, 2012 at 5:10 PM, Pakhutkin, Maxim (Max)
<maxim.pakhutkin_at_windstream.com> wrote:
> Hello,
>
> I'm looking for some advice on designing a process for generating reports. The reports are ad hoc, i.e. it's safe to assume that reports will be generated continuously and concurrently, with no start or finish point and no quiet time. This essentially data-warehouse-type process needs to live in a heavily OLTP system, so we hope to reduce datafile and redo I/O overhead as much as possible.
>
> The direction we've been going so far:
> 1) We are hoping to run multiple database sessions each working on its own set of data at a time (this corresponds to one report). This way we can add more processes if we need to speed things up (plenty of CPU and storage on the system). These sessions are called renderers (they render reports) and each will use its own render_key value to access its own data.
> 2) The processes would copy some detail data for each report into their own tables, do various joins of this detail data with other data to generate data for various sections of the report.
> 3) The detail data does not need to be restorable, so we are hoping to use INSERT+APPEND to put it in, TRUNCATE to get rid of it, and NOLOGGING on the tables/tablespaces that store it, to avoid the I/O overhead of plain INSERT/DELETE (see the sketch after this list).
> 4) We don't want to use GLOBAL temporary tables if possible, because in addition to the render process accessing the data for its report, the "orchestrator" process will need to access the same data. Also, we want to be able to easily collect optimizer stats on the data rather than have to set stats manually, which may be error-prone and is a support overhead.
> 5) After each set of data is inserted, stats would be collected on that data.
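> To make steps 3 and 5 concrete, here is roughly what we have in mind for one renderer (table and source names below are placeholders):
>
> CREATE TABLE render_detail_1 (render_key NUMBER, acct_id NUMBER)
> NOLOGGING; -- redo is skipped only for direct-path operations
>
> INSERT /*+ APPEND */ INTO render_detail_1 -- direct-path load
> SELECT 1, acct_id FROM source_accounts;
> COMMIT; -- required before the loaded segment can be queried again
>
> BEGIN -- step 5: gather stats on the freshly loaded data
>   DBMS_STATS.GATHER_TABLE_STATS(USER, 'RENDER_DETAIL_1');
> END;
> /
>
> TRUNCATE TABLE render_detail_1; -- cheap cleanup, no per-row undo/redo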
>
> The goal is to get maximum speed out of it, but also to keep the code reasonably maintainable.
>
> At first I thought of partitioned tables, with each process working on its own partition. But I found out that INSERT+APPEND and TRUNCATE PARTITION don't play well with concurrent access to the same table (even when different partitions of the table are being accessed). Collecting stats also takes out latches on the whole table in the library cache, despite us collecting stats only at the partition level. For all of the above we see library cache waits of various kinds, slowing things down significantly.
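> For reference, the partitioned attempt looked roughly like this (simplified, with hypothetical names):
>
> CREATE TABLE render_detail (render_key NUMBER, acct_id NUMBER)
> PARTITION BY LIST (render_key)
>   (PARTITION r1 VALUES (1),
>    PARTITION r2 VALUES (2));
>
> -- each renderer loads and clears only its own partition...
> INSERT /*+ APPEND */ INTO render_detail PARTITION (r1)
>   SELECT 1, acct_id FROM source_accounts;
> COMMIT;
> ALTER TABLE render_detail TRUNCATE PARTITION r1;
> -- ...but the TRUNCATE and per-partition stats still invalidate
> -- cursors on the shared table, hence the library cache waits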
>
> Then we thought of using separate tables, but joining them into a union-all view (for ease of code maintenance), that way simulating a partitioned table. Each separate table would still have one column filled with a single value identifying which render process works on it.
>
> CREATE TABLE X1 (
>   Render_key NUMBER, -- same value for all rows in this table; the
>                      -- appropriate bill-rendering process supplies it in
>                      -- queries to get only its own data from the view
>   ...
> );
>
> CREATE TABLE X2 (
>   Render_key NUMBER,
> ...
> );
> Etc.
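> The union-all view over them would be along these lines (the view name is made up):
>
> CREATE OR REPLACE VIEW x_all AS
> SELECT * FROM X1 UNION ALL
> SELECT * FROM X2; -- one branch per renderer table
>
> -- each renderer queries only its own slice
> SELECT * FROM x_all WHERE render_key = 1;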
>
> Now we can INSERT+APPEND, but TRUNCATE doesn't work too well: some processes accessing the view get an "object no longer exists" error when other processes truncate their own tables that are part of the same view. (This may be partly because Oracle tries to use a very selective index that exists on each of these tables first and then filters by the render_key identifier, instead of reading the render_key column histograms first, realizing that only one table can contain the right value, and leaving the other tables untouched, as I hoped it would. But I may be wrong, and it might behave that way even without the index.)
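> One thing we have not verified: the old partition-view trick of adding a CHECK constraint per table, so the optimizer can prove which branches of the view to skip. A purely hypothetical sketch:
>
> ALTER TABLE X1 ADD CONSTRAINT x1_rk CHECK (render_key = 1);
> ALTER TABLE X2 ADD CONSTRAINT x2_rk CHECK (render_key = 2);
> -- with these in place, a render_key = 1 predicate can be proven false
> -- for X2's branch at optimization time, so X2 should go untouched
>
> Whether that would also avoid the "object no longer exists" error, I don't know.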
>
> I can't think of any other approaches that would get us what we want (speed + ease of code maintenance) and am ready to recommend to the developers that each process work in its own schema with its own tables (so code and tables would be duplicated everywhere, making it all harder to maintain).
>
> Can anyone think of other architectural alternatives that could give us what we want?
>
> Thanks in advance.
>
> --
> Max Pakhutkin
>
> --
> http://www.freelists.org/webpage/oracle-l
>
>
--
http://www.freelists.org/webpage/oracle-l

Received on Wed Sep 05 2012 - 06:39:24 CDT