RE: How to design data movement

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Sat, 17 Jun 2023 11:29:50 -0400
Message-ID: <2c6801d9a130$905f7f20$b11e7d60$_at_rsiz.com>



It seems likely your best solution is the “guard table” method, implemented circa 1990 to process intermittently arriving and possibly partially incomplete or corrupted transaction records sent from stores over noisy asynchronous satellite links. (Burlington Coat Factory Warehouse: Gerry Claggett, and yours truly on behalf of Rightsizing, Inc.)  

First, you create the “guard table” (probably the persuasive case for the implementation of index organized tables), which is simply a single-column table with a unique constraint. (Initially, guard tables were implemented as a single-column table with a single-column unique index.)  
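
A minimal sketch of such a guard table as an index organized table; the table and column names here are hypothetical, not from the original design:

  CREATE TABLE txn_guard (
      txn_id NUMBER NOT NULL,
      CONSTRAINT txn_guard_pk PRIMARY KEY (txn_id)
  ) ORGANIZATION INDEX;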

For each batch table thread, create a control total table holding the file name, the initial arrival date (optional, but useful), the number of transactions in that file, and the number of transactions completed from that file (initially zero).  
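
A sketch of such a control total table, one row per batch file/thread (all names hypothetical):

  CREATE TABLE batch_control (
      file_name        VARCHAR2(200)        NOT NULL,
      initial_arrival  DATE    DEFAULT SYSDATE,
      txn_count_sent   NUMBER               NOT NULL,
      txn_count_done   NUMBER  DEFAULT 0    NOT NULL,
      CONSTRAINT batch_control_pk PRIMARY KEY (file_name)
  );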

(At the time guard tables were invented, an insert failure due to a unique violation was faster than a “not in” check, but you can test either way. Using the guard table (you only need a single guard table for a set of batch threads), either apply a “not in” filter for a given transaction, or just process the batch and rely on the insert failure (or successful insert) to guard against duplicate processing.)
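
The two styles look roughly like this, assuming the hypothetical txn_guard table above and a hypothetical stage_txn staging table:

  -- Style 1: just insert; a duplicate raises ORA-00001 (unique constraint violated)
  INSERT INTO txn_guard (txn_id) VALUES (:txn_id);

  -- Style 2: filter already-guarded transactions out up front
  SELECT s.*
    FROM stage_txn s
   WHERE s.txn_id NOT IN (SELECT g.txn_id FROM txn_guard g);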
 

A batch source table is complete when the number processed matches the number sent in that file. (You can keep the candidate file list small by periodically insert-appending completed rows to history, then renaming and copying the incomplete rows back to the active control total table; once again, your fastest update is an insert. Don’t delete completed rows. If you don’t need the history [but you will need the history], you can skip the copy-to-history step [which you will later implement when someone insists that you prove transactions from each thread were completed].)  
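
One way to realize that housekeeping step, again a sketch with hypothetical names (batch_control_history is assumed to have the same structure as batch_control):

  -- Insert-append completed control rows to history
  INSERT /*+ APPEND */ INTO batch_control_history
  SELECT * FROM batch_control
   WHERE txn_count_done = txn_count_sent;
  COMMIT;

  -- Rebuild the active table from only the incomplete rows, then swap names
  CREATE TABLE batch_control_new AS
  SELECT * FROM batch_control
   WHERE txn_count_done < txn_count_sent;

  RENAME batch_control TO batch_control_old;
  RENAME batch_control_new TO batch_control;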

So you pull all the transactions from all the incomplete batch table threads in a reasonable monolith size, ordered by transaction id, with the next-to-last step adding to the completed total for each file and the last step before commit being the insert into the guard table (or the check that the id is not already there, if your tests show that to be faster now). Then the batch either commits, or fails on a unique violation (in which case you do a rollback).  
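
A minimal PL/SQL sketch of that flow, using a monolith size of 1 for brevity and the hypothetical tables above plus a hypothetical main_txn target (the real work per transaction is stubbed as a single insert):

  BEGIN
      FOR t IN (SELECT DISTINCT s.txn_id, s.file_name
                  FROM stage_txn s
                  JOIN batch_control c ON c.file_name = s.file_name
                 WHERE c.txn_count_done < c.txn_count_sent
                 ORDER BY s.txn_id)
      LOOP
          BEGIN
              -- the real work for this transaction
              INSERT INTO main_txn
              SELECT * FROM stage_txn WHERE txn_id = t.txn_id;

              -- next-to-last step: bump the completed count for the file
              UPDATE batch_control
                 SET txn_count_done = txn_count_done + 1
               WHERE file_name = t.file_name;

              -- last step before commit: the guard insert
              INSERT INTO txn_guard (txn_id) VALUES (t.txn_id);
              COMMIT;
          EXCEPTION
              WHEN DUP_VAL_ON_INDEX THEN
                  ROLLBACK;   -- already processed earlier; skip it
          END;
      END LOOP;
  END;
  /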

On failure you can either cut the monolith size in half, reduce the monolith size to 1, or use a “not in” filter on the guard table (which you might already be doing if that is now faster).  

(In practice the texture of the data at BCFW was such that reducing the monolith size to 1 turned out to be the least total work, which was the most important thing other than getting the right answer on 75 MHz chips.)
 

Repeatedly grinding the grist and throwing away already-completed transactions in this way, along with knowing when an input batch file could be archived, turned out to be really useful.  

Someone may have a better way, but I don’t. In practice we used both “count of transactions” and “total dollar amount” in the control tables for the purpose of knowing how much was partially received.  

We used the date of initial receipt of the file to trigger investigation of why something more than a certain amount of time had not been completed.  

Good luck; noisy and differentially time-lagged components of what are essentially atomic transactions are tricky.  

mwf      

From: yudhi s [mailto:learnerdatabase99_at_gmail.com]
Sent: Friday, June 16, 2023 6:40 AM
To: Mark W. Farnham; Pap
Cc: Oracle L
Subject: Re: How to design data movement  

Thank you so much Mark and Pap.  

We also have scenarios in which, say, table1 has transactions 1 to 200 loaded in stage as part of a batch. As part of the same batch, table2 has transactions 1 to 100 loaded to the stage schema. However, Table3 has transactions 1 to 200 loaded but is missing 50 transactions in between (i.e., 50 to 100). Those missing transactions may come to table3 in the next batch.  

In the above case, the logic of taking the minimum of the max transactions from each table and pulling records from the stage schema based on that will go wrong. It will miss those 50 transactions.  

So I believe that to handle this scenario we have no option other than to find a suitable minimum waiting period, so as to ensure all related records for a transaction have made it to the stage schema for all the tables. Again, this is no foolproof solution, as we can't have a very long waiting or buffer period, and at the same time, in case of any mishaps upstream, the waiting period may not be enough to collect all completed transactions from the related tables. But then I think we will have to manually load/clean up data in those cases.  

Please correct me if my understanding is wrong here, or does any other possible solution exist to handle this?  

On Sat, 10 Jun, 2023, 4:16 pm Mark W. Farnham, <mwf_at_rsiz.com> wrote:

If there are multiple rows arriving for a single transaction id from any of the sources that can arrive in staggered commits, that is a different issue.  

In that case, the next question is: “Is there any way to know when a particular transaction is complete on each source?”  

IF there is a way to know, then a one row per complete transaction “spine” from each source becomes the reasonable thing to pull for the intersection list, and waiting for completeness from all sources is a reasonable solution unless something particularly time sensitive makes that unreasonable. Again, air traffic control and missile defense come to mind.  

IF the same “transaction id” is being used days or months later, I would claim that is an identifier of a group of transactions, not a single transaction. IF, on the other hand, the rows of the transaction are intended to take place in a small amount of elapsed time and there is a way to know whether (or not) the transaction is complete on each source, I would argue to wait until the transaction is complete on all sources before adding it into the analytics. The description of “what is a transaction” from a single commit at a single place to an integrated “business object” is something that needs to be carefully defined for each application system being built.  

IF there is uncertainty about whether the arrival of information about a transaction is incomplete from any source, I don’t have a good solution for that other than the Tim Gorman solution.  

Amongst the less-than-ideal solutions is plugging in values of a reasonable guess at the maximum length of the final value to minimize row length change on the update. The crossover point where you are better off using the full partition replacement (exchange) method to accommodate “updates” is often a surprisingly small percentage of transactions, but it is definitely a measurable slope.  

The smaller the time slice, the more likely you are to have few enough “updates” of missing information to make the sum of the expense of the updates less than the expense of a partition exchange, and likewise, the bigger the sum of the sizes of the partitions with “updates.” An hour is a pretty small time slice for analytics for most human decisions.  
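
A sketch of the exchange alternative, heavily hedged: it assumes a main_tab partitioned by batch hour, with exactly the columns listed and a partition p_09hrs for the sample hour, none of which comes from the original posts:

  -- Rebuild the affected time slice completely into a scratch table
  CREATE TABLE main_tab_scratch AS
  SELECT * FROM main_tab WHERE 1 = 0;

  INSERT /*+ APPEND */ INTO main_tab_scratch (txn_id, col1, col2, col3, col4, col5, col6)
  SELECT t1.txn_id, t1.col1, t1.col2, t2.col3, t2.col4, t3.col5, t3.col6
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.txn_id = t1.txn_id
    LEFT JOIN table3 t3 ON t3.txn_id = t1.txn_id
   WHERE t1.batch_time = '9hrs';   -- batch value taken from the sample data below
  COMMIT;

  -- Swap the rebuilt data in for the whole partition in one dictionary operation
  ALTER TABLE main_tab
    EXCHANGE PARTITION p_09hrs WITH TABLE main_tab_scratch
    INCLUDING INDEXES WITHOUT VALIDATION;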

Somewhere between years and microseconds the adjective should change from analytics to reflexes.  

mwf  

From: yudhi s [mailto:learnerdatabase99_at_gmail.com]
Sent: Saturday, June 10, 2023 4:55 AM
To: Mark W. Farnham
Cc: Oracle L
Subject: Re: How to design data movement  

Thank You mark.  

If I understand you correctly, in such a situation the approach should be to find the max(txn_id) from each of the tables, so in the above scenario it will be '4' for table1, '2' for table2, and '3' for table3. Then take the min of all those txn_ids, which will be txn_id '2' here, and only load the data with txn_id <= 2 using an equi-join across all the tables. The rest of the txn_ids' data should get picked up in subsequent hourly runs. Hope my understanding is correct here.  
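
That "min of the maxes" high-water mark can be computed in one pass; a sketch using the stage table names from the original example:

  SELECT LEAST( (SELECT MAX(txn_id) FROM table1),
                (SELECT MAX(txn_id) FROM table2),
                (SELECT MAX(txn_id) FROM table3) ) AS safe_max_txn_id
    FROM dual;

  -- then each load query filters on txn_id <= :safe_max_txn_id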

But even with this approach, with table2 and table3 having multiple rows with the same txn_id, if only one row came in the first batch and the other rows for the same txn_id came in a subsequent batch, there will be a chance of missing some of the rows for some txn_ids. Is there any way to handle such a scenario?  

Regarding the point below, do you mean to say we will pick all the txn_ids' data using an outer join, but in place of NULL we will manually put some default values so that the length of the row will not increase when we later update the column with the exact value while catching up on the missing txn_ids' data?  

"IF you insist on UPDATE, find yourself an acceptable stand-in for NULL for as many of the columns as you can that is either the maximum width of the column or something like 80 percent of the maximum column width. UPDATEs are still expensive, but they are less expensive if the rows don’t grow on insert."  

Or should we wait and run the downstream batch with a ~1 hr delay, meaning if the upstream batch pushes the data into stage for 9 hrs, then the downstream batch will be triggered for the 8 hrs batch (i.e., giving a buffer of 1 hr) to minimize the missing txn_ids' data in any table. Would this be the correct approach?      

On Fri, Jun 9, 2023 at 4:57 PM Mark W. Farnham <mwf_at_rsiz.com> wrote:

My experience is that the folks USING the data find difficulties using missing bits. Only rarely is it useful to include partial data. Air traffic control and missile defense come to mind.  

IF your analysts agree with my analysis, THEN the best way I know of is to find the maximum transaction id for which you have data in all the sources and use that as a maximum filter on each.  

This relies on no “Swiss Cheese” batches where some transaction ids are skipped in some sources and are filled in later, in which case you can also record the maximum transaction already processed (zero the first time unless you have negative transaction ids), which also presumes that your transaction ids are monotonically increasing (like time in most of our universe.)  

Or you can simply use inner joins, or build just an inner join spine intersecting the tables pairwise until you have a spine that will work when you pull all the non-intersection columns after building the spine.  
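
A sketch of such a spine, using the stage table names from the original post (the spine table itself is a hypothetical addition):

  CREATE TABLE txn_spine AS
  SELECT t1.txn_id
    FROM (SELECT DISTINCT txn_id FROM table1) t1
    JOIN (SELECT DISTINCT txn_id FROM table2) t2 ON t2.txn_id = t1.txn_id
    JOIN (SELECT DISTINCT txn_id FROM table3) t3 ON t3.txn_id = t1.txn_id;

  -- afterwards, join each stage table back to txn_spine to pull the non-key columns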

Your proposal to build with NULLs for missing values is fraught with problems. UPDATES are expensive, and filling in NULLs expands each row dynamically. (IF your analysts actually prefer to include rows with temporarily missing data, read up on Tim Gorman’s explanations of why and how an INSERT is the fastest update if your data is even mildly batchy, which probably fits your data arrival texture).  

IF you insist on UPDATE, find yourself an acceptable stand-in for NULL for as many of the columns as you can that is either the maximum width of the column or something like 80 percent of the maximum column width. UPDATEs are still expensive, but they are less expensive if the rows don’t grow on insert.  
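
A sketch of what that stand-in looks like on the insert side, assuming (hypothetically) that the missing columns are VARCHAR2(100) and the target is a main_tab with the columns shown:

  INSERT INTO main_tab (txn_id, col1, col2, col3, col4)
  SELECT t1.txn_id, t1.col1, t1.col2,
         NVL(t2.col3, RPAD('#', 80, '#')),   -- about 80 percent of the 100-char maximum
         NVL(t2.col4, RPAD('#', 80, '#'))
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.txn_id = t1.txn_id;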

Good luck.  

People are really making decisions hourly? That seems like a madhouse to me for MOST real world usefulness. Even when trends are important on a short trigger, using the math for trends based on new arrivals against the existing averages for some weighted time period is probably better than generating a full new data set.  

mwf      

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of yudhi s
Sent: Friday, June 09, 2023 1:31 AM
To: Oracle L
Subject: How to design data movement  

Hello Listers,

We have multiple transaction tables, and data in those is streamed from upstream systems using tools like Kafka, Glue, etc. We have an analytical database at the destination which consumes the data and will publish reports on it. In the analytical database, the data gets dumped into a stage schema as-is, and then procedures pull the stage tables' data, transform it, and populate the main schema tables (which have different structures), on which the reporting/analytical queries are executed. Some structures in the main schema are one-to-one with what is on stage, but others join/aggregate multiple "stage tables" into one table in the "main schema".

Now the issue is, we are getting the batches from the source loaded to the stage schema of the analytical database once hourly. And we are seeing that even though the tables hold related data (i.e., the transaction ID column), they are arriving with different lags. Say, for example, at any point in time when we are trying to pull data from stage and populate the main schema (which will also run at an hourly frequency), the stage tables look something like below.

Table1 has all the data/the maximum number of transaction ids, but table2 and table3 are lagging behind. Table2 is yet to be populated with the data for transaction ids 3 and 4, and Table3 is yet to be populated with the data for transaction id 4; those will most likely be populated in the stage area of the analytical database in the next hourly batch by the upstream job.

So how should we handle this scenario?

So in the above scenario we are thinking of: 1) Pick up all the required data for a specific batch from stage, using an "outer join" on the TXN_ID column across the related tables, and populate the main schema tables first. This will populate all the TXN_IDs in the main schema, but with NULL values for the attributes of TABLE2 and TABLE3, i.e., col3, col4, col5, col6 will be populated as NULL for the missing txn_ids.

2) In the next hourly run of the procedures, we will check whether any of the values of the columns (col3, col4, col5, col6) are showing NULL, and if so we will execute UPDATE queries to populate those columns of the main schema table for those specific transaction IDs from Table2 and Table3 (a rough sketch of both steps is below).  
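
A rough sketch of those two steps in SQL, using the stage table names from the example below and a hypothetical main_tab target:

  -- Step 1: outer-join load; attributes that have not arrived yet land as NULL
  INSERT INTO main_tab (txn_id, col1, col2, col3, col4, col5, col6)
  SELECT t1.txn_id, t1.col1, t1.col2, t2.col3, t2.col4, t3.col5, t3.col6
    FROM table1 t1
    LEFT JOIN table2 t2 ON t2.txn_id = t1.txn_id
    LEFT JOIN table3 t3 ON t3.txn_id = t1.txn_id
   WHERE t1.batch_time = '9hrs';

  -- Step 2 (next hourly run): fill in the NULLs for rows whose data has now arrived
  UPDATE main_tab m
     SET (col3, col4) = (SELECT t2.col3, t2.col4
                           FROM table2 t2
                          WHERE t2.txn_id = m.txn_id
                            AND ROWNUM = 1)   -- ROWNUM = 1 arbitrarily picks one stage row;
                                              -- real code needs a deterministic rule
   WHERE m.col3 IS NULL
     AND EXISTS (SELECT 1 FROM table2 t2 WHERE t2.txn_id = m.txn_id);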

I wanted to understand from the experts whether there is any better way to handle such data streaming challenges.

TABLE1
TXN_ID COL1 COL2 Batch_time

1        A     B    9hrs
2        C     D    9hrs
3        E     F    9hrs
4        G     H    9hrs


TABLE2
TXN_ID COL3 COL4 Batch_time

1       I     J    9hrs
1       K     L    9hrs
2       M     N    9hrs


TABLE3
TXN_ID COL5 COL6 Batch_time

1       O     P   9hrs
2       Q     R   9hrs
3       S     T   9hrs
3       U     V   9hrs

 

Regards

Yudhi

--
http://www.freelists.org/webpage/oracle-l