How to design data movement

From: yudhi s <learnerdatabase99_at_gmail.com>
Date: Fri, 9 Jun 2023 11:01:17 +0530
Message-ID: <CAEzWdqcrZB8RoK-FLEsLsO+aVPBESmNz7TMF7CK9=kS=apBQdQ_at_mail.gmail.com>



Hello Listers,
We have multiple transaction tables whose data is streamed in from upstream systems using tools like Kafka, Glue etc. At the destination we have an analytical database that consumes the data and publishes reports on it. In the analytical database the data is first dumped as-is into a stage schema; procedures then pull the stage table data, transform it, and populate the main schema tables (which have different structures), and the reporting/analytical queries are executed against those. Some structures in the main schema are one-to-one with stage, while others join/aggregate data from multiple stage tables into one table in the main schema.

Now the issue is, the batches from the source are loaded into the stage schema of the analytical database once an hour. And we are seeing that even though the tables hold related data (linked by the transaction ID column), they arrive with different lags. So at any point in time when we try to pull data from stage and populate the main schema (which will also run at an hourly frequency), the stage tables look something like the example below.

Table1 has all the data (the maximum number of transaction IDs), but Table2 and Table3 are lagging behind. Table2 has not yet been populated with the data for transaction IDs 3 and 4, and Table3 has not yet been populated with the data for transaction ID 4; that data will most likely land in the stage area of the analytical database in the next hourly batch from the upstream job.

So how should we handle this scenario?

So in the above scenario we are thinking of 1) Picking up all the required data for a specific batch from stage using an "outer join" on the TXN_ID column across the related tables, and populating the main schema tables. This will populate all the TXN_IDs in the main schema, but with NULL values for the attributes from TABLE2 and TABLE3, i.e. col3, col4, col5, col6 will be populated as NULL for the missing TXN_IDs.

2) In the next hourly run of the procedures, checking whether any of the columns (col3, col4, col5, col6) are NULL, and if so executing UPDATE queries to populate those columns of the main schema table, for those specific transaction IDs, from Table2 and Table3.
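The two-step plan above can be sketched end to end. This is a minimal illustration using Python's sqlite3 (not Oracle) with made-up table and column names (stage_t1, main_wide, etc.); the real stage/main schema names and batch logic would differ.

```python
import sqlite3

# In-memory sketch of the two-step plan: outer-join load with NULLs,
# then a backfill UPDATE on the next hourly run. All names are illustrative.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE stage_t1 (txn_id INTEGER, col1 TEXT, col2 TEXT);
CREATE TABLE stage_t2 (txn_id INTEGER, col3 TEXT, col4 TEXT);
CREATE TABLE stage_t3 (txn_id INTEGER, col5 TEXT, col6 TEXT);
CREATE TABLE main_wide (txn_id INTEGER, col1 TEXT, col2 TEXT,
                        col3 TEXT, col4 TEXT, col5 TEXT, col6 TEXT);
""")

# Hour 1: stage_t1 has txn 1-4; stage_t2 lags (missing 3, 4);
# stage_t3 lags (missing 4). One row per txn_id, for simplicity.
cur.executemany("INSERT INTO stage_t1 VALUES (?,?,?)",
                [(1, 'A', 'B'), (2, 'C', 'D'), (3, 'E', 'F'), (4, 'G', 'H')])
cur.executemany("INSERT INTO stage_t2 VALUES (?,?,?)",
                [(1, 'I', 'J'), (2, 'M', 'N')])
cur.executemany("INSERT INTO stage_t3 VALUES (?,?,?)",
                [(1, 'O', 'P'), (2, 'Q', 'R'), (3, 'S', 'T')])

# Step 1: LEFT OUTER JOIN on txn_id; attributes of lagging txn_ids land as NULL.
cur.execute("""
INSERT INTO main_wide
SELECT t1.txn_id, t1.col1, t1.col2, t2.col3, t2.col4, t3.col5, t3.col6
FROM stage_t1 t1
LEFT JOIN stage_t2 t2 ON t2.txn_id = t1.txn_id
LEFT JOIN stage_t3 t3 ON t3.txn_id = t1.txn_id
""")

# Hour 2: the lagging rows arrive in stage from the upstream job.
cur.executemany("INSERT INTO stage_t2 VALUES (?,?,?)",
                [(3, 'X', 'Y'), (4, 'Z', 'W')])
cur.executemany("INSERT INTO stage_t3 VALUES (?,?,?)", [(4, 'U', 'V')])

# Step 2: backfill the NULL columns from the now-arrived stage rows.
cur.execute("""
UPDATE main_wide
SET col3 = (SELECT col3 FROM stage_t2 WHERE stage_t2.txn_id = main_wide.txn_id),
    col4 = (SELECT col4 FROM stage_t2 WHERE stage_t2.txn_id = main_wide.txn_id)
WHERE col3 IS NULL
""")
cur.execute("""
UPDATE main_wide
SET col5 = (SELECT col5 FROM stage_t3 WHERE stage_t3.txn_id = main_wide.txn_id),
    col6 = (SELECT col6 FROM stage_t3 WHERE stage_t3.txn_id = main_wide.txn_id)
WHERE col5 IS NULL
""")

remaining = cur.execute(
    "SELECT COUNT(*) FROM main_wide WHERE col3 IS NULL OR col5 IS NULL"
).fetchone()[0]
print(remaining)  # 0 once the lagging rows have arrived and been backfilled
```

Note the sketch keeps one row per txn_id in each stage table; with duplicate txn_ids (as in the sample data), the outer join fans out and the backfill would need to be keyed accordingly.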

I wanted to understand from the experts whether there is a better way to handle such data streaming challenges?

TABLE1
TXN_ID COL1 COL2 Batch_time

1        A     B    9hrs
2        C     D    9hrs
3        E     F    9hrs
4        G     H    9hrs


TABLE2
TXN_ID COL3 COL4 Batch_time

1       I     J    9hrs
1       K     L    9hrs
2       M     N    9hrs


TABLE3
TXN_ID COL5 COL6 Batch_time

1       O     P   9hrs
2       Q     R   9hrs
3       S     T   9hrs
3       U     V   9hrs

Regards
Yudhi

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Jun 09 2023 - 07:31:17 CEST
