Where the data stitching/update/deduplication should happen in a pipeline

From: yudhi s <learnerdatabase99_at_gmail.com>
Date: Thu, 29 Feb 2024 10:11:38 +0530
Message-ID: <CAEzWdqe3yvNq7njyB9EcOY7PC=6smFgAxWsV30+TfwQ582oyeQ_at_mail.gmail.com>



Hi,
We are designing a system which is going to move data from input files(in Avro format) to goldengate to kafaka topics to the database. Incoming files-->GGS--> KAFKA-->OLTP Database. This would be a heavy transactional sysetm processing ~10K txn/second. The database is supposed to show the near real time transactions to the users. The transactions which come from kafka topics will be asynchronous in nature and also there are chances of duplicate data being ingested from kafka topics. So the data has to be stitched/ updated/deduplicated before showing it to the users a complete transaction or say before persisting it to the normalized data model which would be ready for the querying by the end users.

So where should we perform this stitching/update/deduplication work in the workflow? Should it happen in the application layer, somewhere on the Kafka side (e.g. using the poison-pill concept), or should it happen in a staging area in the database, persisting all the pieces as they arrive from the Kafka topics? Adding another staging layer will add more time before the data becomes visible to the users, so it may no longer be near real time.
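
For the staging-area option, the sketch below is roughly what I mean (again only an illustration; python-oracledb, the DSN and all table/column names are made up), i.e. land the Kafka payloads as-is in a staging table and let a periodic MERGE collapse the duplicates on the natural key into the normalized table:

import oracledb

conn = oracledb.connect(user="app", password="secret",
                        dsn="dbhost/orclpdb")          # hypothetical connection details
cur = conn.cursor()

cur.execute("""
    MERGE INTO txn_normalized t
    USING (SELECT txn_id, MAX(txn_ts) AS txn_ts, MAX(amount) AS amount
             FROM txn_stage
            GROUP BY txn_id) s            -- collapse duplicates (illustrative only)
       ON (t.txn_id = s.txn_id)
     WHEN MATCHED THEN
          UPDATE SET t.txn_ts = s.txn_ts, t.amount = s.amount
     WHEN NOT MATCHED THEN
          INSERT (txn_id, txn_ts, amount)
          VALUES (s.txn_id, s.txn_ts, s.amount)
""")
conn.commit()
cur.close()
conn.close()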

Or should we persist the data as-is in the staging area and serve it from there for users who are okay with partial transaction data, while serving the complete transaction data from the normalized tables to the other users who want to see it as a complete transaction?

What is the appropriate design to address such use cases?

Regards
Yudhi

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Feb 29 2024 - 05:41:38 CET