Re: How to create large data sets for performance test

From: Lok P <loknath.73_at_gmail.com>
Date: Wed, 31 May 2023 02:03:43 +0530
Message-ID: <CAKna9VbTWMerG-Kr5P9FFOz4JteB0KQkiLPZjfkzuRBXMiQbQQ_at_mail.gmail.com>



Thank you, Lothar.

In the scenario above, I do understand that the production data is moving to the cloud under defined security protocols, with compliance fully taken into consideration, and that those production environments are fully isolated. But the other environments (for example, the performance environment) are not treated that way, which is why we want some additional fields to be masked.

We will need to check whether any of the primary/foreign keys also contain sensitive PI/PCI data. Also, correct me if I am wrong, but in case we do have such data, is there any way to mask the input data using some key, so that it can be reverted to the original values with the same key whenever needed?
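To illustrate the kind of keyed, reversible masking I have in mind (purely a sketch of the idea, not something we have built; the column value and key handling below are made up): a deterministic cipher such as AES-SIV maps the same input to the same token, so masked keys would still join consistently across tables, and anyone holding the key can recover the original value.

# Sketch only: reversible, deterministic masking with a keyed cipher (AES-SIV).
# Requires the "cryptography" package; key handling is deliberately simplified.
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

key = AESSIV.generate_key(bit_length=512)    # in practice, kept in a KMS/vault
cipher = AESSIV(key)

def mask(value: str) -> str:
    # Deterministic: the same input always yields the same token,
    # so masked key columns still join across tables.
    return cipher.encrypt(value.encode("utf-8"), None).hex()

def unmask(token: str) -> str:
    # Reversible, but only for holders of the key.
    return cipher.decrypt(bytes.fromhex(token), None).decode("utf-8")

original = "4111-1111-1111-1111"             # made-up PCI-style value
masked = mask(original)
assert mask(original) == masked              # deterministic
assert unmask(masked) == original            # reversible with the same key

One caveat with such an approach: the token is longer than the original and not format-preserving, so column widths/formats may need adjusting; format-preserving encryption (e.g. FF3-1) is the usual answer when the masked value must keep the original shape.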

Basically, we have no prior experience with data masking, so we wanted to understand the fastest possible way to mask this large volume of production data. It seems we either need to mask the data in the source database itself (Oracle) and then move it, mask it in the S3 bucket and then move it, or mask it at the end of the stream (in Redshift/Snowflake etc.). So we need guidance on whether there is any common approach/tool through which we can achieve this?
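To make concrete what masking "in the stream" could look like (again just a sketch, not a tool recommendation; the key, column names and file names are invented): whichever point we mask at, the property we need is that every extract applies the same deterministic transformation, so foreign keys still match downstream. For columns that never need to be reverted, a keyed one-way hash applied while staging extracts for S3 would look roughly like this:

# Sketch only: one-way, deterministic masking (HMAC-SHA256) applied while
# staging CSV extracts, so S3/Redshift/Snowflake never see raw PI/PCI values.
# The key, column names and file names below are hypothetical.
import csv
import hashlib
import hmac

MASK_KEY = b"load-me-from-a-secrets-manager"   # never hard-code in real use
SENSITIVE_COLUMNS = {"customer_name", "card_number", "ssn"}

def mask_value(value: str) -> str:
    # Keyed and deterministic: the same value masks identically in every
    # table and every load, so joins keep working; it cannot be reversed.
    return hmac.new(MASK_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_extract(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in SENSITIVE_COLUMNS & set(row):
                row[col] = mask_value(row[col])
            writer.writerow(row)

# e.g. mask_extract("txn_20230530.csv", "txn_20230530_masked.csv") before
# uploading the masked file to the S3 landing bucket.

At our volumes this would of course have to run in the extraction/ETL layer (or as SQL in the source database) rather than row by row in a script, but the join-consistency point is the same wherever the masking happens.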

Regards
Lok

On Tue, May 30, 2023 at 3:00 PM Lothar Flatz <l.flatz_at_bluewin.ch> wrote:

> Hi Lok,
>
> Actually, that is not the kind of job that should be done in the cloud,
> as it is resource heavy and needs a lot of control.
> Be especially careful when you replace Exadata with different hardware
> in a DWH context: there are capabilities of Exadata that no other
> hardware can provide.
> There is one thing that puzzles me: if you consider moving the data to
> the cloud, would not the same security concerns apply as you have now?
> If you decide to move the data to the cloud once your test is
> successful, why would it be OK then, but not now?
>
> Other than that: you could consider changing just the sensitive
> customer data and leaving the keys (foreign or primary) as they are. I
> hope your keys are just meaningless numbers, are they?
> I know there are tools on the market to anonymize data. Even Oracle
> sells them. I have never used any, though.
>
> Regards
>
> Lothar
>
> On 29.05.2023 at 22:55, Lok P wrote:
> > Hello Listers ,
> > We have an existing production system (Oracle Database 19c on
> > Exadata) which is live and running on premises (it is a financial
> > system). This data is replicated to the cloud (AWS S3/data lake),
> > where multiple transformations happen, and it is finally moved to
> > multiple downstream systems/databases such as Redshift and Snowflake,
> > on which reporting and analytics APIs/applications run.
> >
> > To run performance tests for these reporting/analytics applications
> > and for the data pipeline, we need a similar volume of data generated
> > with the same data pattern/skewness and with the same level of
> > integrity constraints as exists in the current Oracle production
> > database. The performance of databases like Snowflake depends heavily
> > on how the incoming data is clustered, so it is important that we
> > have a similar data pattern/skewness/order to that of the production
> > environment, or the test will not give accurate results. In the
> > current production system we load ~500 million rows into our key
> > transaction table daily (at least 5-6 tables are ~10+ TB in size in
> > the production Oracle database). The current production system holds
> > ~6 months of data, and we want to run a performance test on at least
> > ~3 months' worth of data.
> >
> > We thought of copying the current Oracle production data to the
> > performance environment; however, the production data contains many
> > sensitive customer columns which cannot be moved to other
> > environments because of compliance restrictions. Also, joining on
> > masked column values can be challenging if they are not masked the
> > same way across tables. So I wanted to understand from the experts
> > whether there is any easy way (or tool, etc.) to generate similar
> > performance data at such a high volume, in a short time, for this
> > performance testing need?
> >
> > Regards
> > Lok
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Tue May 30 2023 - 22:33:43 CEST
