Re: How to create large data sets for performance test

From: Priit Piipuu <priit.piipuu_at_gmail.com>
Date: Thu, 1 Jun 2023 19:33:13 +0200
Message-ID: <CAJYY02iWTjg95z5psr6+Ev7_xqFj+eeCF27s=CXmLsEtJVF6Aw_at_mail.gmail.com>



Hi,

We built our own data masking tool and integrated it into the pipeline that creates the test environments. For discovery we run regular expressions against the data dictionary and then update the offending columns. The first version of the tool used hashing, but people started to complain that 064719ad4da102a1ebac60ff0f3454d6 does not look like a human name, and there were other, more subtle problems with hashing as well. So now we generate fake data and do the mapping as part of the update.
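To make the idea concrete, here is a minimal sketch of that kind of approach, not our actual tool: match column names in the Oracle data dictionary against regex patterns, then map each real value to a deterministic fake value so the same input always masks to the same output (which is what keeps joins across tables working). It assumes python-oracledb and Faker are available; the connection details, schema name, and patterns are placeholders, and the in-memory dict would have to be a persisted mapping table at real volumes.

import re
import oracledb
from faker import Faker

# Placeholder patterns for "looks sensitive" column names.
SENSITIVE_NAME_PATTERNS = [
    r"(FIRST|LAST|FULL)?_?NAME",
    r"SSN|NATIONAL_ID|TAX_ID",
    r"EMAIL",
    r"PHONE|MOBILE",
    r"ADDR(ESS)?",
]

def discover_candidate_columns(conn, owner):
    """Return (table, column) pairs whose names match a sensitive pattern."""
    pattern = re.compile("|".join(SENSITIVE_NAME_PATTERNS), re.IGNORECASE)
    with conn.cursor() as cur:
        cur.execute(
            """select table_name, column_name
                 from all_tab_columns
                where owner = :owner
                  and data_type in ('VARCHAR2', 'CHAR', 'NVARCHAR2')""",
            owner=owner,
        )
        return [(t, c) for t, c in cur if pattern.search(c)]

# Deterministic mapping: the same real value always gets the same fake value,
# so masked key/join columns stay consistent across tables.
_fake = Faker()
_mapping: dict[str, str] = {}

def mask_value(real_value: str) -> str:
    if real_value not in _mapping:
        _fake.seed_instance(real_value)       # seed the generator from the real value
        _mapping[real_value] = _fake.name()   # names here; other generators per column type
    return _mapping[real_value]

if __name__ == "__main__":
    # Placeholder credentials/DSN.
    conn = oracledb.connect(user="scott", password="tiger", dsn="db-host/service")
    for table, column in discover_candidate_columns(conn, "APP_OWNER"):
        print(f"candidate: {table}.{column}")
        # A real tool would UPDATE {table} SET {column} = mask_value({column}) here,
        # ideally in batches, and review the candidate list before touching anything.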

The data masking tools I've looked at have been decidedly unimpressive. ReDiscovery and DataDefender can in theory use everything OpenNLP supports for discovery, but their Oracle support is somewhat lacking. There are Oracle-specific tools that do only very simple discovery, and some of them do not even have a documented API; how would you even use those? So for us, the search for a working third-party data masking product continues :)

On Mon, 29 May 2023 at 22:57, Lok P <loknath.73_at_gmail.com> wrote:

> Hello Listers,
> We have an existing production system (Oracle Database 19c on Exadata)
> which is live and running on premise (it's a financial system). This data
> is replicated to the cloud (AWS S3/data lake), where multiple
> transformations happen, and it is finally moved to multiple downstream
> systems/databases like Redshift, Snowflake, etc., on which the reporting
> and analytics APIs/applications run.
>
> To do performance tests for these reporting/analytics applications and
> also for the data pipeline, we need to generate a similar volume of data
> with the same data pattern/skewness and the same level of integrity
> constraints as exists in the current Oracle production database.
> Performance of databases like Snowflake depends heavily on how the
> incoming data is clustered, so it's important that we have a similar data
> pattern/skewness/order to the production environment, or else the tests
> won't give accurate results. We are getting ~500 million rows loaded into
> our key transaction table on a daily basis in the current production
> system (at least 5-6 tables are ~10+ TB in size in the production Oracle
> database). The current production system holds ~6 months of data, and we
> want to do a performance test on at least ~3 months' worth of data.
>
> We thought of copying the current Oracle production data to the
> performance environment; however, the production data has many sensitive
> customer columns which can't be moved to other environments because of
> compliance restrictions. Also, joining on masked columns can be
> challenging if the masked values are not the same across tables. So I
> wanted to understand from the experts: is there any easy way (or any
> tool, etc.) to generate similar performance data at such a high volume,
> in a short time, for this performance testing need?
>
> Regards
> Lok
>

--
http://www.freelists.org/webpage/oracle-l
Received on Thu Jun 01 2023 - 19:33:13 CEST