Home » RDBMS Server » Performance Tuning » Updating text string to mask address characters (Oracle 11)
Updating text string to mask address characters [message #632771] |
Thu, 05 February 2015 09:33 |
|
brown_zzz
Messages: 39 Registered: August 2012 Location: United Kingdom
|
Member |
|
|
I'm trying to do an update on a column in 100 millions rows in a table to mask the real addresses by replacing an x for every letter except the first in every word in a string. Hence "22 The Old Kent Road" becomes "22 Txx Oxx Kxxx Rxxx".
I've tried both statements below and they're both very slow - no significant difference in speed. Does anyone know a faster way of doing it please?
To test:
SELECT regexp_replace(INITCAP(LOWER('22 the old kent road')),'[a-z]','x') from dual;
SELECT regexp_replace(INITCAP(LOWER('22 the old kent road')),':lower:','x') from dual;
To implement:
UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),'[a-z]','x');
UPDATE my_table set my_column_string = regexp_replace(INITCAP(LOWER(my_column_string)),':lower:','x');
|
|
|
|
|
|
Re: Updating text string to mask address characters [message #632802 is a reply to message #632780] |
Thu, 05 February 2015 18:41 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
The performance of updating every row in a large table has very little to do with the functions used in the SELECT list. It has far more to do with the amount of UNDO (rollback) IO and REDO (recovery) IO you are generating.
A neat way to minimise these overheads is to create another table that contains the values you want (CREATE TABLE new AS SELECT func(...), other columns FROM old). Then when you are done, drop (or rename) the old table and rename the new table in its place. Also remember to rebuild indexes.
Ross Leishman
|
|
|
|
Re: Updating text string to mask address characters [message #632836 is a reply to message #632824] |
Fri, 06 February 2015 03:15 |
|
FDAVIDOV
Messages: 20 Registered: December 2014
|
Junior Member |
|
|
Brown_zzz,
I don't know if you solved your problem, but anyway here is my 5 cents contribution.
The fact that you are able to "PHRASE" the replacement in a very ellegant way (using regular exp.) does not change the fact that the PROCESSOR needs to handle each character one by one. When you use regexp_replace you don't really have any control on HOW the analysis of the string takes place, nor how and when the replacement is done.
I would suggest that you try implementing a small piece of code that scans the string character by character and (sort of State-machine):
- Define a variable that would display the STATE (e.g. 'IN_WORD', 'IN_SEPARATOR', 'IN_NUMBER' - I use here strings to explain the idea, and you may wish to use a number instead of strings to increase performace),
- Define a variable (TARGET VARIABLE) to receive the new "value" after replacing character with "X",
- The first LETTER you encounter you copy as-is to a target variable and set the state to 'IN_WORD',
- While in state 'IN_WORD', any character (letter, symbol, digit) you find till the next separator (blank, comma, etc.) you append an "X" to the target variable,
- As soon as you find a separator (" ", ",", etc.) you change the state to 'IN_SEPARATOR' and append as-is the separator to the target string,
- When you are in state 'IN_SEPARATOR' and find a digit, you change the state to 'IN_NUMBER' and append as-is any additional character to the target variable till the next separator (or end of string).
I'm not sure if this is the very best and fastest algorithm, but I'm quite sure you will see a significant boost in performance.
As for the suggestion of copying the data to a new table... I'm not sure if this will help (100 million records to copy is not something small).
Good luck,
Fernando.
|
|
|
|
Re: Updating text string to mask address characters [message #632841 is a reply to message #632840] |
Fri, 06 February 2015 03:48 |
|
FDAVIDOV
Messages: 20 Registered: December 2014
|
Junior Member |
|
|
Michel,
No doubt, a built-in function that DOES THE SAME would be faster than a (NO SO COMPLEX) piece of PL/SQL.
My point is that with regexp_replace you may find youself scanning the whole string several times instead of once (of course, if you know the exact details of the internal algorithm implemented within this function, I might change my opinion).
As for copying the whole table, once again, I'm not so sure (if may depend on, for instance, how many and complext indices the table has). Also, I don't quite know how the internals of Oracle handle updates, but if I would need to guess, it would smartly exploit the fact that both initial and final strings are of the same size (if not, it would be a little bit disapointing...).
Developing the small functionality I mentioned is something around 1/2 how or so, and hence I think it is worth trying.
Wouldn't you be happy to see some EVIDENCES proving you RIGHT or WRONG?
Of course, Brown_zzz my think differently...
|
|
|
Re: Updating text string to mask address characters [message #632845 is a reply to message #632841] |
Fri, 06 February 2015 04:00 |
cookiemonster
Messages: 13958 Registered: September 2008 Location: Rainy Manchester
|
Senior Member |
|
|
FDAVIDOV wrote on Fri, 06 February 2015 09:48Michel,
No doubt, a built-in function that DOES THE SAME would be faster than a (NO SO COMPLEX) piece of PL/SQL.
My point is that with regexp_replace you may find youself scanning the whole string several times instead of once (of course, if you know the exact details of the internal algorithm implemented within this function, I might change my opinion).
So you're assuming that oracle can't write a competent regexp?
FDAVIDOV wrote on Fri, 06 February 2015 09:48
As for copying the whole table, once again, I'm not so sure (if may depend on, for instance, how many and complext indices the table has).
Also, I don't quite know how the internals of Oracle handle updates, but if I would need to guess, it would smartly exploit the fact that both initial and final strings are of the same size (if not, it would be a little bit disapointing...).
Well if you have a lot of indexes then it's possible the time spent recreating them outweighs the time saved by doing CTAS, but absent that it's a known fact among people who do these things that CTAS is a lot faster than update, try it if you don't believe us.
FDAVIDOV wrote on Fri, 06 February 2015 09:48
Developing the small functionality I mentioned is something around 1/2 how or so, and hence I think it is worth trying.
Wouldn't you be happy to see some EVIDENCES proving you RIGHT or WRONG?
Why don't you knock up an example then and see how it compares?
|
|
|
Re: Updating text string to mask address characters [message #632846 is a reply to message #632841] |
Fri, 06 February 2015 04:03 |
|
FDAVIDOV
Messages: 20 Registered: December 2014
|
Junior Member |
|
|
Michel,
One more thing we somethines tend to forget:
As fas as I recall, the Oracle engine is written in "C" language, right? (in fact, it doesn't really matter if it is "C", "C+", "C++", or any other langauge; the way ALL programming languages TRANSLATE into machine code is more or less the same for the current available processors technologies).
Now, think how would your "C" code would look if you would need to implement this functionality...
All the best,
Fernando.
[Updated on: Fri, 06 February 2015 04:04] Report message to a moderator
|
|
|
|
|
|
Re: Updating text string to mask address characters [message #632917 is a reply to message #632851] |
Fri, 06 February 2015 21:27 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
Baseline: Generate 1M rows with varying mixed case content
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
FROM n n1
CROSS JOIN n n2
Result: 2.66 seconds
Progression: REGEXP_REPLACE with character class
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
FROM n n1
CROSS JOIN n n2
Result: 15.65 seconds (cost = 13sec / 1M)
Progression: REGEXP_REPLACE with Character Range
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
FROM n n1
CROSS JOIN n n2
Result: 15.37 seconds (cost = 13sec / 1M)
Progression: TRANSLATE()
WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
SELECT MAX(
TRANSLATE(
INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp'))
, 'abcdefghijklmnopqrstuvwxyz', 'xxxxxxxxxxxxxxxxxxxxxxxxxx'))
FROM n n1
CROSS JOIN n n2
Result: 20.56 seconds (cost = 18sec / 1M)
Conclusion: Over the course of 300M rows, there is 3900 seconds spent on REGEXP_REPLACE. That's about an hour. (I ran my tests on Oracle's APEX cloud service; your mileage might differ, but you could run the queries above locally to get some benchmarks).
If that's going to make a critical difference to your query, then by all means go looking for a better alternative to REGEXP_REPLACE (not that I think you'll find one)
However, I very much doubt you would have posted this here if it had finished in a couple of hours. If I'm right and it's taking many, many hours, then maybe you should go back and check out that CTAS solution we discussed above.
Ross Leishman
|
|
|
Re: Updating text string to mask address characters [message #632922 is a reply to message #632771] |
Sat, 07 February 2015 02:15 |
John Watson
Messages: 8960 Registered: January 2010 Location: Global Village
|
Senior Member |
|
|
May I suggest a radical solution? Use the data redaction facility, available from 11.2.0.4.
It has many advantages over the technique you are using now. First, it is declarative: set up the policy, and everyone gets it according to session context. Second, the redaction policies can of course be much more sophisticated than the masking demo'ed so far. Third, it occurs (I think) at column projection time, so no performance hit. Lastly, you don't actually update anything, so the underlying data (incuding performance characteristics) are preserved.
Yes, you have to buy the Advanced Security Option - but you may find it easy to motivate the purchase. You get TDE and network encrption as well.
|
|
|
Re: Updating text string to mask address characters [message #632947 is a reply to message #632917] |
Sun, 08 February 2015 00:32 |
|
FDAVIDOV
Messages: 20 Registered: December 2014
|
Junior Member |
|
|
Hello guys,
I'm addressing here two entries: the one from Ross Leishman and the one from John Watson.
Will start from the second (easier and shorter):
It would be reasonable to assume that, if the only thing Brown_zzz needs is to mask information upon display/select, he/she could just do that (e.g. select REGEXP_REPLACE(...)) and wouldn't post the question to begin with. Hence, it appears natural to think that he/she has all the addresses in the DB already and needs to mask them.
Now to Ross:
I made a small experiment which is generating a date using to_char(SYSDATE + i ,'DAY DD MONTH YYYY') [here, "i" is the index in a loop from 1 to nnn] getting "MONDAY 09 FEBRUARY 2015". Then, used this string as the input for masking using two methods: (1) a simple function as I suggested in a previous entry and (2) the REGEXP_REPLACE function.
The results are as follows (looping 1,000,000 times in both cases; the "Cost" was manually calculated and added to the printing of the code - see below):
Using my simple function:
Starting loop at: 08-02-2015 07:54:15.287399000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 07:54:37.772596000 -----> Cost: 22.485197 secs.
Using regexp_replace:
Starting loop at: 08-02-2015 07:55:56.270955000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 07:56:20.162582000 -----> Cost: 23.891627 secs.
Conclussions:
1) For THIS PARTICULAR NEED(!), it would appear that using a custom function gives better results than using built-in functions (though a different of ~1.5 secs is too small to make a final assertion),
2) Update the existing table VS. create a new table needs to be further examined (whoever can do it, should run both scenarios and, if kind enough, post the results herein).
Some additional comments to Ross' findings:
The test using TRANSLATE is neither "fare" nor applicable to the actual need (i.e. first letter of a name should remain unchanged). Not "fare" because you are asking the system that, for each and every character in the input, scan two strings ("ascd..." and "xxxx...") to find the match; in fact, you don't quite care which letter is in the input and you just need to replace it with "x".
Finally, I'm adding here the code I used for the test explained above.
Fernando.
set serveroutput on
DECLARE
l_Date_String VARCHAR2(64) ;
l_Result_String VARCHAR2(64) ;
l_Loops NUMBER ;
FUNCTION Mask_String(p_In_String VARCHAR2) RETURN VARCHAR2
AS
l_Length NUMBER := LENGTH(p_In_String) ;
l_State NUMBER := 0 ; -- 0 = init, 1 = in word, 2 = in number, 3 = in separator
l_Result VARCHAR2(128) := '' ;
l_This_Char CHAR(1) ;
Begin
for i in 1..l_Length
LOOP
l_This_Char := substr( p_In_String , i, 1 ) ;
CASE l_State
WHEN 0 THEN
IF ( l_This_Char >= '0') AND ( l_This_Char <= '9') THEN
l_State := 2 ; -- in number ;
ELSIF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
l_State := 3 ; -- in separator ;
ELSE
l_State := 1 ; -- in word ;
END IF ;
l_Result := l_Result || l_This_Char ;
WHEN 1 THEN
IF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
l_State := 3 ; -- in separator ;
l_Result := l_Result || l_This_Char ;
ELSE
l_Result := l_Result || 'x' ;
END IF ;
WHEN 2 THEN
IF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
l_State := 3 ; -- in separator ;
END IF ;
l_Result := l_Result || l_This_Char ;
WHEN 3 THEN
IF ( l_This_Char >= '0') AND ( l_This_Char <= '9') THEN
l_State := 2 ; -- in number ;
ELSIF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
l_State := 3 ; -- in separator ;
ELSE
l_State := 1 ; -- in word ;
END IF ;
l_Result := l_Result || l_This_Char ;
END CASE ;
End Loop ;
RETURN l_Result ;
End ;
BEGIN
l_Loops := 1000000 ;
dbms_output.put_line('Starting loop at: ' || to_char(systimestamp,'dd-mm-yyyy hh24:mi:ss.FF') ) ;
FOR i IN 1..l_Loops
Loop
l_Date_String := to_char(SYSDATE + i ,'DAY DD MONTH YYYY') ;
--- l_Result_String := Mask_String(l_Date_String) ;
l_Result_String := regexp_replace(INITCAP(LOWER(l_Date_String)),'[a-z]','x') ;
IF i in (1,l_Loops) THEN
dbms_output.put_line('Masking of:"' || l_Date_String || '"' || CHR(10) ||
' is: ' || l_Result_String ) ;
END IF ;
End Loop ;
dbms_output.put_line('Ending loop at: ' || to_char(systimestamp,'dd-mm-yyyy hh24:mi:ss.FF') ) ;
END ;
MOD EDIT: ADDED CODE TAGS - RL
[Updated on: Sun, 08 February 2015 04:51] by Moderator Report message to a moderator
|
|
|
Re: Updating text string to mask address characters [message #632948 is a reply to message #632947] |
Sun, 08 February 2015 05:00 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
A test of rebuild vs UPDATE
SQL>
SQL> CREATE TABLE bigtest (
2 pk
3 ,fk
4 ,txt
5 )
6 TABLESPACE BIG
7 AS
8 WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
9 SELECT
10 CAST(n1.n * 1000000 + n2.n * 1000 + n3.n AS NUMBER(10))
11 , CAST(MOD(n1.n * 1000000 + n2.n * 1000 + n3.n, 54321) AS NUMBER(10))
12 , CAST(INITCAP(TO_CHAR(sysdate + MOD(n1.n * 1000000 + n2.n*1000 + n3.n, 10000), 'JSp')) AS VARCHAR2(150))
13 FROM n n1
14 CROSS JOIN n n2
15 CROSS JOIN n n3
16 WHERE ROWNUM <= 10000000
17 /
Table created.
Elapsed: 00:01:38.48
SQL>
SQL> CREATE UNIQUE INDEX bigtest_pk ON bigtest(pk) TABLESPACE BIG;
Index created.
Elapsed: 00:01:03.44
SQL> CREATE INDEX bigtest_fk ON bigtest(fk) TABLESPACE BIG;
Index created.
Elapsed: 00:00:58.91
SQL>
SQL>
SQL> CREATE TABLE bigtest1
2 AS
3 SELECT pk, fk, CAST(regexp_replace(txt, '[[:lower:]]', 'x') AS VARCHAR2(150)) AS txt
4 FROM bigtest
5 /
Table created.
Elapsed: 00:04:04.44
SQL>
SQL>
SQL> UPDATE bigtest
2 SET txt = regexp_replace(txt, '[[:lower:]]', 'x')
3 /
10000000 rows updated.
Elapsed: 00:17:57.51
SQL>
SQL>
SQL>
SQL> SET TIMING OFF
SQL> SPOOL OFF
Summary:
- Rebuild the table: 4:04
- Build 2 indexes: 2:02
- Total for rebuild 10M row table + 2 indexes: 6:06
- UPDATE 10M rows: 17:57
Can we possibly put this one to bed now and just SUGGEST to the OP that rebuilding the table from scratch is worth a go - not a solution necessarily, just worth a go - without getting too distracted about saving a potential 7% on writing a better function. The reasons - as I discussed earlier - are because of UNDO and REDO IO, which form the bulk of the cost of the UPDATE, and can be largely eliminated by rebuilding the table.
Also John's solution (Data Redaction) bears investigation.
Ross Leishman
|
|
|
|
Re: Updating text string to mask address characters [message #632954 is a reply to message #632952] |
Sun, 08 February 2015 17:44 |
rleishman
Messages: 3728 Registered: October 2005 Location: Melbourne, Australia
|
Senior Member |
|
|
FDAVIDOV wrote on Mon, 09 February 2015 04:26
2) I'll mark a very small point to me (in spite commonly found assumptions, sometimes you CAN write something more efficient than an Oracle internal).
You're very quick to award yourself a point, even if it is a small one. Let's take a deeper look at your function.
First, I will replicate your results on my own database:
TESTING BESPOKE FUNCTION
Starting loop at: 08-02-2015 21:59:36.701000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 21:59:56.039000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:00:19.593000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:00:38.961000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:00:48.855000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:01:08.403000000
PL/SQL procedure successfully completed.
Summary of results:
1: 19.3
2: 19.4
3: 19.5
TEST 1 : REGEXP_REPLACE [a-z]
Starting loop at: 08-02-2015 22:14:00.874000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:14:21.033000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:14:26.551000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:14:46.529000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:14:52.047000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:15:11.786000000
PL/SQL procedure successfully completed.
Summary of results:
1: 20.2
2: 20.0
3: 19.7
Yep, it definitely looks like the bespoke function is a tiny bit faster than REGEXP_REPLACE.
But what about this line of code in your test case:
l_Result_String := regexp_replace(INITCAP(LOWER(l_Date_String)),'[a-z]','x') ;
That LOWER function seems redundant. It converts the whole string to lower case, but then the INITCAP function does the same thing again except for the initial letters of each word.
Let's try again with the redundant LOWER removed
TEST 2 : REGEXP_REPLACE [a-z] (with removed redundant LOWER)
Starting loop at: 08-02-2015 22:17:01.934000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:17:21.372000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:17:37.826000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:17:57.394000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:18:05.916000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:18:25.064000000
PL/SQL procedure successfully completed.
Summary of results:
1: 19.4
2: 19.6
3: 19.2
Hmmm. These results are almost identical to the bespoke function now. We've lost all of our performance improvements.
Just out of interest (it shouldn't make any difference, right?) let's rerun the PL/SQL for REGEXP_REPLACE with the code for the bespoke function removed from the PL/SQL block. After all, it's not being called in this test, so it's fairer if we run REGEXP_REPLACE with it removed.
TEST 3 : REGEXP_REPLACE [a-z] (with removed redundant bespoke function)
Starting loop at: 08-02-2015 22:23:07.129000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:23:26.327000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:23:46.746000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:24:05.723000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:24:17.060000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:24:36.538000000
PL/SQL procedure successfully completed.
Summary of results:
1: 19.2
2: 19.0
3: 19.5
That was unexpected. Just having that little extra code in the block to parse (and maybe optimise) was costing a couple of tenths of a second. We now have the situation where REGEXP_REPLACE appears faster than the bespoke function - maybe not statistically faster, but it's an interesting result.
But hang on, in my earlier SQL test, it seemed like the Character Class [[:lower:]] in REGEXP_REPLACE was measurably faster than the Character Range [a-z]. Wouldn't it be fairer to pit the bespoke function against the faster built-in function?
Let's replace
l_Result_String := regexp_replace(INITCAP(l_Date_String),'[a-z]','x') ;
With
l_Result_String := regexp_replace(INITCAP(l_Date_String),'[[:lower:]]','x') ;
TEST 4 : REGEXP_REPLACE [[:lower:]]
Starting loop at: 08-02-2015 22:40:23.509000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:40:38.651000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:40:51.610000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:41:06.772000000
PL/SQL procedure successfully completed.
Starting loop at: 08-02-2015 22:41:09.365000000
Masking of:"MONDAY 09 FEBRUARY 2015"
is: Mxxxxx 09 Fxxxxxxx 2015
Masking of:"MONDAY 05 JANUARY 4753"
is: Mxxxxx 05 Jxxxxxx 4753
Ending loop at: 08-02-2015 22:41:25.018000000
PL/SQL procedure successfully completed.
Summary of results:
1: 15.1
2: 15.2
3: 15.6
Right, for the first time we seem to have a statistically significant result. REGEXP_REPLACE with a Character Class appears to be about 25% faster than the bespoke function.
But surely we're missing the whole point here. Didn't the OP present to us a SQL problem? So why are we trying to solve it with a language other than SQL? Surely if this is to have any value, we need to test it in SQL.
TEST CASE 5 : REGEXP_REPLACE VS BESPOKE FUNCTION IN SQL
Let's save the bespoke function to the database so we can use it in SQL
SQL>
SQL>
SQL> CREATE OR REPLACE FUNCTION Mask_String(p_In_String VARCHAR2) RETURN VARCHAR2
2
3 AS
4
5 l_Length NUMBER := LENGTH(p_In_String) ;
6 l_State NUMBER := 0 ; -- 0 = init, 1 = in word, 2 = in number, 3 = in separator
7 l_Result VARCHAR2(128) := '' ;
8 l_This_Char CHAR(1) ;
9 Begin
10
11 for i in 1..l_Length
12 LOOP
13 l_This_Char := substr( p_In_String , i, 1 ) ;
14
15 CASE l_State
16
17 WHEN 0 THEN
18
19 IF ( l_This_Char >= '0') AND ( l_This_Char <= '9') THEN
20 l_State := 2 ; -- in number ;
21 ELSIF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
22 l_State := 3 ; -- in separator ;
23 ELSE
24 l_State := 1 ; -- in word ;
25 END IF ;
26
27 l_Result := l_Result || l_This_Char ;
28
29 WHEN 1 THEN
30 IF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
31 l_State := 3 ; -- in separator ;
32 l_Result := l_Result || l_This_Char ;
33 ELSE
34 l_Result := l_Result || 'x' ;
35 END IF ;
36
37 WHEN 2 THEN
38 IF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
39 l_State := 3 ; -- in separator ;
40 END IF ;
41
42 l_Result := l_Result || l_This_Char ;
43
44 WHEN 3 THEN
45 IF ( l_This_Char >= '0') AND ( l_This_Char <= '9') THEN
46 l_State := 2 ; -- in number ;
47 ELSIF ( l_This_Char = ' ') OR ( l_This_Char = ',') THEN
48 l_State := 3 ; -- in separator ;
49 ELSE
50 l_State := 1 ; -- in word ;
51 END IF ;
52
53 l_Result := l_Result || l_This_Char ;
54
55 END CASE ;
56
57 End Loop ;
58
59 RETURN l_Result ;
60
61 End ;
62 /
Function created.
And now lets pit it against REGEXP_REPLACE (Character Class) in SQL, rather than PL/SQL
SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
2 SELECT MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')), '[[:lower:]]', 'x'))
3 FROM n n1
4 CROSS JOIN n n2
5 /
MAX(REGEXP_REPLACE(INITCAP(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')),'[[:LOWER:]]','X'))
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxx-Txxxx Txxxxxxx Txxxxxxx
Elapsed: 00:00:21.93
SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
2 SELECT MAX(Mask_String(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
3 FROM n n1
4 CROSS JOIN n n2
5 /
MAX(MASK_STRING(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')))
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxxxxxxxx Txxxxxxx Txxxxxxxxxxx
Elapsed: 00:00:47.75
Okay, so now the Bespoke function seems twice as slow as the built-in equivalent. This is going to be true for most if not all built-in functions in SQL vs interpreted bespoke functions, because Oracle uses a separate engine to execute interpreted PL/SQL. This means for every call to the PL/SQL function, we wear the cost of a context shift from SQL to PL/SQL. As can be seen above, the cost of these context shifts is NOT negligible.
And in case we were thinking NATIVE compilation with bespoke PL/SQL functions was the cure for all of our built-in function woes, I recompiled the bespoke function using PLSQL_CODE_TYPE = NATIVE
SQL> WITH n AS (SELECT level AS n FROM DUAL CONNECT BY level <= 1000)
2 SELECT MAX(Mask_String(TO_CHAR(sysdate + MOD(n1.n*1000 + n2.n, 10000), 'JSp')))
3 FROM n n1
4 CROSS JOIN n n2
5 /
MAX(MASK_STRING(TO_CHAR(SYSDATE+MOD(N1.N*1000+N2.N,10000),'JSP')))
----------------------------------------------------------------------------------------------------
Txx Mxxxxxx Fxxx Hxxxxxx Sxxxxxxxxxx Txxxxxxx Txxxxxxxxxxx
Elapsed: 00:00:37.20
It was definitely faster. But not faster than the built-in function.
My conclusion:
Maybe you can sometimes write something faster than built-in functions, but not in this case.
Ross Leishman
|
|
|
Re: Updating text string to mask address characters [message #632960 is a reply to message #632954] |
Sun, 08 February 2015 23:42 |
|
FDAVIDOV
Messages: 20 Registered: December 2014
|
Junior Member |
|
|
Ross,
Quite interesting indeed....
Few comments:
1) In your TEST 3 (run the block using regexp_replace and after removing the unused function) you got a better result. This is quite interesting because the START and END time are measured just before and after the loop and hence it is not clear why the function would be parsed more than once (if at all).
2) Creating a function within the DB (TEST 5): from the very beginning, the problem presented by Brown_zzz was in relation to a once-in-a-lifetime operation, and hence you wouldn't pollute the DB with a function that would be used only once, right?
3) Since I'm as happy about being right as I am learning something new, I'm gladly moving my little point to you, so the updated scores now are:
- Ross : 1.1 Points
- Fernando : 0.0 Points
- Oracle's Consistency : -10.0 Points
Shall we put this to rest now?
|
|
|
Goto Forum:
Current Time: Thu Nov 21 10:10:58 CST 2024
|