Home » Server Options » Text & interMedia » Find subset of string (Oracle 11g, Win 7)
Find subset of string [message #636011] Tue, 14 April 2015 11:14 Go to next message
rajivn786
Messages: 161
Registered: January 2010
Senior Member
Hi,

I am having a hard time to extract text.

Here is the sample create & insert statements.
 

Create table x (a varchar2(4000)); 
Insert into x values ('CT head neg (-) Pt reportedly has 20-39% stenosis of carotid arteries. EKG NSR@70bpm, Trop (-) x2 Unlikely to be cardiac in etiology. Likely ');       
Insert into x values ('Midbrain lacunar CVA+L 40-59% carotid stenosis+TIA 6/2010 (finished Stein sr center outpt PT+pt reports improving word finding.  Did reschedule fu  outpt PT');
Insert into x values ('Repeat Carotid Dopplers 12/12: <20% stenosis b/l at the bifurcation  TTE 10/2014: EF 49%  TEE: 12/15 EF 35% and fibroelastic lesion of the aortic valve  EEG');

CREATE  /*+ parallel(x 32) */    index x_idx on  x (a) indextype
        is ctxsys.context parameters ('stoplist ctxsys.empty_stoplist')
        parallel 32 
       


The o/p needed

20-39%
40-59%
<20%
 


So, effectively, the output should search for term " "stenosis" and ( "carotid" or "%") within 5 words & pick up the range mentioned before or after term stenosis which will be followed or preceeded within 2/3 words mostly.

The code which I started is :

Select  --+ parallel(t 32)
  a 
  from   x
 where contains(a, 'near((stenosis,carotid={%}), 5)') > 0    
Re: Find subset of string [message #636022 is a reply to message #636011] Tue, 14 April 2015 20:36 Go to previous message
Barbara Boehmer
Messages: 9100
Registered: November 2002
Location: California, USA
Senior Member
In order to search for % you must declare the % as a printjoin in a lexer and use that lexer in your index parameters. You can see whether the % is indexed by selecting the token_text column from the dr$your_index_name$i domain index table. In order to search for something that ends with % you must precede it with the % wildcard, so you must use the backslash instead of curly bracketes for escaping like %\%. You can test this by searching separately for just that. You can use substr and instr to find the substring that ends in % then find the substring of that from the last space. You could probably also use regexp_substr instead if you like. Please see the demonstration below. I added some rows to the test data.

SCOTT@orcl> Create table x (a varchar2(4000))
  2  /

Table created.

SCOTT@orcl> Insert ALL
  2  into x values
  3    ('CT head neg (-) Pt reportedly has 20-39% stenosis of carotid arteries. EKG NSR@70bpm, Trop (-) x2 Unlikely to be cardiac in etiology. Likely ')
  4  into x values
  5    ('Midbrain lacunar CVA+L 40-59% carotid stenosis+TIA 6/2010 (finished Stein sr center outpt PT+pt reports improving word finding.  Did reschedule fu  outpt PT')
  6  into x values
  7    ('Repeat Carotid Dopplers 12/12: <20% stenosis b/l at the bifurcation  TTE 10/2014: EF 49%  TEE: 12/15 EF 35% and fibroelastic lesion of the aortic valve  EEG')
  8  into x values
  9    ('percentage without key words 99%')
 10  into x values
 11    ('stenosis 0%')
 12  SELECT * FROM DUAL
 13  /

5 rows created.

SCOTT@orcl> COLUMN a FORMAT A80 WORD_WRAPPED
SCOTT@orcl> SELECT * FROM x
  2  /

A
--------------------------------------------------------------------------------
CT head neg (-) Pt reportedly has 20-39% stenosis of carotid arteries. EKG
NSR@70bpm, Trop (-) x2 Unlikely to be cardiac in etiology. Likely

Midbrain lacunar CVA+L 40-59% carotid stenosis+TIA 6/2010 (finished Stein sr
center outpt PT+pt reports improving word finding.  Did reschedule fu  outpt PT

Repeat Carotid Dopplers 12/12: <20% stenosis b/l at the bifurcation  TTE
10/2014: EF 49%  TEE: 12/15 EF 35% and fibroelastic lesion of the aortic valve
EEG

percentage without key words 99%
stenosis 0%

5 rows selected.

SCOTT@orcl> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_lex', 'BASIC_LEXER');
  3    CTX_DDL.SET_ATTRIBUTE ('test_lex', 'PRINTJOINS', '%');
  4  END;
  5  /

PL/SQL procedure successfully completed.

SCOTT@orcl> CREATE /*+ parallel(x 32) */ index x_idx on x (a)
  2  indextype is ctxsys.context
  3  parameters
  4    ('LEXER	   test_lex
  5  	 stoplist  ctxsys.empty_stoplist')
  6  parallel 32
  7  /

Index created.

SCOTT@orcl> SELECT token_text FROM dr$x_idx$i
  2  /

TOKEN_TEXT
----------------------------------------------------------------
0%
10
12
15
20
20%
2010
2014
35%
39%
40
49%
59%
6
70BPM
99%
AND
AORTIC
ARTERIES
AT
B
BE
BIFURCATION
CARDIAC
CAROTID
CENTER
CT
CVA
DID
DOPPLERS
EEG
EF
EKG
ETIOLOGY
FIBROELASTIC
FINDING
FINISHED
FU
HAS
HEAD
IMPROVING
IN
KEY
L
LACUNAR
LESION
LIKELY
MIDBRAIN
NEG
NSR
OF
OUTPT
PERCENTAGE
PT
REPEAT
REPORTEDLY
REPORTS
RESCHEDULE
SR
STEIN
STENOSIS
TEE
THE
TIA
TO
TROP
TTE
UNLIKELY
VALVE
WITHOUT
WORD
WORDS
X2

73 rows selected.

SCOTT@orcl> SELECT *
  2  FROM   x
  3  WHERE  CONTAINS (a, '%\%') > 0
  4  /

A
--------------------------------------------------------------------------------
CT head neg (-) Pt reportedly has 20-39% stenosis of carotid arteries. EKG
NSR@70bpm, Trop (-) x2 Unlikely to be cardiac in etiology. Likely

Midbrain lacunar CVA+L 40-59% carotid stenosis+TIA 6/2010 (finished Stein sr
center outpt PT+pt reports improving word finding.  Did reschedule fu  outpt PT

Repeat Carotid Dopplers 12/12: <20% stenosis b/l at the bifurcation  TTE
10/2014: EF 49%  TEE: 12/15 EF 35% and fibroelastic lesion of the aortic valve
EEG

percentage without key words 99%
stenosis 0%

5 rows selected.

SCOTT@orcl> SELECT SUBSTR (a, INSTR (a, ' ', -1, 1) + 1)
  2  FROM   (SELECT SUBSTR (a, 1, INSTR (a, '%')) a
  3  	     FROM   x
  4  	     WHERE  CONTAINS
  5  		    (a,
  6  		    'NEAR ((stenosis, carotid=%\%), 5)') > 0)
  7  /

SUBSTR(A,INSTR(A,'',-1,1)+1)
--------------------------------------------------------------------------------
20-39%
40-59%
<20%
0%

4 rows selected.

Previous Topic: Abbreviations and acronyms thesaurus
Next Topic: wild card search on CLOB not giving require result in 12C
Goto Forum:
  


Current Time: Thu Nov 21 07:07:05 CST 2024