Re: Algorithm or ideas wanted for creative text parsing
So far I have done ....
CASE
  WHEN :new.domain_name LIKE '%.imageshack.us' THEN 'imageshack.us' -- we need to collapse these
  WHEN :new.domain_name LIKE '%.adtech.de' THEN 'adtech.de' -- we need to collapse these
  WHEN :new.domain_name LIKE '%.echo.cx' THEN 'echo.cx' -- we need to collapse these
  WHEN :new.domain_name LIKE '%.exs.cx' THEN 'exs.cx' -- we need to collapse these
  WHEN :new.domain_name LIKE '%.bigoo.ws' THEN 'bigoo.ws' -- we need to collapse these
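One possible way to avoid adding a WHEN branch per domain is to keep the collapsible domains in a lookup set and join on the same LIKE pattern. This is only a sketch: collapse_domains, base_domain, sample_rows and the sample host names are hypothetical, and passing unmatched names through unchanged is an assumption, since the CASE fragment shows no ELSE branch.

```sql
-- Sketch only: collapse_domains and sample_rows are hypothetical stand-ins
-- for a real lookup table and the real source table.
WITH collapse_domains AS (
  SELECT 'imageshack.us' AS base_domain FROM dual UNION ALL
  SELECT 'echo.cx' FROM dual
),
sample_rows AS (
  SELECT 'img1.imageshack.us' AS domain_name FROM dual UNION ALL
  SELECT 'www.example.com' FROM dual
)
SELECT s.domain_name,
       -- assumption: names with no matching rule pass through unchanged
       NVL(c.base_domain, s.domain_name) AS collapsed_domain
FROM   sample_rows s
LEFT OUTER JOIN collapse_domains c
       -- same test as each CASE branch: the name ends in '.<base_domain>'
       ON s.domain_name LIKE '%.' || c.base_domain;
```

With the two sample rows, img1.imageshack.us collapses to imageshack.us and www.example.com is returned as is. Two caveats: an underscore in base_domain would act as a LIKE wildcard, and overlapping entries (for example both 'b.c' and 'a.b.c') would return a row more than once.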
Any gotchas or missed rules are welcome. The results so far are pretty good: a sample query on 4M+ rows shows reliable output. There will always be caveats, but I am happy with a 99% "hit ratio", though any improvements are always welcome.
uh oh ... I said the H word ... I am marked now ...
Raj
On 4/10/06, Gus Spier <gspier_at_chiliad.com> wrote:
> Raj,
> It looks to me like you're going to have to do some rule-based ETL. Start
> by parsing your URIs on the dots into varrays and then examining the data ....
> if seg.first == 'www' and seg.last == 'com' then harvest seg[length-1] ...
>
> if seg.last == 'uk' and seg[length-1] == 'co' then harvest seg[length-2] ...
>
> et cetera ad endless nauseam.
>
> But I don't think you can build a script that will reliably trundle out
> there and correctly get what you want first try.
>
> Good luck
> Gus
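The kind of rule Gus describes can be sketched directly in SQL. This sketch swaps the varray split for INSTR with a negative start position, which counts dots from the right. sample_rows and the host names are hypothetical, only the 'co.uk' rule and a default rule are shown, and keeping the suffix together with the harvested name (rather than the name segment alone) is an assumption made to match the 'imageshack.us' style of output earlier in the thread.

```sql
-- Sketch only: sample_rows stands in for the real source table.
WITH sample_rows AS (
  SELECT 'www.example.com' AS domain_name FROM dual UNION ALL
  SELECT 'news.example.co.uk' FROM dual UNION ALL
  SELECT 'localhost' FROM dual
)
SELECT domain_name,
       CASE
         -- rule: names ending in '.co.uk' keep their last three segments
         WHEN domain_name LIKE '%.co.uk'
           THEN SUBSTR(domain_name, INSTR(domain_name, '.', -1, 3) + 1)
         -- default rule: keep the last two segments
         ELSE SUBSTR(domain_name, INSTR(domain_name, '.', -1, 2) + 1)
       END AS harvested
FROM   sample_rows;
```

INSTR returns 0 when the requested dot does not exist, so SUBSTR then starts at position 1 and the whole name is kept. The three sample rows come back as example.com, example.co.uk and localhost. Every further two-part suffix needs its own WHEN branch, which is the "et cetera" part of Gus's point.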
--
----------------------------------------------
Got RAC?
--
http://www.freelists.org/webpage/oracle-l
Received on Mon Apr 10 2006 - 14:46:19 CDT