Re: Algorithm or ideas wanted for creative text parsing
Raj,
It looks to me like you're going to have to do some rule based ETL.
Start by parsing your URIs on the dots into varrays and then examining
the segments:
if seg.first = 'www' and seg.last = 'com' then harvest seg(seg.length-1) ...
if seg.last = 'uk' and seg(seg.length-1) = 'co' then harvest seg(seg.length-2) ...
et cetera ad endless nauseam.
But I don't think you can build a script that will reliably trundle out there and correctly get what you want first try.
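
As a rough illustration of the rule-based idea above, here is a minimal sketch in Python rather than PL/SQL. The suffix table and the function name are hypothetical and only cover a handful of cases; a real rule set would be much longer and would need ongoing maintenance, which is the point made above.

```python
# Hypothetical, hand-maintained rule table: two-label suffixes that behave
# like a TLD. This is only an illustrative sample, not a complete list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "ac.uk", "com.au", "co.jp"}


def registered_domain(host):
    """Return the suffix plus the one label to its left (assumed rule)."""
    # Split on the dots, ignoring case and any trailing dot.
    segs = host.lower().strip(".").split(".")
    # Rule 1: the last two labels form a known suffix, so keep three labels.
    if len(segs) >= 3 and ".".join(segs[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(segs[-3:])
    # Rule 2 (default): treat the last label as the TLD and keep two labels.
    if len(segs) >= 2:
        return ".".join(segs[-2:])
    # Nothing to split: return the cleaned input unchanged.
    return ".".join(segs)


for name in ("www.blueyonder.co.uk", "www.demon.co.uk", "a.b.akamaistream.net"):
    print(name, "->", registered_domain(name))
```

Any suffix missing from the table silently falls through to the two-label default, so the output is only as good as the rules it is fed.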
Good luck
Gus
rjamya wrote:
>If I don't distinguish "blueyonder.co.uk" and "demon.co.uk", it will
>be just "co.uk", and that means most of the commercial domains under the
>UK TLD. It would be akin to bundling most US sites under ".com" alone.
>
>If I had to take only the last 2 parts, it would be a piece of cake, and I
>wouldn't trouble this list with such a small RTFM issue. The problem I
>have is much more complicated.
>
>And no, this isn't a rhetorical question at all.
>Raj
>
>On 4/10/06, sol beach <sol.beach_at_gmail.com> wrote:
>
>
>>Rhetorical question -
>>
>>On what basis will the s/w "decide" whether 2 (akamaistream.net) parts or
>>3 (blueyonder.co.uk) parts is the "right" answer?
>>
>>
>>On 4/10/06, rjamya <rjamya_at_gmail.com> wrote:
>>
>>
>>Basically I am looking to isolate just the (distinct) domain name from
>>fully qualified domain names that you'd normally see in web-surfing.
>>
>>I am working on a couple of techniques, but it gets complicated since
>>TLDs differ in format and there is only so much you can do with
>>substr().
>>
>>
>>
>--
>----------------------------------------------
>Got RAC?
>--
>http://www.freelists.org/webpage/oracle-l
--
http://www.freelists.org/webpage/oracle-l
Received on Mon Apr 10 2006 - 14:30:38 CDT