Finding illegal UTF8 sequences

From: Weaver, Walt <wweaver_at_rightnow.com>
Date: Thu, 27 May 2004 12:37:43 -0600
Message-ID: <D30BE1A2F9109A43BA989E2F5168405606A6D236@pobox.corp.rightnow.com>

Is anyone experienced with finding illegal UTF8 sequences and doing something about them?

We have a UTF8 database containing Japanese data. One customer's data appears to contain malformed sequences scattered at random; when displayed, it shows up as random characters rather than Kanji.

Using the dump() function I've found sequences where there appears to be, say, a valid trail byte with no associated lead byte. I've also found a valid lead byte for a three-byte character with no associated trail bytes, and so on.

At least, I think that's what I've found.
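
To illustrate, the two kinds of damage described above (a stray trail byte, and a lead byte missing its trail bytes) can be located with a short script. What follows is only a minimal sketch: Python is an assumption here, not something tied to the database, the data is assumed to have already been exported as raw bytes, and find_invalid_utf8 is a hypothetical helper name. It relies on Python's built-in strict UTF-8 decoder, which reports the offset of each sequence it rejects.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, bad_bytes, reason) for each malformed
    UTF-8 sequence found in data. Hypothetical helper, for illustration."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of the remainder; succeeds only if it is all valid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end], err.reason))
            # Resume scanning just past the rejected bytes.
            pos = end
    return problems


if __name__ == "__main__":
    # Two Kanji, then a stray trail byte (0x81), then a three-byte
    # lead byte (0xE6) that is not followed by any trail bytes.
    sample = "漢字".encode("utf-8") + b"\x81" + b"ok" + b"\xe6" + b"!"
    for offset, bad, reason in find_invalid_utf8(sample):
        print(f"offset {offset}: {bad.hex()} ({reason})")
```

Re-decoding the remainder after every error is simple but slow on large inputs with many bad sequences; for a one-off check of a single customer's data that trade-off is probably acceptable.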

I'm still in learning mode here, trying to figure out what I'm looking at and what I'm going to do about it.

This problem is isolated to one customer and may be the result of a data import that was done some time ago.

So, does anyone know of any utilities that can find and print out illegal UTF8 sequences? Or am I going to have to hire someone to do it for me (I'm not smart enough to be able to do that sort of thing)?

Thanks,
--Walt Weaver

Bozeman, Montana


Received on Thu May 27 2004 - 13:35:24 CDT