Re: CLOB's and Character Sets

From: Kim Berg Hansen <kibeha_at_gmail.com>
Date: Wed, 13 Sep 2023 20:56:38 +0200
Message-ID: <CA+S=qd2eME9Q1SkG6ovY6Bj_AbXZwmPGJW1kADbgXz=Csu0yLA_at_mail.gmail.com>



Right, the characters with accents and umlauts, they are all 8-bit ascii, so they can't exist in 7-bit ascii.

If they show up fine in the old system, I am guessing it is like this:

  • The old database is US7ASCII.
  • The client is using NLS_LANG (either environment variable or windows registry) with value AMERICAN_AMERICA.US7ASCII . (Or alternatively the client has *no *NLS_LANG set anywhere, which makes database think the client is using US7ASCII.)
  • When the database *thinks *the client is using US7ASCII and the database *also *is running US7ASCII, *no *conversion takes place. The database trusts that the client is sending only US7ASCII characters and stores them without any conversion.
  • But the client has been lying to the database, the client has actually stored *8-bit* ascii characters. Because the database *thought *it was 7-bit, the 8-bit characters *are *stored (wrongly) with all 8 bits in the database.
  • When the client queries the data, again no conversion takes place - the database thinks it has stored 7-bit and the client is expecting 7-bit, so it passes data unconverted to the client. This means that the client on the old system gets the original 8-bit characters and shows them correctly, *even though they were invalid in US7ASCII* !
  • With the database link, the two databases see that one is US7ASCII, the other AL32UTF8, so when copying over the database link, a conversion from US7ASCII to AL32UTF8 will take place. BUT!! Those characters that are 8-bit (and therefore invalid), the database cannot convert them, so it turns it into a question mark.

The old database *should *really have had an 8-bit characterset, it should not have been US7ASCII, it should have been for example WE8ISO8859P1 or maybe USPC437 or whatever encoding the client *actually *was using when it was inserting the data into the database. Then the conversion over the database link would have been correct.

As it is, the big problem is that whenever you access those clobs as *character* data in any *other way *than through a client with wrong or missing NLS_LANG setup, the database will try to interpret the data as if they were encoded in US7ASCII and thereby destroy all 8-bit characters as they do not belong in US7ASCII.

Probably the text needs to be *extracted* using client with wrong or missing NLS_LANG into files, and then re-inserted in the new database using a client with *correct *NLS_LANG setting (WE8ISO8859P1, USPC437, or whatever fits the data) to allow the correct character conversion to take place. (All of the 8-bit characters with accents and umlauts will be converted to 2-byte UTF characters.)

Alternatively there are ways to convert to the correct 8-bit encoding, but that is not possible when the database characterset is 7-bit - the 8-bit special characters just can't be handled *as text *in a database with a 7-bit characterset. The only reason it has worked until now is that the same faulty configured client is used both to insert and query.

Possibly some "tricks" could be made by casting the CLOB to a BLOB - I am not sure, but perhaps that would keep the faulty 8-bit values - copy the BLOB to the new database, and there convert the BLOB to CLOB using the real 8-bit characterset.

Cheerio
/Kim

Regards

Kim Berg Hansen
Oracle ACE Director

Author of Practical Oracle SQL
<https://www.apress.com/gp/book/9781484256169> http://www.kibeha.dk
kibeha_at_kibeha.dk
_at_kibeha
<http://twitter.com/kibeha>

On Wed, Sep 13, 2023 at 5:16 PM Scott Canaan <srcdco_at_rit.edu> wrote:

> Special characters are those in languages other than English with accents,
> umlauts, ect. The customer says they show up fine in the “old” system, but
> are replaced by “?” in the “new” system.
>
>
>
> *Scott Canaan ‘88*
>
> *Sr Database Administrator *Information & Technology Services
> Finance & Administration
>
>
> *Rochester Institute of Technology *o: (585) 475-7886 | f: (585) 475-7520
>
> *srcdco_at_rit.edu <srcdco_at_rit.edu>* | c: (585) 339-8659
>
> *CONFIDENTIALITY NOTE*: The information transmitted, including
> attachments, is intended only for the person(s) or entity to which it is
> addressed and may contain confidential and/or privileged material. Any
> review, retransmission, dissemination or other use of, or taking of any
> action in reliance upon this information by persons or entities other than
> the intended recipient is prohibited. If you received this in error, please
> contact the sender and destroy any copies of this information.
>
>
>
> *From:* Kim Berg Hansen <kibeha_at_gmail.com>
> *Sent:* Wednesday, September 13, 2023 11:13 AM
> *To:* Scott Canaan <srcdco_at_rit.edu>
> *Cc:* Pierre Labrousse <Pierre.Labrousse_at_digora.com>;
> oracle-l_at_freelists.org
> *Subject:* Re: CLOB's and Character Sets
>
>
>
> What special characters are going wrong?
>
> When I hear "special characters" I usually think of characters with ascii
> value > 127, but those should not exist in US7ASCII.
>
>
>
> *If *there are characters with ascii value > 127 (8th bit set) and it is *those
> *characters that are going wrong, then the cause is that such characters *should
> not* exist in US7ASCII and the conversion that is happening from 7-bit
> ASCII to UTF cannot convert 8-bit characters (as they do not exist in
> US7ASCII.)
>
>
>
> It's just a guess, but it's a possibility that they've somehow gotten
> 8-bit ASCII characters even though such ASCII values do not exist in
> US7ASCII. It can happen if clientside NLS_LANG is incorrectly setup,
> similar to this case:
> https://www.kibeha.dk/2018/05/corrupting-characters-how-to-get.html
>
>
>
> Cheerio
>
> /Kim
>
>
>
>
>
> Regards
>
>
>
> Kim Berg Hansen
>
> Oracle ACE Director
>
>
>
> Author of Practical Oracle SQL
> <https://www.apress.com/gp/book/9781484256169>
>
> http://www.kibeha.dk
>
> kibeha_at_kibeha.dk
>
> _at_kibeha
> <http://twitter.com/kibeha>
>
>
>
>
>
> On Wed, Sep 13, 2023 at 5:03 PM Scott Canaan <srcdco_at_rit.edu> wrote:
>
> The customer was trying to do an insert with a subselect over a database
> link.
>
>
>
> *Scott Canaan ‘88*
>
> *Sr Database Administrator *Information & Technology Services
> Finance & Administration
>
>
> *Rochester Institute of Technology *o: (585) 475-7886 | f: (585) 475-7520
>
> *srcdco_at_rit.edu <srcdco_at_rit.edu>* | c: (585) 339-8659
>
> *CONFIDENTIALITY NOTE*: The information transmitted, including
> attachments, is intended only for the person(s) or entity to which it is
> addressed and may contain confidential and/or privileged material. Any
> review, retransmission, dissemination or other use of, or taking of any
> action in reliance upon this information by persons or entities other than
> the intended recipient is prohibited. If you received this in error, please
> contact the sender and destroy any copies of this information.
>
>
>
> *From:* Pierre Labrousse <Pierre.Labrousse_at_digora.com>
> *Sent:* Wednesday, September 13, 2023 10:56 AM
> *To:* oracle-l_at_freelists.org; Scott Canaan <srcdco_at_rit.edu>
> *Subject:* RE: CLOB's and Character Sets
>
>
>
> Hello Scott,
>
>
>
> Since 8.1.5, CLOB columns are stored internally with AL16UTF16 charset, no
> relation with NLS_CHARSET of the database (cf MOS doc : 257772.1).
>
>
>
> Wich tool do you used to do your transfert beween the two databases ?
>
>
>
> If you use expdp/impdp to transfert tables with CLOB column, it will work.
>
>
>
> Regards.
>
> Pierre
>
>
>
>
>
> *Pierre **LABROUSSE*
> * Consultant DBA ORACLE (OCM 10g/11g/12c)*
>
> *M*obile +33 (0)7 56 05 27 38
>
> *pierre.labrousse**_at_**digora.co*
> <https://owa.digora.com/owa/redir.aspx?C=6cU4SWG0HEauWwQa74AWhWf917AeKdIIQ087knyB2eMPFQyUFrMl2EZ1Rw0CEMwBfg1hQ0wI9Zc.&URL=mailto%3apierre.labrousse%40digora.com%2520%250d>
> *m*
> <https://owa.digora.com/owa/redir.aspx?C=6cU4SWG0HEauWwQa74AWhWf917AeKdIIQ087knyB2eMPFQyUFrMl2EZ1Rw0CEMwBfg1hQ0wI9Zc.&URL=mailto%3apierre.labrousse%40digora.com>
>
>
> ------------------------------
>
> *De :* oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> de
> la part de Scott Canaan <srcdco_at_rit.edu>
> *Envoyé :* mercredi 13 septembre 2023 16:42
> *À :* oracle-l_at_freelists.org <oracle-l_at_freelists.org>
> *Objet :* CLOB's and Character Sets
>
>
>
> I have a customer that is trying to migrate CLOB data from one database to
> another. The source database was created using the US7ASCII characterset.
> The destination database was created using the AL32UTF8 characterset. The
> issue is that special characters are not transferring properly. In the
> destination database, all she sees is “?” in place of the character.
>
>
>
> Other than rebuilding the new database with the US7ASCII characterset,
> which will put the project way behind schedule, how can this be fixed? I
> was under the impression that the AL32UTF8 characterset was a superset of
> US7ASCII and should be able to handle the special characters.
>
>
>
>
> *Scott Canaan ‘88 *
> *Sr Database Administrator *Information & Technology Services
> Finance & Administration
>
>
> *Rochester Institute of Technology *o: (585) 475-7886 | f: (585) 475-7520
>
> *srcdco_at_rit.edu <srcdco_at_rit.edu>* | c: (585) 339-8659
>
> *CONFIDENTIALITY NOTE*: The information transmitted, including
> attachments, is intended only for the person(s) or entity to which it is
> addressed and may contain confidential and/or privileged material. Any
> review, retransmission, dissemination or other use of, or taking of any
> action in reliance upon this information by persons or entities other than
> the intended recipient is prohibited. If you received this in error, please
> contact the sender and destroy any copies of this information.
>
>
>
>

--
http://www.freelists.org/webpage/oracle-l
Received on Wed Sep 13 2023 - 20:56:38 CEST

Original text of this message