Re: Character set conversion and database size

From: Kenny Payton <k3nnyp_at_gmail.com>
Date: Tue, 18 Aug 2015 19:44:40 -0400
Message-Id: <BB08FC6E-1C79-4B95-810C-6904FBE3B0E6_at_gmail.com>

I would suggest wading into this carefully, there are lots of considerations and planning to do. A couple years back we converted 6 we8iso8859p1 databases to utf8 that were each 20-25T in size at the time and were a nightmarish mix of lossy and convertible data. We had to convert two in each maintenance window and had a number of issues to deal with. I will say we deviated dramatically from the supported approach and it’s not something I would advise to anyone if it could be avoided. Here are a couple things to think about from my experiences.

Read up on csscan and csalter
csscan detects problematic characters
csalter changes the characterset
Read up on the two categories of problematic characters Lossy - characters that are not valid to your characterset. This data needs to be fixed and/or deleted from your database prior to changing the character set. These characters are unknown to your characterset. The code points do not exist. Believe it or not, this is easier than you think. One thing I didn’t think much about before this project was if you have a client with the same characterset settings as your database the database does not need to do any conversion and also doesn’t bother with any validation. If your application is loading other characterset ( possibly unicode/utf8 ) data via this method your database loads it on in. We had a large amount of utf8 data in our database and left it in place. After the conversion the data represented fine since we were converting to utf8. The tricky part was tricking the alter into thinking the scan was clean. I forget the option but there is an option during the scan to provide guesstimates for your lossy data to give you an idea of what character sets they could be part of. We found the majority of our non-utf8 lossy data was we8mswin1252 characters. We converted this data with an internal package we stumbled across SYS_OP_CSCONV. Convertible - these are characters that exist in the current characterset and also exist in the new characterset but have different code points. This data needs to be exported, deleted and later re-imported into the database after the conversion has occurred. Take a look at the Data Migration Assistant for Unicode ( DMU ) It was in its infancy and pretty buggy when we were doing our migrations but it might be better now. http://www.oracle.com/technetwork/database/database-technologies/globalization/dmu/overview/index.html <http://www.oracle.com/technetwork/database/database-technologies/globalization/dmu/overview/index.html>

We had to get pretty creative to meet our change windows with millions of rows of lossy and convertible data and csalter's requirement to have a clean scan prior to changing the characterset. We actually hacked this part. That was not a fun part but seemed pretty much unavoidable with our constraints. Ultimately leveraged a temporary physical standby, a fairly long and complex series of steps and some inhouse developed perl processes for mass parallel processing.

Hopefully your conversion is painless but if you run into some stumbling blocks you can sometimes overcome them with a little creativity and dedication.

Good luck

Kenny

> On Aug 14, 2015, at 3:08 PM, Schauss, Peter (ESS) <peter.schauss_at_ngc.com> wrote:
>
> Oracle 12.1.0.2 on Red Hat Linux.
>
> IT management is pushing us to migrate our databases to the Unicode character set ( NLS_CHARACTERSET= WE8MSWIN1252 to AL32UTF8). One of my internal customers is concerned that conversion to multibyte characters will dramatically increase the size of the database. I know that for VARCHAR2s Oracle only allocates storage for the actual characters stored in the string regardless maximum defined size of the column. Does the same idea apply a VARCHAR2 column whose contents can be coded as single bytes or does "Hello world!" actually take twice as much space in Unicode as it does in the default character set?
>
> Thanks,
> Peter
> --
> http://www.freelists.org/webpage/oracle-l
>
>

--
http://www.freelists.org/webpage/oracle-l

Received on Wed Aug 19 2015 - 01:44:40 CEST

This message: [ Message body ]
Next message: Mark Brinsmead: "Re: awr"
Previous message: Hans Forbrich: "Re: awr"
In reply to: Schauss, Peter (ESS): "Character set conversion and database size"

Contemporary messages sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Original text of this message