Restricting characters in a UTF8 database [message #603981]
Sat, 21 December 2013 20:42
orauser001
Messages: 13 Registered: April 2013 Location: us
Junior Member
Version Information
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE 11.2.0.2.0 Production
TNS for 64-bit Windows: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production
We are building an application that will store data about companies. It will receive feeds from 100+ sources. The database character set is AL32UTF8.
The user requirement is that the database should allow storing any 'Latin' and 'Arabic' characters. According to the Unicode code charts (http://www.unicode.org/charts/), the Latin and Arabic characters fall in the following ranges:
- Basic Latin (ASCII) [0000-007F]
- Latin-1 Supplement [0080-00FF]
- Latin Extended-A [0100-017F]
- Latin Extended-B [0180-024F]
- Latin Extended-C [2C60-2C7F]
- Latin Extended-D [A720-A7FF]
- Arabic [0600-06FF]
- Arabic Supplement [0750-077F]
- Arabic Extended-A [08A0-08FF]
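To make the requirement concrete, the ranges above can be written down as a simple membership test. This is only a rough sketch in Python of the rule I have in mind, not the in-database check I am asking about; the names `ALLOWED_RANGES`, `disallowed_chars` and `is_acceptable` are made up for illustration.

```python
# Accepted code point ranges (inclusive), copied from the list above.
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # Basic Latin (ASCII)
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
]


def disallowed_chars(value):
    """Return the characters of value whose code points lie outside every allowed range."""
    return [
        ch
        for ch in value
        if not any(low <= ord(ch) <= high for low, high in ALLOWED_RANGES)
    ]


def is_acceptable(value):
    """True when every character of value falls inside one of the allowed ranges."""
    return not disallowed_chars(value)
```

For example, `is_acceptable("Acme Ltd")` would pass, while a company name containing a CJK character or the euro sign would be rejected, since neither is in the listed ranges.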
Questions I have:
1. Once we get data from a source, it is first loaded into a temporary staging table. Is there an easy way to query the staging table to find out whether a specific column (e.g. Company Name) contains any characters outside the acceptable ranges above, so that those rows can be rejected and not loaded into the master tables?
2. Since we will be getting large volumes of such data, the check should ideally run in a reasonable amount of time.
3. We need to create a specification document for our data providers. I am wondering what we need to specify in that document. Will it suffice to say that the files should be encoded in UTF-8 and that the characters should be in the code ranges our application accepts (specified above)?
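Regarding question 3, "encoded in UTF8" is at least something a provider could verify before sending a file. A minimal sketch of such a check, again in Python and purely illustrative (the function name `is_valid_utf8` is my own):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8, False otherwise."""
    try:
        # Strict decoding raises on any byte sequence that is not well-formed UTF-8.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

This only confirms that the bytes are well-formed UTF-8. It says nothing about whether the characters are in the accepted ranges, so the range rule would still need to be stated separately in the specification.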
Thanks in advance for your help.