Loading Unicode Characters with Oracle SQL Loader (sqlldr) results in question marks

I'm trying to load localized strings from a Unicode (UTF-8-encoded) CSV into an Oracle database using SQL*Loader. I've tried all sorts of combinations, but nothing gives me the result I'm looking for, which is to have special Greek characters like Δ load correctly instead of coming out as garbled characters or upside-down question marks (¿).

My table definition looks like this:

CREATE TABLE "GLOBALIZATIONRESOURCE"
(
    "RESOURCETYPE" VARCHAR2(255 CHAR) NOT NULL ENABLE,
    "CULTURE"      VARCHAR2(20 CHAR) NOT NULL ENABLE,
    "KEY"          VARCHAR2(128 CHAR) NOT NULL ENABLE,
    "VALUE"        VARCHAR2(2048 CHAR),
    "DESCRIPTION"  VARCHAR2(512 CHAR),
    CONSTRAINT "PK_GLOBALIZATIONRESOURCE" PRIMARY KEY ("RESOURCETYPE","CULTURE","KEY") USING INDEX TABLESPACE REPSPACE_IX ENABLE
)
TABLESPACE REPSPACE; 

I have tried the following configurations in my control file (and just about every permutation I could think of):

load data
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(   
    "RESOURCETYPE" CHAR(255), 
    "CULTURE" CHAR(20), 
    "KEY" CHAR(128), 
    "VALUE" CHAR(2048), 
    "DESCRIPTION" CHAR(512)
)

load data
CHARACTERSET UTF8
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(   
    "RESOURCETYPE" CHAR(255), 
    "CULTURE" CHAR(20), 
    "KEY" CHAR(128), 
    "VALUE" CHAR(2048), 
    "DESCRIPTION" CHAR(512)
)

load data
CHARACTERSET UTF16
TRUNCATE
INTO TABLE "GLOBALIZATIONRESOURCE"
FIELDS TERMINATED BY X'002c' OPTIONALLY ENCLOSED BY X'0022'
TRAILING NULLCOLS
(   
    "RESOURCETYPE" CHAR(255), 
    "CULTURE" CHAR(20), 
    "KEY" CHAR(128), 
    "VALUE" CHAR(2048), 
    "DESCRIPTION" CHAR(512)
)

With the first two options, the Unicode characters don't load correctly and just show up as upside-down question marks.

If I choose the last option, UTF16, I get the following error, even though all the data in my fields is much shorter than the specified length.

Field in data file exceeds maximum length

It seems as though no combination of control file settings (even setting the byte order to little- or big-endian) works correctly. Can someone please give an example of a configuration (table structure and CTL file) that correctly loads Unicode data from a CSV? Any help would be greatly appreciated.

Note: I've already been to http://docs.oracle.com/cd/B19306_01/server.102/b14215/ldr_concepts.htm and http://docs.oracle.com/cd/B10501_01/server.920/a96652/ch10.htm.

Immigrant answered 6/12, 2011 at 19:56 Comment(5)
What is the database character set? Assuming the database supports Unicode, what tool are you using to query the data and what are the client NLS settings? Are you sure the client and query tool support the Greek characters you're trying to load? Have you used the DUMP function to determine whether the data in the database is actually properly encoded?Gambrill
I found the problem after a lot of trial and error. I switched to NVARCHAR2 as my database column type and to CHARACTERSET UTF8 in my control file, which handles Unicode text perfectly.Immigrant
@philirabin - That implies that your database character set doesn't support Unicode. If you use NVARCHAR2 data types (assuming your national character set supports Unicode), make sure that all your downstream applications support NVARCHAR2 data types. Using the national character set rather than the database character set introduces a decent amount of complexity.Gambrill
Can you elaborate on what those complexities are? To give you some background, we're using this table as a replacement for resx files. We're building a multi-tenant app, and each client has their own strings configured, which can vary between clients and languages. In other words, ClientX->SectionTitle->French could be different from ClientY->SectionTitle->French.Immigrant
What language(s) and framework(s) are you building your applications in? The complexities tend to depend on the languages and frameworks. Are you building a new application? Is this a new database? Or are you globalizing an established application?Gambrill
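Following up on the character set questions raised in these comments, here is roughly how one could check the database and national character sets and use the DUMP function mentioned by Gambrill to see what actually got stored. This is a sketch only: the table and column names come from the question, everything else is plain Oracle SQL.

-- Character sets used for VARCHAR2/CLOB and for NVARCHAR2/NCLOB columns
SELECT parameter, value
FROM   nls_database_parameters
WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- Inspect the stored bytes of a few loaded rows
-- (format 1016 = hexadecimal dump plus the character set name)
SELECT "VALUE", DUMP("VALUE", 1016) AS raw_bytes
FROM   GLOBALIZATIONRESOURCE
WHERE  ROWNUM <= 5;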

I had the same issue and resolved it with the steps below:

  1. Open the data file in Notepad++, go to the "Encoding" menu, select the UTF-8 encoding, and save the file.
  2. Use CHARACTERSET UTF8 in the CTL file, then load the data (an example invocation is sketched below).
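For completeness, a load using that control file could then be started along these lines. The connect string and file names here are made up for illustration; userid, control, data, log, and bad are standard sqlldr parameters.

sqlldr userid=scott/tiger@orcl \
       control=globalizationresource.ctl \
       data=globalizationresource.csv \
       log=globalizationresource.log \
       bad=globalizationresource.bad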
Cornie answered 26/11, 2020 at 19:54 Comment(0)

You have two problems:

  1. Character set.

Answer: You can solve this by finding your text file's character set (most of the time Notepad++ can do this). After finding the character set, you have to find the SQL*Loader character set name that corresponds to it; you can find this information at https://docs.oracle.com/cd/B10501_01/server.920/a96529/appa.htm#975313 (a few common mappings are also listed after the example below). After all of this, the character set problem should be solved.

  2. Despite your actual data being shorter, sqlldr reports: Field in data file exceeds maximum length.

Answer: You can solve this by adding CHAR(4000) (or whatever the actual length is) to the problematic column; by default SQL*Loader assumes a maximum of 255 bytes for a character field, so a longer field overflows it even when the database column is big enough. In my case the problematic column is column "E". An example is below; this is how I solved my problem, hope it helps.

LOAD DATA
CHARACTERSET UTF8
-- This line is a comment
-- Turkish charset (for ÜĞİŞ etc.)
-- CHARACTERSET WE8ISO8859P9
-- Character set list is here:
-- https://docs.oracle.com/cd/B10501_01/server.920/a96529/appa.htm#975313
INFILE 'data.txt' "STR '~|~\n'"
TRUNCATE
INTO TABLE SILTAB
FIELDS TERMINATED BY '#'
TRAILING NULLCOLS
(
    a,
    b,
    c,
    d,
    e CHAR(4000)
)
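As a quick reference for the character set part (item 1), here are a few common editor encodings and the Oracle character set names they correspond to, to the best of my knowledge; the authoritative list is in the Oracle documentation linked above:

    UTF-8                 ->  UTF8 (or AL32UTF8)
    Windows-1252          ->  WE8MSWIN1252
    ISO 8859-9 (Turkish)  ->  WE8ISO8859P9
    ISO 8859-7 (Greek)    ->  EL8ISO8859P7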

Reluctant answered 28/12, 2016 at 19:38 Comment(0)

There is a range of character set encodings that you can use in the control file when loading data with SQL*Loader.

For Greek characters, I believe the Western European character set should do the trick:

LOAD DATA
CHARACTERSET WE8ISO8859P1

or, in the case of MS Word input files with smart characters, try this in the control file:

LOAD DATA
CHARACTERSET WE8MSWIN1252
Mckenney answered 29/4, 2016 at 16:4 Comment(0)

You must ensure that the following character sets are the same:

  1. the database character set
  2. the dump (data) file character set
  3. the character set of the client from which you are doing the import (NLS_LANG)

If the client-side character set is different, Oracle will attempt to convert the characters to the native database character set, and this might not always give the desired result.
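For example, on a Linux/Unix client you could set NLS_LANG before running SQL*Loader so that the client character set matches your UTF-8 data. The value shown is one common choice; adjust the language and territory parts as appropriate:

# Tell the Oracle client libraries that the client side works in UTF-8,
# then run sqlldr as usual with your control file
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8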

Rainproof answered 8/12, 2011 at 7:10 Comment(1)
Can anybody explain what this character set is, and how to identify which character set a piece of text belongs to?Prosimian

Don't use MS Office to save the spreadsheet as a Unicode .csv. Instead, use OpenOffice to save it as a Unicode (UTF-8) .csv file. Then add CHARACTERSET UTF8 to the loader control file and run Oracle SQL*Loader. This gives me correct results.

Fleeta answered 3/4, 2015 at 16:37 Comment(0)
