Reading a UTF-8 encoded text file in Mathematica

How can I read a UTF-8 encoded text file in Mathematica?

This is what I'm doing now:

text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];

but it tells me that

$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"

and so on. I am not sure why. I believe the file is valid UTF-8.

Here's the file I'm trying to read:

http://dl.dropbox.com/u/38623/charData.txt

Infidelity answered 8/4, 2011 at 15:6 Comment(7)
The tenth line, containing ":Ba", has characters that look different in Mathematica, Safari, and TextEdit/BBEdit. (I'm not going to interpret this, just pointing it out.) - Isometric
Looks like it doesn't support 4-byte UTF-8 sequences. Likely a bug. - Widower
@Brett, ditto TextMate / Chrome - Madra
Also, iconv -f UTF-8 charData.txt > /dev/null didn't give any errors, so I think the file likely is valid UTF-8. - Madra
It is interesting to note that $CharacterEncoding on my machine is set to UTF8 by default, and when I Import the text, I don't get an error. However, the text has obviously been misinterpreted. Another bug? - Resound
I pasted the text of a Chinese website into Notepad, exported it as UTF-8 text, and imported it in Mathematica without a problem. Guess it must be some uncommon subset of UTF-8 that's causing this. Could you try reducing the file size until you have found the culprit? - Extraction
As Brett and Kenny pointed out, one of the culprit lines is line 10, "畐畐:Ba(𠮛田,𠮛田)", in particular the character 𠮛, which has code point 0x00020b9b and requires 4 bytes to represent in UTF8 (or 2 units in UTF16). It appears Mathematica doesn't support characters above a certain code point, not even for copying and pasting into the front end. - Infidelity

Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...

As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B (𠮛) which appears on line 10.

Some versions of the Mathematica front-end (like 8.0.1 on 64-bit Windows 7) can handle the character in question when entered directly:

In[1]:= $c="𠮛";

But we run into trouble if we attempt to create the character from its Unicode code point (U+20B9B is 134043 in decimal):

In[2]:= 134043 // FromCharacterCode

During evaluation of In[2]:= FromCharacterCode::notunicode:
A character code, which should be a non-negative integer less
than 65536, is expected at position 1 in {134043}. >>
Out[2]= FromCharacterCode[134043]

One then wonders, what does Mathematica think the code is for this character?

In[3]:= $c // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[3]= {55362,57243}
Out[4]//BaseForm= {d842, df9b}
Out[5]//BaseForm= {1101100001000010, 1101111110011011}

Instead of a single Unicode value as one might expect, we get two codes which happen to match the UTF-16 representation of that character. Mathematica can perform the inverse transformation as well:

In[6]:= {55362,57243} // FromCharacterCode

Out[6]= 𠮛
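For reference, the two codes above follow from the standard UTF-16 surrogate-pair arithmetic. Here is a minimal sketch; the name toSurrogatePair is my own, not a built-in function:

```mathematica
(* Hypothetical helper: split a code point above 16^^ffff into its
   UTF-16 surrogate pair {high, low}. *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
  With[{v = cp - 16^^10000},
    {16^^d800 + BitShiftRight[v, 10],   (* top 10 bits    -> high surrogate *)
     16^^dc00 + BitAnd[v, 16^^3ff]}]    (* bottom 10 bits -> low surrogate  *)

toSurrogatePair[16^^20b9b]   (* {55362, 57243}, i.e. {d842, df9b} *)
```

The result matches Out[3], which supports the reading that the string is stored as UTF-16 code units.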

What, then, is Mathematica's conception of the UTF-8 encoding of this character?

In[7]:= ExportString[$c, "Text", CharacterEncoding -> "UTF8"] // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[7]= {237,161,130,237,190,155}
Out[8]//BaseForm= {ed, a1, 82, ed, be, 9b}
Out[9]//BaseForm= {11101101, 10100001, 10000010, 11101101, 10111110, 10011011}

The attentive reader will spot that this is the UTF-8 encoding of the UTF-16 encoding of the character. Can Mathematica decode this, um, interesting encoding?

In[10]:= ImportString[
           ExportString[{237,161,130,237,190,155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]

Out[10]= 𠮛

Yes it can! But... so what?

How about the real UTF-8 expression of this character:

In[11]:= ImportString[
           ExportString[{240, 160, 174, 155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]
Out[11]= $CharacterEncoding::utf8: The byte sequence {240} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {160} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {174} could not be
interpreted as a character in the UTF-8 character encoding. >>
General::stop: Further output of $CharacterEncoding::utf8 will be suppressed
during this calculation. >>
ð ®

... but we see the failure reported in the original question.
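Both byte sequences used in these experiments can be derived by hand. The sketch below applies the ordinary UTF-8 bit layout to a single integer; utf8Bytes is a name I made up for illustration, and it does no range checking. Applied to the code point it gives the real 4-byte form, and applied to each surrogate separately it gives the 6-byte form that Mathematica produced (a scheme that, as far as I can tell, matches what is known as CESU-8):

```mathematica
(* Hypothetical helper: UTF-8 bytes for one integer n, treated as a single
   code point (or a single UTF-16 code unit). No validation is performed. *)
utf8Bytes[n_Integer] := Which[
  n < 16^^80, {n},
  n < 16^^800,
    {BitOr[16^^c0, BitShiftRight[n, 6]],
     BitOr[16^^80, BitAnd[n, 16^^3f]]},
  n < 16^^10000,
    {BitOr[16^^e0, BitShiftRight[n, 12]],
     BitOr[16^^80, BitAnd[BitShiftRight[n, 6], 16^^3f]],
     BitOr[16^^80, BitAnd[n, 16^^3f]]},
  True,
    {BitOr[16^^f0, BitShiftRight[n, 18]],
     BitOr[16^^80, BitAnd[BitShiftRight[n, 12], 16^^3f]],
     BitOr[16^^80, BitAnd[BitShiftRight[n, 6], 16^^3f]],
     BitOr[16^^80, BitAnd[n, 16^^3f]]}]

(* Real UTF-8: encode the code point itself. *)
utf8Bytes[16^^20b9b]                          (* {240, 160, 174, 155} *)

(* What Out[7] shows: each surrogate encoded as if it were a character. *)
Join @@ (utf8Bytes /@ {16^^d842, 16^^df9b})   (* {237, 161, 130, 237, 190, 155} *)
```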

How about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have already seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl (using big-endian UTF-16 with a byte-order-mark):

In[12]:= ImportString[
           ExportString[
             FromDigits[#, 16]& /@ {"fe", "ff", "d8", "42", "df", "9b"}
             , "Byte"
           ]
         , "Text"
         , CharacterEncoding -> "Unicode"
         ]
Out[12]= 𠮛

It works. As a more complete experiment, I re-encoded the cited text file from the question into UTF-16 and imported it successfully.

The Mathematica documentation is largely silent on this subject. It is interesting to note that mention of Unicode in Mathematica appears to be accompanied by the assumption that character codes contain 16 bits. See, for example, references to Unicode in Raw Character Encodings.

The conclusion to be drawn from this is that Mathematica's support for UTF-8 transcoding is missing or buggy for codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a work-around if you are in a position to re-encode your files and you can accept that the resulting strings will be sequences of UTF-16 code units rather than sequences of Unicode code points.
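If re-encoding the file externally is not an option, another possibility is to read the raw bytes and decode them yourself. The following is only a sketch under stated assumptions: utf8ToUTF16Units is a hypothetical helper of mine, it assumes well-formed UTF-8 and performs no validation, and it relies on FromCharacterCode accepting surrogate pairs as shown in In[6]. I have only checked it against the single character discussed here, not against the whole file.

```mathematica
(* Hypothetical helper: decode well-formed UTF-8 bytes into UTF-16 code units.
   Assumes valid input; malformed or truncated sequences are not detected. *)
utf8ToUTF16Units[bytes_List] :=
  Module[{i = 1, n = Length[bytes], b, len, cp, out},
    out = Reap[
      While[i <= n,
        b = bytes[[i]];
        (* The lead byte gives the sequence length and the first payload bits. *)
        {len, cp} = Which[
          b < 16^^80, {1, b},
          b < 16^^e0, {2, BitAnd[b, 16^^1f]},
          b < 16^^f0, {3, BitAnd[b, 16^^0f]},
          True,       {4, BitAnd[b, 16^^07]}];
        (* Each continuation byte contributes its low 6 bits. *)
        Do[cp = BitOr[BitShiftLeft[cp, 6], BitAnd[bytes[[i + k]], 16^^3f]],
          {k, 1, len - 1}];
        i += len;
        If[cp < 16^^10000,
          Sow[cp],
          (* Above 16 bits: emit a surrogate pair instead of the code point. *)
          cp -= 16^^10000;
          Sow[16^^d800 + BitShiftRight[cp, 10]];
          Sow[16^^dc00 + BitAnd[cp, 16^^3ff]]]]
      ][[2]];
    If[out === {}, {}, First[out]]]

utf8ToUTF16Units[{240, 160, 174, 155}]   (* {55362, 57243} *)

(* Possible usage on the file from the question: *)
(* text = FromCharacterCode[utf8ToUTF16Units[BinaryReadList["charData.txt"]]] *)
```

The resulting string has the same limitation as the UTF-16 work-around: characters above 16 bits occupy two positions in the string.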

Postscript

A little while after writing this response, I attempted to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)

Sciurine answered 9/4, 2011 at 17:20 Comment(4)
I am curious, what version of Mathematica do you use, on what platform? With version 8 on Windows XP, the front end cannot handle 𠮛. When pasting it (I can't enter it using the IME), I get a two-character sequence. Also, according to the documentation, the "Unicode" setting will read "raw 2-byte Unicode values", which is mostly in agreement with this (and your other findings). - Infidelity
@Infidelity I wrote this up using Mathematica 8.0.1 on Windows 7 64-bit. Just a sec, I'll try MMa 7... (pause) Yup, it works there too for me. However, I just noticed that when I re-opened the notebook that contains this response, the character in question now shows as a "missing character" square. And the notebook no longer functions correctly. Interesting. The bugs run still further... - Sciurine
What we can learn from this is that Mathematica can't interpret characters that require 2 units (4 bytes) to encode in UTF-16. However, it leaves these 4-byte sequences intact, so the data can still be exported to other applications (even by copying the string from the front end and pasting it elsewhere!). But while processing the data we need to keep in mind that these characters are going to be treated as two-character sequences, and therefore functions such as StringLength return incorrect results. - Infidelity
Here's a function to get the Unicode code point from a pair of UTF-16 code units: toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4. The condition is optional; it verifies that the pair is indeed valid UTF-16. - Infidelity
