Reading an UTF-8 encoded text file in Mathematica

Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of 17+ bit character codes is generally buggy. The long version follows...

As noted by numerous commenters, the problem appears to be with Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the cited text file is U+20B9B (𠮛) which appears on line 10.

Some versions of the Mathematica front-end (like 8.0.1 on 64-bit Windows 7) can handle the character in question when entered directly:

In[1]:= $c="𠮛";

But we run into trouble if we attempt to create the character from its Unicode:

In[2]:= 134043 // FromCharacterCode

During evaluation of In[2]:= FromCharacterCode::notunicode:
A character code, which should be a non-negative integer less
than 65536, is expected at position 1 in {134043}. >>
Out[2]= FromCharacterCode[134043]

One then wonders, what does Mathematica think the code is for this character?

In[3]:= $c // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[3]= {55362,57243}
Out[4]//BaseForm= {d842, df9b}
Out[5]//BaseForm= {1101100001000010, 1101111110011011}

Instead of a single Unicode value as one might expect, we get two codes which happen to match the UTF-16 representation of that character. Mathematica can perform the inverse transformation as well:

In[6]:= {55362,57243} // FromCharacterCode

Out[6]= 𠮛

What, then, is Mathematica's conception of the UTF-8 encoding of this character?

In[7]:= ExportString[$c, "Text", CharacterEncoding -> "UTF8"] // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[7]= {237,161,130,237,190,155}
Out[8]//BaseForm= {ed, a1, 82, ed, be, 9b}
Out[9]//BaseForm= {11101101, 10100001, 10000010, 11101101, 10111110, 10011011}

The attentive reader will spot that this is the UTF-8 encoding of the UTF-16 encoding of the character. Can Mathematica decode this, um, interesting encoding?

In[10]:= ImportString[
           ExportString[{237,161,130,237,190,155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]

Out[10]= 𠮛

Yes it can! But... so what?

How about the real UTF-8 expression of this character:

In[11]:= ImportString[
           ExportString[{240, 160, 174, 155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]
Out[11]= $CharacterEncoding::utf8: The byte sequence {240} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {160} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {174} could not be
interpreted as a character in the UTF-8 character encoding. >>
General::stop: Further output of $CharacterEncoding::utf8 will be suppressed
during this calculation. >>
ð ®

... but we see the failure reported in the original question.

How about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have already seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl (using big-endian UTF-16 with a byte-order-mark):

In[12]:= ImportString[
           ExportString[
             FromDigits[#, 16]& /@ {"fe", "ff", "d8", "42", "df", "9b"}
             , "Byte"
           ]
         , "Text"
         , CharacterEncoding -> "Unicode"
         ]
Out[12]= 𠮛

It works. As a more complete experiment, I re-encoded the cited text file from the question into UTF-16 and imported it successfully.

The Mathematica documentation is largely silent on this subject. It is interesting to note that mention of Unicode in Mathematica appears to be accompanied by the assumption that character codes contain 16 bits. See, for example, references to Unicode in Raw Character Encodings.

The conclusion to be drawn from this is that Mathematica's support for UTF-8 transcoding is missing/buggy for codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a work-around if you are in a position to re-encode your files and you can accept that the resulting strings will actually be in UTF-16 format, not true Unicode strings.

Postscript

A little while after writing this response, I attempted to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)

Recommended topics

Hot tags