MATLAB: how to display UTF-8-encoded text read from file?

The gist of my question is this:

How can I display Unicode characters in MATLAB's GUI (OS X) so that they are properly rendered?

Details:

I have a table of strings stored in a file, and some of these strings contain UTF-8-encoded Unicode characters. I have tried many different ways (too many to list here) to display the contents of this file in the MATLAB GUI, without success. For example:

>> fid = fopen('/Users/kj/mytable.txt', 'r', 'n', 'UTF-8');
>> [x, x, x, enc] = fopen(fid); enc

enc =

UTF-8

>> tbl = textscan(fid, '%s', 35, 'delimiter', ',');
>> tbl{1}{1}

ans =

ÎÎÎÎÎΠΣΦΩαβγδεζηθικλμνξÏÏÏÏÏÏÏÏÏÏ
>> 

As it happens, if I paste the string directly into the MATLAB GUI, the pasted string is displayed properly. This shows that the GUI is not fundamentally incapable of displaying these characters, but once MATLAB reads the string in from the file, it no longer displays it correctly. For example:

>> pasted = 'ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω'

pasted =

ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω
>> 

Thanks!

Shiksa answered 28/7, 2011 at 17:31 Comment(0)

I present below my findings after doing some digging... Consider these test files:

a.txt

ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω

b.txt

தமிழ்

First, we read files:

%# open file in binary mode, and read a list of bytes
fid = fopen('a.txt', 'rb');
b = fread(fid, '*uint8')';             %# read bytes as a row vector
fclose(fid);

%# decode as unicode string
str = native2unicode(b,'UTF-8');
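
As a variation on the snippet above, the decoding can also be left to fopen/fread: when a file is opened with an explicit encoding and read with a char precision, fread converts the bytes to characters using that encoding. The following is only a sketch under stated assumptions. It writes its own temporary test file (the name comes from tempname and the three Greek letters are an illustrative stand-in for a.txt), so that it does not depend on any file from this answer.

```matlab
% Create a small UTF-8 test file containing the three letters U+0393, U+0394, U+0398.
% (The bytes are written directly so this script itself stays pure ASCII.)
fname = [tempname '.txt'];                 % hypothetical temporary file
fid = fopen(fname, 'w');
fwrite(fid, uint8([206 147 206 148 206 152]), 'uint8');
fclose(fid);

% Open with an explicit encoding and let fread do the decoding.
fid = fopen(fname, 'r', 'n', 'UTF-8');
str2 = fread(fid, '*char')';               % row vector of decoded characters
fclose(fid);
delete(fname);

codes = double(str2);                      % code points of the decoded characters
```

If the decoding worked, codes should hold three values above the ASCII range rather than six byte values.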

If you try to print the string, you get a bunch of nonsense:

>> str
str =

Nonetheless, str does hold the correct string. We can check the Unicode code point of each character; as you can see, they are all outside the ASCII range (the last two are the non-printable CR-LF line ending):

>> double(str)
ans =
  Columns 1 through 13
   915   916   920   923   926   928   931   934   937   945   946   947   948
  Columns 14 through 26
   949   950   951   952   953   954   955   956   957   958   960   961   962
  Columns 27 through 35
   963   964   965   966   967   968   969    13    10
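
To illustrate why the same bytes can show up as garbage, here is a small sketch that decodes identical UTF-8 bytes in two ways. It is not taken from the original session: the three code points are just the first three from the output above, and the interpretation (UTF-8 bytes treated as one character per byte) is an assumption about what produces the garbled display.

```matlab
% Build a short string from code points so this script stays pure ASCII.
s = char([915 916 920]);                         % first three letters of a.txt

% Encode to UTF-8: each of these letters takes two bytes.
bytes = unicode2native(s, 'UTF-8');              % uint8 row vector

% Decoding with the right charset gives the original string back.
good = native2unicode(bytes, 'UTF-8');

% Decoding with a single-byte charset yields one character per byte,
% which is the kind of garbage seen when the charset is wrong.
bad = native2unicode(bytes, 'ISO-8859-1');

fprintf('%d bytes, %d chars (UTF-8), %d chars (ISO-8859-1)\n', ...
    numel(bytes), numel(good), numel(bad));
```

The first mis-decoded character has code 206, which is the "Î" that appears in garbled output of this kind.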

Unfortunately, MATLAB seems unable to display this Unicode string in a GUI on its own. For example, all these fail:

figure
text(0.1, 0.5, str, 'FontName','Arial Unicode MS')
title(str)
xlabel(str)

One trick I found is to use the embedded Java capability:

%# Java Swing
label = javax.swing.JLabel();
label.setFont( java.awt.Font('Arial Unicode MS',java.awt.Font.PLAIN, 30) );
label.setText(str);
f = javax.swing.JFrame('frame');
f.getContentPane().add(label);
f.pack();
f.setVisible(true);


As I was preparing to write the above, I found an alternative solution. We can use the DefaultCharacterSet undocumented feature and set the charset to UTF-8 (on my machine, it is ISO-8859-1 by default):

feature('DefaultCharacterSet','UTF-8');
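
Since this changes a session-wide setting, it may be worth remembering the previous value and restoring it afterwards. The sketch below assumes that calling feature('DefaultCharacterSet') with a single argument returns the current charset name; the feature is undocumented, so this behavior may differ between releases. The variable name oldCharset is only illustrative.

```matlab
% Remember the current charset so the change can be undone later.
% (Query form of an undocumented feature; assumed to return a char vector.)
oldCharset = feature('DefaultCharacterSet');

% Switch to UTF-8 for the rest of this block.
feature('DefaultCharacterSet', 'UTF-8');

% ... display or process Unicode text here ...

% Restore the previous setting.
feature('DefaultCharacterSet', oldCharset);
restored = feature('DefaultCharacterSet');
```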

Now with a proper font (you can change the font used in the Command Window from Preferences > Font), we can print the string in the prompt (note that DISP is still incapable of printing Unicode):

>> str
str =
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπρςστυφχψω

>> disp(str)
ΓΔΘΛΞΠΣΦΩαβγδεζηθικλμνξπÏςστυφχψω

And to display it in a GUI, UICONTROL should work (under the hood, I think it is really a Java Swing component):

uicontrol('Style','text', 'String',str, ...
    'Units','normalized', 'Position',[0 0 1 1], ...
    'FontName','Arial Unicode MS', 'FontSize',30)

Unfortunately, TEXT, TITLE, XLABEL, etc. still show garbage.


As a side note: It is difficult to work with m-file sources containing Unicode characters in the MATLAB editor. I was using Notepad++, with files encoded as UTF-8 without BOM.

Freddyfredek answered 29/7, 2011 at 11:48 Comment(9)
What about hardcoded strings in the editor? Is there a way to make the MATLAB editor use UTF-8? - Origan
@nimcap: MATLAB can execute files with hardcoded Unicode strings just fine (str='...'), as long as you don't edit them in the MATLAB Editor (at least that's the case on my WinXP machine). It could be a configuration issue (language settings and locale?), but I couldn't make it work with the MATLAB IDE... The solution was to use an external editor to write the code. In my case it was Notepad++; I had to specify the encoding as UTF-8 without BOM, otherwise MATLAB complains about a syntax error with an extra character at the beginning of the file. - Freddyfredek
@nimcap: forgot to say that you must make sure to call feature('DefaultCharacterSet','UTF-8'); prior to that. - Freddyfredek
@Amro: I figured that out; I was asking about using the MATLAB editor as it is for UTF-8 text. I like the autocomplete and other features of the editor. - Origan
@nimcap: sorry, but I didn't find a way to make it work with the builtin editor. - Freddyfredek
Wonderful writeup. Fortunately, the situation has dramatically improved in R2013a! Displaying Unicode in the command window works by default, and once the UTF-8 character set feature has been selected it also works within handle graphics objects. Unfortunately, Unicode within string literals is still unsupported. See also bug 312955 at Mathworks: mathworks.com/support/bugreports/312955 - Chiromancy
@MattB.: thanks for the update. Indeed, the command window displays the text correctly (assuming a capable font is used). However, handle graphics still do not show the text correctly (text, xlabel, ylabel functions), even after setting the DefaultCharacterSet feature. I just tested this on WinXP 32-bit using R2013a; maybe the situation is different on other platforms. Good to know that TMW is looking into this issue :) - Freddyfredek
@Amro: Ah, interesting. Then it's platform-dependent behavior. I had tested it to work on Mac. - Chiromancy
Update: I'm happy to report that MATLAB R2014b (prerelease so far) seems to have fixed most of the issues above. The command prompt, Handle Graphics (the new HG2), uicontrols, and direct Java Swing components all render Unicode text correctly, without having to change DefaultCharacterSet (I tested it on a Windows machine configured with the default en-US locale). Unfortunately, the editor/IDE still chokes on non-ASCII encoded source files (stuff like BOM markers is not correctly recognized). - Freddyfredek