Why do those Thai characters display on the web page with a long tail?

Asked 19/8, 2011 at 8:48 Answered 19/5, 2016 at 11:50

ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้

I found some interesting characters, pasted above, which take up only about 3 character widths on screen. However, the actual length of the string is 380.

I inspected the string in Python, and its encoded form is as follows:

'\xe0\xb8\x94\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xd0\xb4\xe0\xb8\x94\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89'

It seems that the string is a combination of three Thai characters:

ด \xe0\xb8\x94  THAI CHARACTER DO DEK

้  \xe0\xb9\x89  THAI CHARACTER MAI THO

็  \xe0\xb9\x87  THAI CHARACTER MAITAIKHU
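A minimal Python 3 sketch of such an inspection, using only the standard `unicodedata` module. The shortened `raw` value and the `describe` helper are illustrative assumptions, not the original session:

```python
import unicodedata

# Shortened stand-in for the pasted bytes: DO DEK, MAI THO, MAITAIKHU.
raw = b"\xe0\xb8\x94\xe0\xb9\x89\xe0\xb9\x87"

# Decode the UTF-8 bytes into a str of Unicode code points.
text = raw.decode("utf-8")


def describe(s):
    """Return (code point, UTF-8 bytes, Unicode name) per distinct character."""
    seen = []
    for ch in s:
        if ch not in seen:
            seen.append(ch)
    return [
        (f"U+{ord(ch):04X}", ch.encode("utf-8"), unicodedata.name(ch))
        for ch in seen
    ]


for code_point, utf8_bytes, name in describe(text):
    print(code_point, utf8_bytes, name)
```

Running the same helper on the full string should list every distinct character it contains, which makes any unexpected ones easy to spot.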

And my questions are:

Why do those characters behave so differently? Is it a bug?
How can I avoid it on the site (perhaps with some HTML filter)?

UPDATE

I've tested the characters in more browsers, and the long tail only appears in Chrome and Firefox on Windows.

The following are screenshots I've taken:

win 7 ie8

ubuntu firefox

win 7 chrome

win 7 firefox

Therefore, I guess it is a browser-related bug.

Syst answered 19/8, 2011 at 8:48 Comment(8)

LOL, at the characters above:P – Elegit 19/8, 2011 at 8:49

Brilliant, I'd like to know why they display like that too. – Aleciaaleck 19/8, 2011 at 8:52

Programming-related how exactly? – Threepence 19/8, 2011 at 8:56

@paxdiablo, I think it is related to Unicode, which is something programmers need to understand, especially web front-end engineers. – Andesite 19/8, 2011 at 8:58

I wondered why I didn't really understand what was going on. Doesn't happen on Windows XP in IE8 or Chrome. – Milker 19/8, 2011 at 9:35

@alexmuller, I realized the display effect depends on the OS and browser, and I pasted some of my test results. :) – Andesite 19/8, 2011 at 9:38

Some systems have font rendering bugs that show up nastily with poorly-formed input. In other news, water is wet, the Pope is Catholic, and bears excrete in the woods. – Carlycarlye 19/8, 2011 at 12:17

They appear on IE9 on Win 7 too. – Neutral 26/4, 2012 at 14:21

There are two problems: one in the output system (the font renderer), which is not Thai-aware, and one in the input system that generated this text in the first place.

If you had done your homework, you would know that mai tho and maitaikhu (Unicode names) are what Unicode refers to as nonspacing marks (NSM). This means that the font renderer should not move to the next character cell when displaying this glyph.

In order to avoid the mess you see above, the Thai API Consortium (TAPIC) made the WTT 2.0 standard that describes both how the font rendering algorithm should handle Thai letter order when it receives it as input and also how the input method should allow such characters to be input if you attempt to type them.

Standardization and Implementations of Thai Language Overview

libthai includes both input and output methods.

thaicheck is a small program that can detect letter sequence problems and fix them.

By the way, you cannot have a sequence (word) of do dek, mai tho and maitaikhu; the input sequence is noise.

Bear in mind that some editors have broken input methods that allow typing multiple NSMs that cannot be combined, while the output method renders only legal sequences. The result is an illegal input string that looks OK to the user on their own system.
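As a much cruder alternative to full WTT 2.0 validation, a site could cap how many nonspacing marks may follow one another before storing or rendering user input. The sketch below is only an illustration of that idea: the function name `limit_combining_marks` and the default cap of 2 are assumptions, and some scripts may legitimately need a higher cap.

```python
import unicodedata


def limit_combining_marks(text, max_marks=2):
    """Drop combining marks beyond max_marks in any consecutive run."""
    out = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0  # any other character ends the run
        out.append(ch)
    return "".join(out)


# DO DEK followed by five MAI THO and five MAITAIKHU, like the noisy input.
noisy = "\u0e14" + "\u0e49" * 5 + "\u0e47" * 5
print(len(noisy), len(limit_combining_marks(noisy)))
```

This does not check that the remaining sequence is valid Thai; it only bounds how far the stacked marks can extend.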

Manriquez answered 19/8, 2011 at 10:19 Comment(4)

if everybody "had done their homework", we would not need stackoverflow – Freeload 19/8, 2011 at 10:25

I thought it was considered polite to try to find the answer to your problem before posting it here. – Manriquez 19/8, 2011 at 18:39

I've done some homework, but I am a newbie to Thai characters therefore I could not point out how to google it. And that is the reason why I think SO is awesome. – Andesite 20/8, 2011 at 1:56

The most elementary study of the Thai language or Thai texts would have shown that these diacritic marks must be combined with other characters. I really think you should have a good look at Thai texts if you really want to solve problems like this. – Manriquez 20/8, 2011 at 12:59

The codes you mention are all in UTF-8, which is why each character needs 3 bytes. The respective Unicode code points are:

DO DEK 0x0e14
MAI THO 0x0e49
MAITAIKHU 0x0e47

The latter two are in the category Mark, Nonspacing (Mn), meaning that they are combined with the preceding code point in rendering instead of taking a cell of their own. MAI THO also has a Canonical_Combining_Class of 107, while MAITAIKHU has class 0.

Your example starts with a single character and adds lots of nonspacing marks on top of it.

Compare with this C# code:

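using System;

// Thai code points from the list above; all are in the BMP, so one char each.
char DODEK = (char)0x0e14;
char MAITHO = (char)0x0e49;
char MAITAIKHU = (char)0x0e47;

string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
// string.Length counts UTF-16 code units, which equals code points here.
Console.WriteLine("number of code points: " + thai.Length);

// StringInfo groups a base character with its combining marks.
var si = new System.Globalization.StringInfo(thai);
Console.WriteLine("number of text elements: " + si.LengthInTextElements);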

Output:

number of code points: 3
number of text elements: 1
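
Since the question inspected the string in Python, here is a rough Python 3 counterpart. The standard library has no direct equivalent of `StringInfo`, so the `count_clusters` helper below is a simplified assumption: it counts every character that is not a nonspacing or enclosing mark as the start of a text element, which is not full Unicode text segmentation.

```python
import unicodedata

DO_DEK = "\u0e14"
MAI_THO = "\u0e49"
MAITAIKHU = "\u0e47"

thai = DO_DEK + MAI_THO + MAITAIKHU


def count_clusters(s):
    """Approximate text-element count: marks attach to the preceding base."""
    return sum(1 for ch in s if unicodedata.category(ch) not in ("Mn", "Me"))


print("UTF-8 bytes:", len(thai.encode("utf-8")))
print("code points:", len(thai))
print("approximate text elements:", count_clusters(thai))

# Show the general category of each character (Lo = letter, Mn = nonspacing mark).
for ch in thai:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```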
