Why does tcl/tkinter only support BMP characters?

I am trying to query and display UTF-8 encoded text in a GUI built on tkinter, and therefore on Tcl/Tk. However, I have found that tkinter cannot display characters that need 4 bytes in UTF-8, i.e. Unicode code points above U+FFFF. Why is this the case? What limitations would supporting non-BMP characters impose on Tcl?

I can't query non-BMP characters through my GUI, but if one comes up in a result I can copy and paste it and see the character and its code point on unicode-table.com, even though my system does not display it. So it seems that the character is rendered as U+FFFD but stored in the view with the correct code point.

I am running a Python 3.6.4 script on Windows 7.
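
To make the "4-byte" distinction above concrete, here is a minimal sketch that reports, for a single character, its code point, its UTF-8 byte length, its UTF-16 code unit count, and whether it lies in the BMP. The function name `describe` is hypothetical and is not part of my project.

```python
def describe(ch):
    """Return (code point, UTF-8 byte length, UTF-16 code units, in BMP?)."""
    cp = ord(ch)
    utf8_len = len(ch.encode('utf-8'))
    # Each UTF-16 code unit is 2 bytes; 'utf-16-le' adds no byte order mark.
    utf16_units = len(ch.encode('utf-16-le')) // 2
    return cp, utf8_len, utf16_units, cp <= 0xFFFF


# Only the code points are printed, so this is safe on consoles
# that cannot render the characters themselves.
for ch in ('A', '\u20ac', '\U0001F624'):
    cp, utf8_len, utf16_units, in_bmp = describe(ch)
    print('U+%04X utf8=%d utf16=%d bmp=%s' % (cp, utf8_len, utf16_units, in_bmp))
```

U+1F624, the character from my error message, needs 4 bytes in UTF-8 and two UTF-16 code units, whereas every BMP character fits in at most 3 UTF-8 bytes and one UTF-16 code unit.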

Update: For context, here is the error I get when a 4-byte character, whose code point lies outside the BMP, is passed to Tcl:

  File "Project/userInterface.py", line 569, in populate_tree
    iids.append(self.detailtree.insert('', 'end', values=entry))
  File "C:\Program Files (x86)\Python36-32\Lib\tkinter\ttk.py", line 1343, in insert
    res = self.tk.call(self._w, "insert", parent, index, *opts)
_tkinter.TclError: character U+1f624 is above the range (U+0000-U+FFFF) allowed by Tcl

I handle this by using a regular expression to replace out-of-range Unicode characters with the replacement character U+FFFD.

    import re  # belongs at the top of the module

    # Matches everything tkinter/Tcl rejects: code points above U+FFFF
    # (4 bytes in UTF-8) and lone surrogates (U+D800-U+DFFF).
    NON_BMP_RE = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

    for item in entries:
        # Replace unsupported characters with U+FFFD (the replacement character).
        entry = tuple(
            NON_BMP_RE.sub('\uFFFD', col) if isinstance(col, str) else col
            for col in item
        )
        iids.append(self.detailtree.insert('', 'end', values=entry))
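
The same substitution can also be written as a standalone function, which makes it possible to test without a Tk widget. This is only a sketch: the names `to_bmp` and `sanitize_row` are hypothetical, and it assumes each row is a tuple or list that may mix strings with other values.

```python
REPLACEMENT = '\ufffd'  # U+FFFD REPLACEMENT CHARACTER


def to_bmp(value):
    """Replace non-BMP characters and lone surrogates in a string.

    Non-string values are returned unchanged.
    """
    if not isinstance(value, str):
        return value
    return ''.join(
        REPLACEMENT if ord(ch) > 0xFFFF or 0xD800 <= ord(ch) <= 0xDFFF else ch
        for ch in value
    )


def sanitize_row(row):
    """Apply to_bmp to every column of a row and return a tuple."""
    return tuple(to_bmp(col) for col in row)
```

Each row would then go through `sanitize_row` before being handed to `insert(..., values=...)`.
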
Logos answered 17/1, 2018 at 21:17 Comment(6)
Ask the authors why. Create a minimal working example of your problem. - Archaeopteryx
You could use Pillow to solve that problem. - Impellent
@Simon Pillow is Python 3's version of the Python Imaging Library, correct? Here BMP does not mean an image bitmap but the Basic Multilingual Plane (Plane 0), i.e. code points up to U+FFFF, which take at most 3 bytes in UTF-8. Would Pillow help with extending Tcl to 4-byte characters? - Logos
Ah, I misunderstood. - Impellent
On Windows, tkinter should support non-BMP characters if you pass UTF-16 surrogate codes in the string. Python allows this with the 'surrogatepass' error handler. I don't think this is possible with UTF-8 on Unix. For example: title_bytes = '\U0001F60A'.encode('utf-16le'); title = ''.join(title_bytes[n:n+2].decode('utf-16le', 'surrogatepass') for n in range(0, len(title_bytes), 2)); root = Tk(); root.title(title); root.mainloop(). - Corelli
Tkinter only supports the BMP because Tk (the library that Tkinter is a wrapper around) only supports the BMP. That's a known issue with Tk that should be minimally fixed (provided you don't poke too closely) in 8.7. Encoding as surrogate pairs should work for now. - Allege
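
The surrogate-pair encoding suggested in the comments above can be sketched with plain arithmetic instead of an encode/decode round trip. The function name `to_surrogate_pairs` is hypothetical. Whether Tk then renders the result correctly depends on the platform and Tk version; the comments above only suggest it for Windows.

```python
def to_surrogate_pairs(text):
    """Rewrite each non-BMP character as a UTF-16 surrogate pair.

    The result is a str holding two surrogate code points
    (U+D800-U+DBFF followed by U+DC00-U+DFFF) per non-BMP character.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            out.append(chr(0xD800 + (cp >> 10)))    # high (lead) surrogate
            out.append(chr(0xDC00 + (cp & 0x3FF)))  # low (trail) surrogate
        else:
            out.append(ch)
    return ''.join(out)
```

A string converted this way contains only code points at or below U+FFFF, so it should pass Tcl's range check. Converting back is possible with `s.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')`.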
