What are the most difficult-to-render Unicode samples?

3

13

I'm trying to implement a cross-platform (desktop browsers, iOS, & Android) typography system that allows users to input any Unicode string.

What are some strings I should use to stress-test my system, so that as many users as possible have a good experience? Is there a standard or de facto standard list that I can use as well?

Hundley answered 30/12, 2015 at 22:44 Comment(9)
If this is off-topic here, please direct me somewhere I can find my answer. – Hundley
Doesn't seem off-topic, just a bit too vague for you to be likely to get much useful feedback. – Lida
@Lida any idea how I could make it more specific? – Hundley
Well, you say input, then talk about rendering. You mention several different platforms, all of which have their own font rendering and input systems, some with limited end-user control. So it's not really obvious what you're doing, what you're trying to achieve, what specific problems you are encountering or hoping to avoid, etc. – Lida
I'm creating a view which displays text (that can be supplied by a user) in fancy typographical styles (italic, colored, rotated, centered, etc.). What I want to achieve is ensuring any text the user supplies will render as intended. What I want to avoid is text that is unreadable, or otherwise does not convey the user's intended meaning, solely because of the chosen arrangement of characters. – Hundley
There isn't any way to ensure that in a cross-platform way. Even the samples you already have fail on Chrome on OS X, let alone IE or Chrome for Windows, and those are just a couple that I tried. Again, though, the specifics are unclear. A view in what? A web browser? An app? Etc. – Lida
@Lida A custom view in an application. Any rendering problems, I can fix manually. This is why I want to know the toughest problems in Unicode, so I can test for them and fix them. – Hundley
+1 from me for the samples you already have. Fascinating to see how well modern browsers and VS Code handle this stuff. – Adagietto
@Adagietto thanks! I've separated them out into their own answer, since it seems I may already have a good enough list to make a good answer. – Hundley
18

Here are some strings that I use in tests like that:

  • Vertically-stacked characters: Z̤͔ͧ̑̓ä͖̭̈̇lͮ̒ͫǧ̗͚̚o̙̔ͮ̇͐̇
  • Right-to-left words: اختبار النص
  • Mixed-direction words: من left اليمين to الى right اليسار
  • Mixed-direction characters: a‭b‮c‭d‮e‭f‮g
  • Very long characters: ﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽﷽
  • Emoji with skintone variations: 👱👱🏻👱🏼👱🏽👱🏾👱🏿
  • Emoji with gender variations: 🧟‍♀️🧟‍♂️
  • Emoji created by combining codepoints: 👨‍❤️‍💋‍👨👩‍👩‍👧‍👦🏳️‍⚧️🇵🇷
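
To see why strings like these are hard, it can help to dump the code points behind them. Here is a minimal Python sketch using the standard `unicodedata` module. The names `SAMPLES` and `describe` are made up for this illustration, and the escapes assume the invisible characters in the mixed-direction sample are the U+202D and U+202E overrides:

```python
import unicodedata

# Hypothetical samples, written as escapes so that invisible characters
# survive copy-paste. They mirror the kinds of strings listed above.
SAMPLES = {
    "bidi overrides": "a\u202Db\u202Ec",   # assumed LRO / RLO controls
    "ZWJ family": "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "flag": "\U0001F1F5\U0001F1F7",        # regional indicators P + R
    "skin tone": "\U0001F471\U0001F3FD",   # base emoji + modifier
}


def describe(text):
    """Return (code point, general category, bidi class, name) per code point."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


if __name__ == "__main__":
    for label, sample in SAMPLES.items():
        print(label)
        for row in describe(sample):
            print("   ", *row)
```

A single visible glyph such as the family emoji turns out to be seven code points, which is roughly where cursor movement, truncation, and per-character styling tend to go wrong.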
Hundley answered 26/7, 2018 at 13:15 Comment(0)
2

There are a lot of good examples in the Big List of Naughty Strings:

https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt

I cannot include the whole file, but here are a few lines:

#   Unicode Subscript/Superscript/Accents
#
#   Strings which contain unicode subscripts/superscripts; can cause rendering issues


ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็ ด้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็็้้้้้้้้็็็็็้้้้้็็็็
#   Two-Byte Characters
#
#   Strings which contain two-byte characters: can cause rendering issues or character-length issues

田中さんにあげて下さい
#   Strings which contain two-byte letters: can cause issues with naïve UTF-16 capitalizers which think that 16 bits == 1 character

𐐜 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐙𐐊𐐡𐐝𐐓/𐐝𐐇𐐗𐐊𐐤𐐔 𐐒𐐋𐐗 𐐒𐐌 𐐜 𐐡𐐀𐐖𐐇𐐤𐐓𐐝 𐐱𐑂 𐑄 𐐔𐐇𐐝𐐀𐐡𐐇𐐓 𐐏𐐆𐐅𐐤𐐆𐐚𐐊𐐡𐐝𐐆𐐓𐐆
#   Special Unicode Characters Union
#
#   A super string recommended by VMware Inc. Globalization Team: can effectively cause rendering issues or character-length issues to validate product globalization readiness.

表ポあA鷗ŒéB逍Üߪąñ丂㐀𠀀
#   Ogham Text
#
#   The only unicode alphabet to use a space which isn't empty but should still act like a space.

᚛ᚄᚓᚐᚋᚒᚄ ᚑᚄᚂᚑᚏᚅ᚜
᚛                 ᚜

#   iOS Vulnerabilities
#
#   Strings which crashed iMessage in various versions of iOS

Powerلُلُصّبُلُلصّبُررً ॣ ॣh ॣ ॣ冗
🏳0🌈️
జ్ఞ‌ా
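
The "16 bits == 1 character" trap mentioned in the excerpt is easy to check for. A rough Python sketch (`lengths` is a hypothetical helper, not part of the list) that counts the same string as code points, UTF-16 code units, and UTF-8 bytes:

```python
def lengths(text):
    """Count the same string three ways."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    # U+7530 is in the Basic Multilingual Plane; U+10414 (Deseret) and
    # U+20000 (CJK Extension B) are not, so UTF-16 needs a surrogate pair.
    for sample in ("\u7530", "\U00010414", "\U00020000"):
        print(f"U+{ord(sample):04X}", lengths(sample))
```

Code that assumes one UTF-16 unit per character can split the Deseret line in the middle of a surrogate pair, so it is worth running length limits and truncation logic against those samples.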
Hundley answered 13/9, 2022 at 22:59 Comment(0)
1

Some others:

  • Reversible (mirrored) characters in right-to-left scripts. For example, parentheses get mirrored for display in Hebrew. The Unicode spec has a whole list of these reversible characters.
  • Scripts with letter shaping: Arabic, Devanagari (used for Hindi), etc.
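
The reversible characters mentioned above correspond to the Unicode Bidi_Mirrored property, which Python's standard `unicodedata.mirrored` reports. A small sketch for spotting them in a test string (the helper name `mirrored_chars` is made up here):

```python
import unicodedata


def mirrored_chars(text):
    """Return the characters in text that have the Bidi_Mirrored property."""
    return [ch for ch in text if unicodedata.mirrored(ch)]


if __name__ == "__main__":
    # Brackets and parentheses are mirrored; plain letters are not.
    print(mirrored_chars("a(b]c"))
```

This only flags candidates. Whether a given renderer actually swaps the glyph in a right-to-left run still has to be checked by eye or with screenshots.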
Phone answered 9/8, 2018 at 15:3 Comment(5)
These sound super fascinating! Do you have any samples? – Hundley
Microsoft's font development resources seem to have some good examples of script "shaping". They show examples where multiple Unicode characters get assembled into the proper shape for the script. Sorta like turning "ff" into the 'ff' ligature character, but much, much more complicated. Indic: learn.microsoft.com/en-us/typography/script-development/… Arabic: learn.microsoft.com/en-us/typography/script-development/arabic – Phone
Reversible characters are ones that can be tricky when the rendering context changes between left-to-right and right-to-left script. For example, in a left-to-right script (e.g. English), an opening bracket is rendered '['. But in a right-to-left script the opening bracket is rendered ']'. Within a single line of text that mixes L2R and R2L text, you have to keep track of the current direction in order to draw the correct glyphs in among the characters that can be rendered blindly (i.e. without consideration for the current direction). – Phone
Here's an issue with reversible characters in LibreOffice, including some test text strings: ask.libreoffice.org/en/question/18912/… – Phone
Those are very insightful indeed! I tried to edit this answer to include some, but it really didn't want me doing that 😜 - if you ever find a way to make that happen, Stack Overflow prefers that, so content isn't lost if links rot away. – Hundley
