Reversing Unicode text is tricky for a lot of reasons.
First, depending on the programming language, strings are represented in different ways: as a sequence of bytes, as a sequence of UTF-16 code units (16 bits wide, often called "characters" in the API), or as UCS-4 code points (4 bytes wide).
Second, different APIs reflect that inner representation to different degrees. Some work on the abstraction of bytes, some on UTF-16 characters, some on code points. When the representation uses bytes or UTF-16 characters, there are usually parts of the API that give you access to the elements of this representation, as well as parts that perform the necessary logic to get from bytes (via UTF-8) or from UTF-16 characters to the actual code points.
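To make the difference between the two views concrete, here is a small sketch in JavaScript, whose strings are sequences of UTF-16 code units. The variable names and the sample string are mine, chosen only for illustration.

```javascript
// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
// two code units (a surrogate pair) to store it.
const sample = 'a\u{1F600}';

// Code-unit view: length, split('') and charCodeAt() see three units.
const codeUnits = sample.split('');
console.log(sample.length);                      // 3
console.log(sample.charCodeAt(1).toString(16));  // d83d (high surrogate)

// Code-point view: the string iterator and codePointAt() see two code points.
const codePoints = [...sample];
console.log(codePoints.length);                  // 2
console.log(sample.codePointAt(1).toString(16)); // 1f600
```

Reversing `codeUnits` would tear the surrogate pair apart, while reversing `codePoints` keeps it intact.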
Often, the parts of the API that perform this logic and thus give you access to the code points were added later: first there was 7-bit ASCII, a bit later everybody thought 8 bits were enough (using different code pages), and later still that 16 bits were enough for Unicode. The notion of code points as integer numbers not tied to a fixed storage width was historically added as the fourth common character size for logically encoding text.
With an API that gives you access to the actual code points, it might seem the problem is solved. But...
Third, there are a lot of modifier code points that affect a neighboring code point. E.g. the combining diaeresis (U+0308) turns the a stored before it into an ä, an e into an ë, etc. Reverse the code points of äe (stored as a, U+0308, e) and you get ëa (e, U+0308, a), made of different letters. There is a direct representation of e.g. ä as its own code point, but using the modifier is just as valid.
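A minimal sketch of that effect, assuming the combining diaeresis U+0308, which Unicode stores after the letter it modifies. The variable names are made up for the example.

```javascript
// 'a' + COMBINING DIAERESIS (U+0308) + 'e' renders as "a with diaeresis", then "e".
const decomposed = 'a\u0308e';

// Naive reversal by code point: the mark now follows 'e' instead of 'a'.
const reversed = [...decomposed].reverse().join('');
console.log(reversed === 'e\u0308a'); // true

// The precomposed code point U+00E4 is canonically equivalent to a + U+0308,
// but the two strings only compare equal after normalization.
const precomposed = '\u00e4';
console.log(precomposed === 'a\u0308');                  // false
console.log(precomposed === 'a\u0308'.normalize('NFC')); // true
```

Normalizing to NFC before reversing helps only where a precomposed code point exists, so it is not a general fix.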
Fourth, everything is in constant flux. There are also a lot of modifiers among the emoji, as used in the example, and more are added every year. Therefore, if an API tells you whether a code point is a modifier, the version of that API determines whether it already knows a specific new modifier.
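One way to let the runtime decide what belongs together is to reverse by grapheme cluster instead of by code point. The following is only a sketch: it assumes a runtime that ships `Intl.Segmenter`, `reverseGraphemes` is a hypothetical helper name, and which emoji sequences count as one cluster depends on the Unicode data the runtime was built with.

```javascript
// Reverse by grapheme cluster rather than by code point.
// Assumption: Intl.Segmenter is available in this runtime.
function reverseGraphemes(input) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each segment is one user-perceived character, e.g. a base letter plus its marks.
  const clusters = Array.from(segmenter.segment(input), (s) => s.segment);
  return clusters.reverse().join('');
}

// 'a' + U+0308 stays together, so the diaeresis remains on the 'a'.
console.log(reverseGraphemes('a\u0308e') === 'ea\u0308'); // true
```

An older runtime may still split a recently added emoji sequence, which is exactly the versioning problem described above.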
Unicode provides a hacky trick, though, for when it's only about the visual appearance:
There are writing direction modifiers. The example uses left-to-right writing direction. Just add a right-to-left writing direction modifier at the beginning of the text and, depending on the version of the API / browser, it will look correctly reversed.
'\u202e' is called right-to-left override; it is the strongest version of the right-to-left marker.
See this explanation by w3.org
const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦'
console.log('\u202e' + text)
const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦'
let original = document.getElementById('original')
original.appendChild(document.createTextNode(text))
let result = document.getElementById('result')
result.appendChild(document.createTextNode('\u202e' + text))
body {
font-family: sans-serif
}
<p id="original"></p>
<p id="result"></p>
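Keep in mind that the override only changes how the text is displayed; the stored string is untouched. Here is a small sketch of that, which also closes the override with U+202C (pop directional formatting) so that it does not affect text appended afterwards. `visuallyReversed` is a hypothetical helper name.

```javascript
// U+202E (RIGHT-TO-LEFT OVERRIDE) changes only how the text is displayed.
const RLO = '\u202e';
// U+202C (POP DIRECTIONAL FORMATTING) ends the override started by RLO.
const PDF = '\u202c';

// Hypothetical helper: wrap the text so that only this run is displayed reversed.
function visuallyReversed(input) {
  return RLO + input + PDF;
}

const wrapped = visuallyReversed('Hello world');
console.log(wrapped);

// The stored code units are still in the original order.
console.log(wrapped.slice(1, -1) === 'Hello world'); // true
console.log(wrapped.length);                         // 13
```

So if the reversed text has to be stored, compared, or searched, this trick is not enough.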
οΏ½
characters, and then there's an extra connecting character, which is charcode 8205, and then there's another two οΏ½ that represent "red hair", and those 5 characters together mean 'womans face with red hair' β Shurwoodarr1 = text.split('');
andarr2 = [...text];
give different arrays, witharr2
having the two οΏ½ correctly combined to a single emoji. If you were going to go about using the logic to combine based on charcode 8205, I would use the latter syntax, [...text], as it will be easier to keep the combinations in order β Shurwoodstr.charCodeAt(0)
-- the argument is the index of the character β Shurwoodö
character or a combination ofo
+ Umlaut is that, when Unicode was created, they wanted to include every character from every existing widely-used character set. ISO8859-1 and others include theö
, so it was added even though it is redundant. For the same reason we have both the Latin o and the Greek omicron even though they have the same glyph, but without having both of them, it would be impossible to convert a document with β¦ β UlaniText
datatype which works on graphemes. For what my employer is using JS, even theString
support there is currently is complete overkill. On the other hand, we could desperately need support for units of measure, two-dimensional boolean arrays, enums, and various kinds of timestamps and timespans in nanosecond, microsecond, millisecond, sample, and frame resolution. However, this, in turn doesn't make sense for JS as used as the query, data definition, and schema definition language for CouchDB. β Ulani