How to reverse a string that contains complicated emojis?
F

10

205

Input:

Hello world👩‍🦰👩‍👩‍👦‍👦

Desired Output:

👩‍👩‍👦‍👦👩‍🦰dlrow olleH

I tried several approaches but none gave me the correct answer.

This failed miserably:

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';

const reversed = text.split('').reverse().join('');

console.log(reversed);

This kind of works, but it breaks 👩‍👩‍👦‍👦 into 4 different emojis:

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';

const reversed = [...text].reverse().join('');

console.log(reversed);

I also tried every answer in this question, but none of them worked.

How can I get the desired output?

Fever answered 30/9, 2020 at 11:28 Comment(19)
I can't see the problem with the second solution. What am I missing? – Rosado
So these emojis are actually combinatorial emojis somehow, it's quite interesting. First, you have the woman's face emoji, which itself is represented by two of your � characters, and then there's an extra connecting character, which is charcode 8205, and then there's another two � that represent "red hair", and those 5 characters together mean 'woman's face with red hair' – Shurwood
To properly reverse a string with combined emojis would be pretty complicated, I think. You'd have to check if each emoji is followed by charcode 8205, and if it is you'd have to combine it with the previous emoji instead of treating it as its own character. Pretty complicated... – Shurwood
It's very curious that arr1 = text.split(''); and arr2 = [...text]; give different arrays, with arr2 having the two � correctly combined to a single emoji. If you were going to go about using the logic to combine based on charcode 8205, I would use the latter syntax, [...text], as it will be easier to keep the combinations in order – Shurwood
btw, you can check the charcode of a single-character string by using str.charCodeAt(0) -- the argument is the index of the character – Shurwood
Javascript confuses me. It's the strangest mix of low and high level language concepts. It's high level in that it fully abstracts memory (no pointers, no manual memory management), but so low level as to treat strings as dumb code points rather than extended grapheme clusters. It's really confusing, and it means I never know what to expect when working with this thing. – Barrage
@Alexander-ReinstateMonica is there any language that does splitting by grapheme splitting by default? JS just provides standard strings encoded in UTF-16. – Pruchno
Could this perhaps be related to unicode Normalization Form Canonical Composition / Decomposition? If expressed in NFC, does each "complicated emoji" sequence condense into a single code point? #45269195 – Whetstone
@Pruchno I'm not asking for it to be the default necessarily, but I would expect it to be built into the stdlib of a modern language, particularly one designed for front-end use, where internationalization is so important. (And to answer the question: yes, Swift does correct grapheme splitting in all string operations. There are some trade-offs, but correctness is usually more important than the downsides, IMO) – Barrage
@Alexander-ReinstateMonica: JavaScript is explicitly designed as an embeddable scripting language, and I am using the "original" definition of scripting here, as in "programming where most values, types, and operations come from the outside, and you are not in control of the lifetime of most values or even your own program". For a language with that design goal, it makes sense to have a small to non-existent stdlib, so that the types, values, and operations provided by the embedded stdlib do not interfere with the ones provided by the embedding host environment. Remember, you cannot … – Ulani
… even do I/O in JavaScript, there is no way to read or write a file, there isn't even a way to print text to a console. (Which again makes sense, because depending on where JavaScript is embedded, a concept such as "file" may not even exist, and there may not be a console.) JavaScript is used as query language for databases, as configuration language for embedded devices, as extension language for applications. In the REAPER Digital Audio Workstation, it is used to write DSP Algorithms for audio effects. Why would a reverb effect need to know about grapheme clusters? – Ulani
@MarkU: No. The redundant code points are mostly legacy, for reasons of roundtrip-compatibility. E.g. the reason why my name can be spelled both with a single ö character or a combination of o + Umlaut is that, when Unicode was created, they wanted to include every character from every existing widely-used character set. ISO8859-1 and others include the ö, so it was added even though it is redundant. For the same reason we have both the Latin o and the Greek omicron even though they have the same glyph, but without having both of them, it would be impossible to convert a document with … – Ulani
… ISO8859-7 encoding to Unicode and back without information loss, since ISO8859-7 includes both characters. However, the same is not true for Emoji characters. They are a unique Unicode invention, so there is no need for backwards-compatibility with legacy encodings, and providing precomposed characters for every possible combination of gender, clothing, facial expression, skin color, hair color, etc. would be insane. A simple example are the flag emojis. They are simply a character that says "this is a flag" plus the two character ISO3166-1 Alpha-2 country code. – Ulani
Still a work in progress, but maybe we're going to get a native solution in JS for that: github.com/tc39/proposal-intl-segmenter - note: some browsers already have this on their nightly versions. – Gym
@JörgWMittag Requirements have changed. It now runs all client side web code, and a scary amount of server-side code (Node.js, Deno). Internationalization is critical, it needs to do it well, or the web will behave incorrectly for non-english users (and in the case of emoji, even english users will suffer the flaws). – Barrage
@Alexander-ReinstateMonica: In that case, the HTML Spec can specify e.g. a Text datatype which works on graphemes. For what my employer is using JS, even the String support there is currently is complete overkill. On the other hand, we could desperately need support for units of measure, two-dimensional boolean arrays, enums, and various kinds of timestamps and timespans in nanosecond, microsecond, millisecond, sample, and frame resolution. However, this, in turn doesn't make sense for JS as used as the query, data definition, and schema definition language for CouchDB. – Ulani
For an example of what happens if you don't handle this type of situation well, check out a minor issue in Firefox: Create a bookmark that has emojis in the bookmark's title. Now display bookmarks in the sidebar. Finally, resize the sidebar so its width causes it to be at an emoji in the bookmark title. The emoji will get broken up into multiple characters which render incorrectly. – Katheryn
You may be interested in this article. (The third question there is "How do you reverse a Unicode string?" with the answer being, "You can't." That being said, the article deals with arbitrary Unicode. If you're restricting to English characters and emojis there may be hope. :) ) – Echino
@PedroLima 👩‍👩‍👦‍👦 is one combined emoji, yet the individual emojis it consists of get reversed in the second solution (which wouldn't make sense if you consider it to be one character). – Reikoreilly
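The comments above describe the red-haired woman emoji as two emojis joined by charcode 8205 (U+200D, the zero-width joiner). A minimal sketch that makes the three levels visible; the string is written with escapes so the structure is explicit, and the variable names are only illustrative:

```javascript
// Woman (U+1F469) + zero-width joiner (U+200D) + red hair (U+1F9B0).
const redHairedWoman = '\u{1F469}\u200D\u{1F9B0}';

// split('') works on UTF-16 code units, so each emoji is cut into two lone surrogates.
const codeUnits = redHairedWoman.split('');

// The string iterator (used by spread) works on code points, so surrogate pairs stay together.
const codePoints = [...redHairedWoman];

console.log(codeUnits.length);  // 5
console.log(codePoints.length); // 3
console.log(codePoints.map((cp) => 'U+' + cp.codePointAt(0).toString(16).toUpperCase()));
// [ 'U+1F469', 'U+200D', 'U+1F9B0' ]
console.log(redHairedWoman.charCodeAt(2)); // 8205, the joiner
```

This is why [...text] keeps each individual emoji intact but still treats the joiner as a separate element that can be reordered.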
S
101

If you're able to, use the _.split() function provided by lodash. From version 4.0 onwards, _.split() is capable of splitting unicode emojis.

Using the native .reverse().join('') to reverse the 'characters' should work just fine with emojis containing zero-width joiners

function reverse(txt) { return _.split(txt, '').reverse().join(''); }

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';
console.log(reverse(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.20/lodash.min.js" integrity="sha512-90vH1Z83AJY9DmlWa8WkjkV79yfS2n2Oxhsi2dZbIv0nC4E6m5AbH8Nh156kkM7JePmqD6tcZsfad1ueoaovww==" crossorigin="anonymous"></script>
Shaina answered 30/9, 2020 at 13:9 Comment(6)
The changelogs you point at mention "v4.9.0 - Ensured _.split works with emojis", I think 4.0 might be too early. The comments in the code that is used to split the strings (github.com/lodash/lodash/blob/4.17.15/lodash.js#L261) refer to mathiasbynens.be/notes/javascript-unicode which is from 2013. It looks like it has moved on since then, but it does use a pretty hard to decipher lot of unicode regexes. I also can't see any tests in their codebase for unicode splitting. All this would make me wary of using it in production. – Rafael
It took only a little searching to find that this fails reverse("뎌쉐") (2 Korean graphemes) which gives "ᅰ셔ᄃ" (3 graphemes). – Rafael
It seems there's no easy native solution for this problem. Wouldn't prefer to import a library just for solving this, but it is indeed the most reliable/consistent way to do it at this point. – Fever
Kudos for getting this to work correctly 😎 Reversing writing direction in Firefox on Windows 10 still is a wee tad glitchy (the children end up in the rear), so lodash beat Windows 10, I guess, with what is likely a somewhat lower budget 😅 – Ky
@MichaelAnderson You could apply a preliminary .normalize("NFC") to prevent Hangul letters from being disassembled (and reassembled in a weird way). It merges letters in a syllable block into one code point. – Spermato
@КонстантинВан NFC normalisation is not enough in general. The 2 to 3 grapheme case was simply a quick counter-example that reversing lodash's split did not cover all the cases. There's a lot of other combined characters that have no single character equivalent. – Rafael
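The normalization suggested above can be tried on the Korean counter-example without lodash. A minimal sketch, assuming the failing input was the fully decomposed (jamo) spelling; as the last comment points out, this only helps where a precomposed code point exists, so it is not a general fix:

```javascript
// "뎌쉐" spelled as four conjoining jamo (decomposed form), written with escapes.
const decomposed = '\u1103\u1167\u1109\u1170';

// NFC merges each leading consonant + vowel pair into one precomposed syllable.
const composed = decomposed.normalize('NFC');

// After normalization every syllable is a single code point,
// so even a plain code-point reversal keeps the syllables intact.
const reversed = [...composed].reverse().join('');

console.log([...decomposed].length);      // 4
console.log([...composed].length);        // 2
console.log(reversed === '\uC250\uB38C'); // true
```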
C
56

I took TKoL's idea of using the \u200d character and used it to attempt to create a smaller script.

Note: Not all compositions use a zero width joiner so it will be buggy with other composition characters.

It uses the traditional for loop because we skip some iterations in case we find combined emoticons. Within the for loop there is a while loop to check if there is a following \u200d character. As long as there is one, we add the next 2 characters as well and advance the for loop by 2 iterations so combined emoticons are not reversed.

To easily use it on any string I made it as a new prototype function on the string object.

String.prototype.reverse = function() {
  let textArray = [...this];
  let reverseString = "";

  for (let i = 0; i < textArray.length; i++) {
    let char = textArray[i];
    while (textArray[i + 1] === '\u200d') {
      char += textArray[i + 1] + textArray[i + 2];
      i = i + 2;
    }
    reverseString = char + reverseString;
  }
  return reverseString;
}

const text = "Hello world👩‍🦰👩‍👩‍👦‍👦";

console.log(text.reverse());

//Fun fact, you can chain them to double reverse :)
//console.log(text.reverse().reverse());
Caddell answered 30/9, 2020 at 12:46 Comment(9)
I was thinking, when you drag and select the text on browsers, 👩‍👩‍👦‍👦 can only be selected as a whole. How do browsers know it's one character? Is there a built-in way to do it? – Fever
@HaoWu I'm not sure but I guess browsers do it a bit in the same way. These "complex" emoticons just seems to be "simple" emoticons with a so called "zero width joiner" character in between. I'm not sure if it is the browser or the character encoder that combines them together. – Caddell
@HaoWu this is what's known as "Unicode Segmentation" on "Grapheme Clusters". Your browser (which may use the one provided by your OS) is going to render and allow selection per grapheme cluster. You can read the spec here: unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries – Pruchno
@HaoWu: "How do browsers know it's one character?" – It's not "one character". It's multiple characters combining to form a single grapheme cluster, rendered as a single glyph. – Ulani
Same as here; not all compositions use a zero width joiner. – Joviality
This doesn't correctly reverse anything but characters composed with ZWJ. Please, not just here but as a general rule, use external libraries written by people who know what they're doing, instead of hacking up bespoke solutions that happen to work for one test case. The runes and lodash libraries were recommended in other answers (I can't vouch for either). – Ivaivah
this will fail with Aदे (U+0926) for example – Stanleigh
@Ivaivah I'm all for using existing, tested implementations. But I'm not a big fan of including a whole library if I only need one function from said library. I'd much rather use a modular design that adds only the functions I need, and to serve that code locally, and not actually externally. – Bedouin
@Bedouin - If you import a library written using modules and using a modern compiler, then I would hope that tree shaking would solve the issue of dead code for you. – Neoimpressionism
K
47

Reversing Unicode text is tricky for a lot of reasons.

First, depending on the programming language, strings are represented in different ways, either as a list of bytes, a list of UTF-16 code units (16 bits wide, often called "characters" in the API), or as ucs4 code points (4 bytes wide).

Second, different APIs reflect that inner representation to different degrees. Some work on the abstraction of bytes, some on UTF-16 characters, some on code points. When the representation uses bytes or UTF-16 characters, there are usually parts of the API that give you access to the elements of this representation, as well as parts that perform the necessary logic to get from bytes (via UTF-8) or from UTF-16 characters to the actual code points.

Often, the parts of the API performing that logic and thus giving you access to the code points have been added later, as first there was 7 bit ascii, then a bit later everybody thought 8 bits were enough, using different code pages, and even later that 16 bits were enough for unicode. The notion of code points as integer numbers without a fixed upper limit was historically added as the fourth common character length for logically encoding text.

Using an API that gives you access to the actual code points seems like that's it. But...

Third, there are a lot of combining code points that modify a neighbouring code point rather than standing alone. E.g. a combining diaeresis placed after an a turns it into ä, after an e into ë, etc. Turn the code points around, and äe becomes ëa: the diaeresis ends up on a different letter. There is a direct representation of e.g. ä as its own code point, but using the combining mark is just as valid.

Fourth, everything is in constant flux. There are also a lot of modifiers among the emoji, as used in the example, and more are added every year. Therefore, if an API gives you access to the information whether a code point is a modifier, the version of the API will determine whether it already knows a specific new modifier.
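A minimal sketch of the third point, using the combining diaeresis U+0308. In Unicode a combining mark is stored after the letter it decorates, so a code-point reversal hands it to the neighbouring letter. The strings are written with escapes and the variable names are illustrative:

```javascript
// "äe" spelled as: a + combining diaeresis (U+0308) + e.
const text = 'a\u0308e';

// Reversing by code point moves the mark from the "a" onto the "e".
const reversed = [...text].reverse().join('');

console.log(reversed === 'e\u0308a'); // true: the string now reads "ëa"

// The precomposed ä (U+00E4) is a single code point and survives the same reversal.
const precomposed = '\u00E4e';
console.log([...precomposed].reverse().join('') === 'e\u00E4'); // true
```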

Unicode provides a hacky trick, though, for when it's only about the visual appearance:

There are writing direction modifiers. In the case of the example, left-to-right writing direction is used. Just add a right-to-left writing direction modifier at the beginning of the text and depending on the version of the API / browser, it will look correctly reversed 😎

'\u202e' is called right to left override, it is the strongest version of the right to left marker.

See this explanation by w3.org

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦'
console.log('\u202e' + text)

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦'
let original = document.getElementById('original')
original.appendChild(document.createTextNode(text))
let result = document.getElementById('result')
result.appendChild(document.createTextNode('\u202e' + text))
body {
  font-family: sans-serif
}
<p id="original"></p>
<p id="result"></p>
Ky answered 1/10, 2020 at 14:16 Comment(9)
+1 very creative use of bidi (-: It's safer to close the override with a POP DIRECTIONAL FORMATTING char '\u202e' + text + '\u202c' to avoid affecting following text. – Plebeian
Thanks 😎 It's quite a hacky trick and the article I linked to goes into a lot of detail explaining why it's way smarter to use the HTML attributes, but this way I could just use string concatenation for my hack 😂 – Ky
Oh, I misread, yes, that would be smarter, too. But making a horrible hack smarter sometimes makes it look more legit, which tends to make it more likely to actually be applied 😅 – Ky
Btw. my firefox on this machine (win 10) doesn't get it entirely right, the children are behind the parents when writing right to left, I guess it's hard to get writing direction right with these massively complex emoji groups-of-people modifiers... – Ky
Another fun edge case: the regional indicator symbols used for flag emojis. If you take the string "🇦🇨" (the two code points U+1F1E6, U+1F1E8, making the flag for Ascension Island) and try to naively reverse it, you get "🇨🇦", the flag for Canada. – Albarran
About the edit requests: I intentionally used the term "characters", as not all UTF-16 characters are code points: some are lower and upper half surrogate pair characters, which is the entire point of UTF-16, as opposed to prior Unicode versions that assumed that 16 bits were enough. I'll change it to 2 or 4 bytes, though, that's way better, thanks! ^^ The use of the term ucs4 is deliberate but I will add a clarification :) – Ky
Oh, now I saw it... the internal representation of UTF-16 strings is not made up of 2 or 4 bytes wide code points. It is made up of 2 bytes wide UTF-16 characters. Using the more semantic description in this place would be inaccurate and would make it harder to understand what follows. You usually can create any sequence of UTF-16 characters, whether they can be translated to code points or not (like a sequence of five lower half surrogate pair elements) – Ky
@Ky FYI: "UTF-16 characters" (as you're using the term here) are otherwise known as "UTF-16 code units". "Character" tends to be too ambiguous of a term because it can refer to a lot of things (but in the context of Unicode usually a code point). – Haloid
Thanks for the better term :) The suggestion of code points was worse than characters (which is at least the term often used by the APIs) but code units is perfect ^^ Just changed it, plus a clarification about naming in the API. – Ky
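The suggestion in the first comment, closing the override with POP DIRECTIONAL FORMATTING, can be folded into a small helper. A minimal sketch; `visuallyReverse` is a hypothetical name, and the result only looks reversed when rendered by something that implements the Unicode bidirectional algorithm, since the underlying code points stay in their original order:

```javascript
const RLO = '\u202e'; // RIGHT-TO-LEFT OVERRIDE
const PDF = '\u202c'; // POP DIRECTIONAL FORMATTING

// Wrap the text so the override ends with it and does not leak into following text.
function visuallyReverse(text) {
  return RLO + text + PDF;
}

const wrapped = visuallyReverse('Hello world');
console.log(wrapped.length);       // 13: two control characters were added
console.log(wrapped.slice(1, -1)); // "Hello world", still stored left to right
```

Because nothing is actually reordered, this is only suitable for display; any code that inspects the string still sees the original order.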
S
40

I know! I'll use RegExp. What could go wrong? (Answer left as an exercise for the reader.)

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';

const reversed = text.match(/.(\u200d.)*/gu).reverse().join('');

console.log(reversed);
Scene answered 30/9, 2020 at 21:38 Comment(5)
Your answer sounds apologetic but, honestly, I’d call this answer close to canonical. It’s definitely superior to other answers attempting to do the same thing manually. Character-based text manipulation is what regex is designed for and excels at, and the Unicode consortium explicitly standardises the necessary regex features (which ECMAScript happens to implement correctly, in this instance). That said, it fails to handle combining characters (which IIRC regex should handle with . wildcards). – Steed
Doesn’t work with compositions not built with U+200D, e.g. 🏳️‍🌈. It’s worth noting that composed characters do also exist outside the emoji world… – Joviality
@StevenPenny 🏳️‍🌈 contains two compositions and one of them does not use U+200D. It’s easy to verify that 🏳️‍🌈 does not work with the code of this answer… – Joviality
@Joviality while it's true that 🏳️‍🌈 contains a composition not built with U+200D, it's a pretty bad example as it also contains a composition with U+200D. A better example would be something like 🧑🏻 or 🏳️ – Stanleigh
Conversely to the other comments here, not every use of a zero-width-joiner should be treated as a single grapheme cluster. For example, the last three lines of the unicode 13 grapheme test (unicode.org/Public/13.0.0/ucd/auxiliary/GraphemeBreakTest.txt) show three very similar cases where the ZWJ is handled differently. – Rafael
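Some of the shortcomings raised in these comments (combining characters, variation selectors, skin-tone modifiers) can be partly addressed while staying with a regex. A rough sketch, not an implementation of the Unicode segmentation rules; `CLUSTER` and `reverseWithRegex` are hypothetical names, and flags built from regional indicators or decomposed Hangul are still not handled:

```javascript
// One "cluster" is approximated as:
//   a non-mark code point,
//   followed by any combining marks / variation selectors (\p{M}) or skin-tone modifiers,
//   followed by any number of ZWJ-joined repeats of the same shape.
// A run of marks with no base (e.g. at the start of the string) is kept as its own chunk.
const CLUSTER = /\P{M}[\p{M}\p{Emoji_Modifier}]*(?:\u200d\P{M}[\p{M}\p{Emoji_Modifier}]*)*|\p{M}+/gu;

function reverseWithRegex(text) {
  // match() returns null for an empty string, hence the fallback to [].
  return (text.match(CLUSTER) || []).reverse().join('');
}

// Rainbow flag: white flag + variation selector 16 + ZWJ + rainbow.
const rainbowFlag = '\u{1F3F3}\uFE0F\u200D\u{1F308}';
console.log(reverseWithRegex('a' + rainbowFlag + 'b') === 'b' + rainbowFlag + 'a'); // true
console.log(reverseWithRegex('a\u0308e') === 'ea\u0308'); // true: the diaeresis stays on the "a"
```

Unlike the `.` wildcard, `\P{M}` also matches line breaks, so newlines are not silently dropped from the result.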
A
32

An alternative would be to use the runes library, a small but effective solution:

https://github.com/dotcypress/runes

const runes = require('runes')

// String.substring
'👨‍👨‍👧‍👧a'.substring(1) // => '�‍👨‍👧‍👧a'

// Runes
runes.substr('👨‍👨‍👧‍👧a', 1) // => 'a'

// join('') is needed here; a bare join() would insert commas between the elements
runes('12👩‍👩‍👦‍👦3🍕✓').reverse().join('');
// results in: "✓🍕3👩‍👩‍👦‍👦21"
Alo answered 1/10, 2020 at 7:45 Comment(4)
This is the best answer tbh. All these other answers have cases where they fail, this library (hopefully) meets all edge cases. – Fixity
It's funny that such "a simple question" at first glance turned out not to be an easy task to solve. Agree with Carson: the library will, hopefully, keep moving forward with updates and changes as emojis keep evolving. – Alo
Looks like this hasn't been updated for about 3 years. Unicode 11 was released about that time, but things have changed since then, with Unicode 13 being released later. There were some changes in the extended grapheme rules in 13. So there might be some edge cases this doesn't handle. (I've not looked through the code - but it is worth being careful with) – Rafael
I agree with @MichaelAnderson, this library appears to use a naive or old algorithm. To do this properly it should use the grapheme segmentation algorithm specified in Unicode. – Haloid
R
28

You don't just have trouble with emoji, but also with other combining characters. These things that feel like individual letters but are actually one-or-more unicode characters are called "extended grapheme clusters".

Breaking a string into these clusters is tricky (for example see these unicode docs). I would not rely on implementing it myself but use an existing library. Google pointed me at the grapheme-splitter library. The docs for this library contain some nice examples that will trip up most implementations.

Using this you should be able to write:

var splitter = new GraphemeSplitter();
var graphemes = splitter.splitGraphemes(string);
var reversed = graphemes.reverse().join('');

ASIDE: For visitors from the future, or those willing to live on the bleeding edge:

There is a proposal to add a grapheme segmenter to the javascript standard. (It actually provides other segmenting options too). It is in stage 3 review for acceptance at the moment and is currently implemented in JSC and V8 (see https://github.com/tc39/proposal-intl-segmenter/issues/114).

Using this the code would look like:

var segmenter = new Intl.Segmenter("en", {granularity: "grapheme"})
var segment_iterator = segmenter.segment(string)
var graphemes = []
for (let {segment} of segment_iterator) {
    graphemes.push(segment)
}
var reversed = graphemes.reverse().join('');

You can probably make this neater if you know more modern javascript than me...

There is an implementation here - but I don't know what it requires.

Note: This points out a fun issue that other answers haven't addressed yet. Segmentation can depend upon the locale that you are using - not just the characters in the string.

Rafael answered 1/10, 2020 at 4:43 Comment(6)
Looks like the code hasn't been updated for about 2 years - so its tables might not be up-to-date. So you might need to search for something more recent. – Rafael
Looks like a more recent fork of this library is available at github.com/flmnt/graphemer – Rafael
I'm surprised that I had to scroll this far down to see an answer that's actually correct. – Lithophyte
For the proposal example you could do const graphemes = Array.from(segment_iterator, ({segment}) => segment). – Haloid
Related: @rootEnginear's answer to "How can I split a string containing emoji into an array?" – Zaragoza
Uncaught ReferenceError: GraphemeSplitter is not defined – Tyika
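Following the suggestion in the comments, the proposal example can be condensed with Array.from. A minimal sketch, assuming a runtime that ships Intl.Segmenter; `reverseGraphemes` and its optional `locale` parameter are hypothetical names, the latter exposing the locale dependence mentioned in the note above:

```javascript
function reverseGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } records.
  return Array.from(segmenter.segment(text), ({ segment }) => segment)
    .reverse()
    .join('');
}

// The strings from the question, written with escapes.
const family = '\u{1F469}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}';
const redHair = '\u{1F469}\u200D\u{1F9B0}';

console.log(
  reverseGraphemes('Hello world' + redHair + family) === family + redHair + 'dlrow olleH'
); // true
```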
S
17

I just decided to do it for fun, was a good challenge. Not sure it's correct in all cases, so use at your own risk, but here it is:

function run() {
    const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';
    const newText = reverseText(text);
    console.log(newText);
}

function reverseText(text) {
    // first, create an array of characters
    let textArray = [...text];
    let lastCharConnector = false;
    textArray = textArray.reduce((acc, char, index) => {
        if (char.charCodeAt(0) === 8205) {
            const lastChar = acc[acc.length-1];
            if (Array.isArray(lastChar)) {
                lastChar.push(char);
            } else {
                acc[acc.length-1] = [lastChar, char];
            }
            lastCharConnector = true;
        } else if (lastCharConnector) {
            acc[acc.length-1].push(char);
            lastCharConnector = false;
        } else {
            acc.push(char);
            lastCharConnector = false;
        }
        return acc;
    }, []);
    
    console.log('initial text array', textArray);
    textArray = textArray.reverse();
    console.log('reversed text array', textArray);

    textArray = textArray.map((item) => {
        if (Array.isArray(item)) {
            return item.join('');
        } else {
            return item;
        }
    });

    return textArray.join('');
}

run();
Shurwood answered 30/9, 2020 at 12:8 Comment(2)
Well, actually it’s long because of the debug info. I really appreciate that – Fever
@AndrewSavinykh Not a code-golf, but was looking for a more elegant solution. Maybe not like one-liner crazy, but easy to remember. Such as the regex solution is a really good one imho. – Fever
C
3

Using Intl.Segmenter()

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';

[...new Intl.Segmenter().segment(text)].map(x => x.segment).reverse().join('');

// default granularity is grapheme so no need to specify options
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter
Chukar answered 31/5, 2023 at 15:55 Comment(2)
awesome solution, it's the native js API. – Tyika
AFAICT this is still a proposal (github.com/tc39/proposal-intl-segmenter) and not supported in all browsers. In particular the tables on the mozilla page show that Firefox does not support it yet. – Rafael
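Since availability is the concern raised above, one option is to feature-detect and degrade gracefully. A minimal sketch; `safeReverse` is a hypothetical name, and the fallback is the code-point reversal from the question, which still splits joined emojis:

```javascript
function safeReverse(text) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    // Preferred path: reverse by grapheme cluster (the default granularity).
    const segments = new Intl.Segmenter().segment(text);
    return [...segments].map((s) => s.segment).reverse().join('');
  }
  // Fallback (less accurate): reverse by code point.
  // ZWJ sequences and combining marks will be broken apart here.
  return [...text].reverse().join('');
}

console.log(safeReverse('abc')); // "cba" on either path
```

A library such as the ones mentioned in the other answers could be swapped in for the fallback branch where correctness on older runtimes matters.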
C
-4

You can use:

yourstring.split('').reverse().join('')

It should turn your string into a list, reverse it then make it a string again.

Caithness answered 20/10, 2020 at 13:58 Comment(1)
Did you read the question? Your code is exactly the code OP proved wrong in the question. – Robers
L
-5

const text = 'Hello world👩‍🦰👩‍👩‍👦‍👦';

const reversed = text.split('').reverse().join('');

console.log(reversed);

Lapidate answered 28/10, 2020 at 4:55 Comment(0)
