How to split a string by emojis in C++ - McMap

How to split a string by emojis in C++

Asked 6/8, 2020 at 3:11 Answered 6/8, 2020 at 8:1

Solved c++ emoji

I'm trying to take a string of emojis and split it into a vector containing each emoji.

Given the string:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};

Edit

I've tried to do:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a couple of seconds it terminates with: terminate called after throwing an instance of 'std::bad_alloc'.
When trying to check how many emojis are in a string using:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // returns 32

It returns a bigger number than the 8 emojis, which I assume is due to the Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where the data of each emoji begins and ends so that I can split the string into individual emojis.

Adjutant asked 6/8, 2020 at 3:11 Comment(7)

How much do you know about unicode and character encodings? – Russel 6/8, 2020 at 3:16

And what did the debugger say? – Lunneta 6/8, 2020 at 4:17

@d4rk4ng31 I use VSCode without a debugger setup – Adjutant 6/8, 2020 at 4:35

@d4rk4ng31 VSCode is a fully capable editor (and the most popular according to the SO developer survey). You just need to set it up with a debugger. – Motive 6/8, 2020 at 5:42

Use a Unicode library instead. The C++ stdlib doesn't have good Unicode support and can't know the UTF-8 character boundaries – Resect 6/8, 2020 at 5:43

"I don't know too much about unicode data", so read the classic The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Messeigneurs 6/8, 2020 at 7:21

@Messeigneurs I'll make sure to give it a read, thanks! – Adjutant 6/8, 2020 at 12:35

I would definitely recommend that you use a library with better Unicode support (all large frameworks provide it), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the leading bits of the first byte determine how many bytes a character is made up of.

I stole a function from Boost. The split_by_codepoint function iterates over the input string, constructs a new string from the next N bytes (where N is determined by the byte count function), and pushes it onto the ret vector.

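#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

// Splits a UTF-8 string into one std::string per code point.
std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    // Cast explicitly so bytes >= 0x80 are treated as unsigned values.
    std::size_t count = utf8_byte_count(static_cast<uint8_t>(*it));
    // Clamp so a truncated trailing sequence cannot run past the end.
    count = std::min<std::size_t>(count, input.cend() - it);
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}

int main() {
    // Assumes the source file and execution character set are UTF-8.
    // (Since C++20 a u8"" literal no longer converts to std::string.)
    std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl; // prints 8
}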

Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining whether a code point is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and check whether the resulting code point falls in one of the emoji ranges.
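A minimal sketch of that exercise, under stated assumptions: decode_utf8_4byte and looks_like_emoji are hypothetical helper names, and the range U+1F300..U+1FAFF is a deliberately coarse approximation. It includes some non-emoji symbols and misses emojis encoded in fewer than 4 bytes, so treat it as a starting point rather than a real emoji test.

```cpp
#include <cstdint>
#include <string>

// Decodes one 4-byte UTF-8 sequence into its code point.
// Returns 0 if `s` is not a well-formed 4-byte sequence
// (it does not reject overlong or out-of-range encodings).
inline char32_t decode_utf8_4byte(const std::string& s) {
  if (s.size() != 4) return 0;
  const auto b0 = static_cast<uint8_t>(s[0]);
  const auto b1 = static_cast<uint8_t>(s[1]);
  const auto b2 = static_cast<uint8_t>(s[2]);
  const auto b3 = static_cast<uint8_t>(s[3]);
  // The lead byte must look like 11110xxx, the others like 10xxxxxx.
  if ((b0 & 0xF8) != 0xF0) return 0;
  if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80) return 0;
  // 3 payload bits from the lead byte, 6 from each continuation byte.
  return (char32_t(b0 & 0x07) << 18) | (char32_t(b1 & 0x3F) << 12) |
         (char32_t(b2 & 0x3F) << 6) | char32_t(b3 & 0x3F);
}

// Assumption: a single coarse range stands in for "the proper range".
inline bool looks_like_emoji(char32_t cp) {
  return cp >= 0x1F300 && cp <= 0x1FAFF;
}
```

Each element returned by split_by_codepoint could be passed through decode_utf8_4byte and then looks_like_emoji; for "😀" (bytes F0 9F 98 80) the decoded value is U+1F600.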

Messeigneurs answered 6/8, 2020 at 8:1 Comment(5)

Can we use something like a wchar for emojis? :P – Egin 6/8, 2020 at 8:16

Only if your platform has a 32-bit wchar_t and you can live with the wasted memory. Microsoft has a 16-bit wchar_t, so it still needs surrogate pairs to represent emoji; Qt's QString, which stores UTF-16 internally, has the same limitation. – Messeigneurs 6/8, 2020 at 8:19
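To illustrate the 16-bit case, here is a small sketch of how a code point above U+FFFF ends up as two UTF-16 code units. The helper name to_utf16 is hypothetical, and the function assumes its argument is a valid Unicode scalar value (no validation is done).

```cpp
#include <string>

// Encodes one code point as UTF-16.
// Code points above U+FFFF do not fit in 16 bits and become a surrogate pair.
inline std::u16string to_utf16(char32_t cp) {
  std::u16string out;
  if (cp < 0x10000) {
    out.push_back(static_cast<char16_t>(cp));
  } else {
    cp -= 0x10000;  // leaves a 20-bit value
    out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate: top 10 bits
    out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate: bottom 10 bits
  }
  return out;
}
```

For U+1F600 (😀) this yields the two units 0xD83D and 0xDE00, so a 16-bit wchar_t string still cannot be split one element per emoji.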

Alternatively, you can use a UTF-8 iterator adaptor – Messeigneurs 6/8, 2020 at 8:29

Also note that it is not true that 1 Unicode code point = 1 character, especially with emojis. There are grapheme clusters that take up several code points. E.g. 👨‍❤️‍👨 is at least 5 code points (two people and a heart, joined by ZWJs), and some flags consist of U+1F3F4 (Waving Black Flag), 2-5 tag characters indicating the country or region, and U+E007F (Cancel Tag) – Kare 6/8, 2020 at 9:9
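A quick way to see this is to count code points in a ZWJ sequence. The sketch below uses a hypothetical helper, count_code_points, which counts every byte that is not a UTF-8 continuation byte; it assumes well-formed UTF-8 input. Grouping code points into grapheme clusters is a separate problem that needs the Unicode segmentation rules, which is where a library such as ICU comes in.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting the bytes that
// start a sequence, i.e. every byte that is not of the form 10xxxxxx.
inline std::size_t count_code_points(const std::string& utf8) {
  std::size_t n = 0;
  for (unsigned char b : utf8) {
    if ((b & 0xC0) != 0x80) ++n;
  }
  return n;
}
```

Written out as U+1F468, U+200D, U+2764, U+FE0F, U+200D, U+1F468, the "couple with heart" sequence counts as 6 code points, so a per-code-point split breaks one visible emoji into 6 pieces.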

Even though some emojis are 5 Unicode characters, this works perfectly for what I am doing – Adjutant 6/8, 2020 at 12:38
