How to split a string by emojis in C++

I'm trying to take a string of emojis and split it into a vector containing each individual emoji.

Given the string:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"😀", "🔍", "🦑", "😁", "🔍", "🎉", "😂", "🤣"};

Edit

I've tried to do:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a couple of seconds the program aborts with: terminate called after throwing an instance of 'std::bad_alloc'.
When I try to check how many emojis are in the string using:

std::string emojis = "😀🔍🦑😁🔍🎉😂🤣";
std::cout << emojis.size() << std::endl; // prints 32

it prints a number larger than the emoji count, which I assume is the size of the underlying Unicode data. I don't know much about Unicode, but I'm trying to figure out how to detect where the data of each emoji begins and ends so that I can split the string into individual emojis.

Adjutant answered 6/8, 2020 at 3:11 Comment(7)
How much do you know about unicode and character encodings? – Russel
And what did the debugger say? – Lunneta
@d4rk4ng31 I use VSCode without a debugger setup – Adjutant
@d4rk4ng31 VSCode is a fully capable editor (and the most popular according to the SO developer survey). You just need to set it up with a debugger. – Motive
Use a Unicode library instead. The C++ standard library doesn't have good Unicode support and can't find the UTF-8 character boundaries – Resect
"I don't know too much about unicode data", so read the classic The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Messeigneurs
@Messeigneurs I'll make sure to give it a read, thanks! – Adjutant

I would definitely recommend that you use a library with better Unicode support (all large frameworks have one), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode code points over multiple bytes, and that the leading bits of the first byte determine how many bytes a code point is made up of.

I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

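std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    // Clamp the length so a truncated trailing sequence cannot run past the end.
    auto remaining = static_cast<std::size_t>(input.cend() - it);
    std::size_t count = utf8_byte_count(static_cast<uint8_t>(*it));
    if (count > remaining) count = remaining;
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}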

int main() {
    // Note: a u8 literal converts to std::string only before C++20.
    std::string emojis = u8"😀🔍🦑😁🔍🎉😂🤣";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}

Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
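As a rough illustration of that exercise, here is a minimal sketch of decoding a 4-byte UTF-8 sequence and testing the resulting code point against a range. The names decode_4byte_utf8 and looks_like_emoji are hypothetical, and the range U+1F300..U+1FAFF is only an assumption that covers the common pictograph blocks: it also includes some non-emoji symbols and misses emojis encoded in fewer than four bytes.

```cpp
#include <cstdint>
#include <string>

// Decode one 4-byte UTF-8 sequence into its code point.
// Returns 0 if `s` is not a 4-byte sequence with valid lead/continuation bits.
inline char32_t decode_4byte_utf8(const std::string& s) {
  if (s.size() != 4) return 0;
  const auto b0 = static_cast<std::uint8_t>(s[0]);
  const auto b1 = static_cast<std::uint8_t>(s[1]);
  const auto b2 = static_cast<std::uint8_t>(s[2]);
  const auto b3 = static_cast<std::uint8_t>(s[3]);
  if ((b0 & 0xF8) != 0xF0) return 0;  // lead byte must be 11110xxx
  if ((b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80 || (b3 & 0xC0) != 0x80)
    return 0;                         // continuation bytes must be 10xxxxxx
  return (static_cast<char32_t>(b0 & 0x07) << 18) |
         (static_cast<char32_t>(b1 & 0x3F) << 12) |
         (static_cast<char32_t>(b2 & 0x3F) << 6) |
         static_cast<char32_t>(b3 & 0x3F);
}

// Hypothetical, deliberately rough check (assumed range, not an official
// emoji list): true if the code point lies in U+1F300..U+1FAFF.
inline bool looks_like_emoji(const std::string& s) {
  const char32_t cp = decode_4byte_utf8(s);
  return cp >= 0x1F300 && cp <= 0x1FAFF;
}
```

Each element returned by split_by_codepoint could be passed to looks_like_emoji to filter the vector. For an exact answer, the emoji properties published in the Unicode data files are the authoritative source.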

Messeigneurs answered 6/8, 2020 at 8:1 Comment(5)
Can we use something like a wchar for emojis? :P – Egin
Only if your platform has a 32-bit wchar_t and you can live with the wasted memory. Microsoft has a 16-bit wchar_t, so it still needs surrogate pairs to represent emoji, as does Qt's QString, which stores UTF-16 internally. – Messeigneurs
Alternatively, you can use a UTF-8 iterator adaptor – Messeigneurs
Also note that one Unicode code point is not necessarily one character, especially with emojis. There are grapheme clusters that span several code points. E.g. 👨‍❤️‍👨 is made of several code points (a man, a heart, and another man, joined by ZWJs), and some flags consist of U+1F3F4 Waving Black Flag, a few tag characters indicating the region, and U+E007F – Kare
Even though some emojis span several code points, this works perfectly for what I am doing – Adjutant
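
For readers who do need the grapheme-cluster caveat from the comments handled, here is a simplified sketch that glues zero-width-joiner sequences and variation selectors back onto the preceding code point. The names utf8_len and split_joining_zwj are hypothetical, and this is an approximation under stated assumptions: it does not handle skin-tone modifiers, regional-indicator flag pairs, or tag sequences, and it is not a substitute for the full Unicode text segmentation rules that a Unicode library implements.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence introduced by lead byte `c`.
// Malformed lead bytes are treated as a 1-byte sequence.
inline std::size_t utf8_len(unsigned char c) {
  if ((c & 0x80) == 0x00) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 1;
}

// Hypothetical helper: split into code points, then append a code point to
// the previous element when it is a ZWJ, a variation selector-16, or when
// it directly follows a ZWJ. Simplified; not full grapheme segmentation.
inline std::vector<std::string> split_joining_zwj(const std::string& input) {
  const std::string zwj = "\xE2\x80\x8D";   // U+200D ZERO WIDTH JOINER
  const std::string vs16 = "\xEF\xB8\x8F";  // U+FE0F VARIATION SELECTOR-16
  std::vector<std::string> out;
  bool join_next = false;  // true when the previous code point was a ZWJ
  for (std::size_t i = 0; i < input.size();) {
    std::size_t n = utf8_len(static_cast<unsigned char>(input[i]));
    if (n > input.size() - i) n = input.size() - i;  // truncated tail
    const std::string cp = input.substr(i, n);
    i += n;
    const bool is_joiner = (cp == zwj);
    const bool attaches = is_joiner || cp == vs16 || join_next;
    if (attaches && !out.empty()) {
      out.back() += cp;   // extend the current cluster
    } else {
      out.push_back(cp);  // start a new cluster
    }
    join_next = is_joiner;
  }
  return out;
}
```

With this sketch, a ZWJ sequence such as 👨‍❤️‍👨 ends up as a single vector element instead of being split into its individual code points.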
