Splitting an emoji sequence in PowerShell

I have a text box that will be filled with emoji only. No spaces or characters of any kind. I need to split these emoji in order to identify them. This is what I have tried:

function emoji_to_unicode(){
    foreach ($emoji in $textbox.Text) {
        $unicode = [System.Text.Encoding]::Unicode.GetBytes($emoji)
        Write-Host $unicode
    }
}

Instead of printing the bytes of each emoji separately, the loop runs just once and prints the byte values of all the emoji joined together. It's as if all the emoji were a single item. I tested with 6 emoji, and instead of getting this:

61 216 7 222

61 216 67 222

61 216 10 222

61 216 28 222

61 216 86 220

60 216 174 223

I'm getting this:

61 216 7 222 61 216 67 222 61 216 10 222 61 216 28 222 61 216 86 220 60 216 174 223

What am I missing?

Mucor answered 15/6, 2020 at 15:31 Comment(2)
Windows PowerShell or PowerShell Core? – Asa
Windows PowerShell – Mucor

A string is just one element. You want to change it to a character array.

foreach ($i in 'hithere') { $i }
hithere

foreach ($i in [char[]]'hithere') { $i }
h
i
t
h
e
r
e

Hmm, this doesn't work well. These code points are pretty high, U+1F600 and up, beyond what a single 16-bit [char] can hold:

foreach ($i in [char[]]'😀😁😂😃😄😅😆') { $i }
�  # 16 bit surrogate pairs?
�
�
�
�
�
�
�
�
�
�
�
�
�

Hmm, OK: combine each pair. Here's another way to do it, using the surrogate-pair formula from https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates (or just use [char]::ConvertToUtf32($emojis, $i)):

$emojis = '😀😁😂😃😄😅😆'
for ($i = 0; $i -lt $emojis.Length; $i += 2) {
  [System.Char]::IsHighSurrogate($emojis[$i])                                        # confirm this index starts a surrogate pair
  0x10000 + ($emojis[$i] - 0xD800) * 0x400 + $emojis[$i+1] - 0xDC00 | % tostring x   # combine the pair into a code point, shown as hex
  # [System.Char]::ConvertToUtf32($emojis, $i) | % tostring x                        # or let .NET do the math
  $emojis[$i] + $emojis[$i+1]                                                        # char + char concatenates back into the emoji
}


True
1f600
😀
True
1f601
😁
True
1f602
😂
True
1f603
😃
True
1f604
😄
True
1f605
😅
True
1f606
😆

Note that Unicode in the [System.Text.Encoding]::Unicode.GetBytes() method call refers to the UTF-16LE encoding.
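
As a minimal check (using a single 😀 here rather than the question's text box), the four bytes per emoji are just the two UTF-16 code units in little-endian order:

$bytes = [System.Text.Encoding]::Unicode.GetBytes('😀')
$bytes -join ' '                                    # 61 216 0 222  ->  0x3D 0xD8 0x00 0xDE
[BitConverter]::ToUInt16($bytes, 0) | % tostring x  # d83d  (high surrogate)
[BitConverter]::ToUInt16($bytes, 2) | % tostring x  # de00  (low surrogate)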

Chinese works, since those characters are in the Basic Multilingual Plane and each fits in a single [char]:

[char[]]'嗨，您好'
嗨
，
您
好

Here it is using the UTF-32 encoding. Every character is 4 bytes long, so we convert each group of 4 bytes into an Int32 and print it as hex.

$emoji = '😀😁😂😃😄😅😆'
$utf32 = [System.Text.Encoding]::UTF32.GetBytes($emoji)

for ($i = 0; $i -lt $utf32.Count; $i += 4) {
    $int32 = [BitConverter]::ToInt32($utf32[$i..($i+3)], 0)   # 4 little-endian bytes -> one code point
    $int32 | % tostring x
}

1f600
1f601
1f602
1f603
1f604
1f605
1f606

Or going the other way, from Int32 to string. Simply casting the Int32 to [char] does not work for code points above U+FFFF; you have to build the two-[char] surrogate pair, which ConvertFromUtf32 does for you. Script reference: https://www.powershellgallery.com/packages/Emojis/0.1/Content/Emojis.psm1

for ($i = 0x1f600; $i -le 0x1f606; $i++ ) { [System.Char]::ConvertFromUtf32($i) }

😀
😁
😂
😃
😄
😅
😆

See also How to encode 32-bit Unicode characters in a PowerShell string literal?
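
As a quick sketch of what that question covers: the `u{} escape works in newer PowerShell versions (not in Windows PowerShell 5.1), while ConvertFromUtf32 works everywhere:

"`u{1F600}"                        # escape sequence in a double-quoted string -> 😀
[char]::ConvertFromUtf32(0x1F600)  # also works in Windows PowerShell 5.1      -> 😀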

EDIT:

PowerShell 7 has a nice EnumerateRunes() method:

$emojis = '😀😁😂😃😄😅😆'
$emojis.enumeraterunes() | % value | % tostring x

1f600
1f601
1f602
1f603
1f604
1f605
1f606
Diathesis answered 15/6, 2020 at 15:41 Comment(0)

Solution (works on both PowerShell 5.1 and 7.4)

$str = 'hithere😀😁😂😃😄😅😆'

[Globalization.StringInfo]::GetTextElementEnumerator($str) | &{process{
  # SURROGATE PAIR HAS LENGTH OF 2
  if( $_.Length -eq 2 ){
    $highSurr, $lowSurr = [char[]]$_ -as 'int[]'
    $surrPair = ($highSurr - 0xD800) * 0x400 + $lowSurr - 0xDC00 + 0x10000
    "{0} `t 0x{1:X} `t 0x{2:X} + 0x{3:X}" -f $_, $surrPair, $highSurr, $lowSurr
  }
  elseif( $_.Length -eq 1 ){
    "{0} `t 0x{1:X4}" -f $_, [int][char]$_
  }
}}

Result

h    0x0068
i    0x0069
t    0x0074
h    0x0068
e    0x0065
r    0x0072
e    0x0065
😀   0x1F600     0xD83D + 0xDE00
😁   0x1F601     0xD83D + 0xDE01
😂   0x1F602     0xD83D + 0xDE02
😃   0x1F603     0xD83D + 0xDE03
😄   0x1F604     0xD83D + 0xDE04
😅   0x1F605     0xD83D + 0xDE05
😆   0x1F606     0xD83D + 0xDE06

(Any misalignment is due to how the site renders the output; it should line up in your terminal.)

How It Works

First we use the .NET method [Globalization.StringInfo]::GetTextElementEnumerator to split the string into its text elements (what a reader perceives as individual characters); then, in the pipeline, we determine whether each element is a surrogate pair.
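
For example (using a short illustrative string rather than the original text box), each item the enumerator emits is one complete text element:

[Globalization.StringInfo]::GetTextElementEnumerator('hi😀') | ForEach-Object { "<$_>" }
# <h>
# <i>
# <😀>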

If it is a surrogate pair, we assign the high-surrogate and low-surrogate variables by casting the pipeline object to an array of [char] objects (16-bit UTF-16 code units), then casting that to an int array.

After that, we compute the Unicode code point from the high- and low-surrogate values using the standard surrogate-pair formula.

Finally, we print the values using the -f format operator with hex formatting (a lowercase x in the format specifier produces lowercase hex letters; an uppercase X produces uppercase ones).

If it isn't a surrogate pair, we simply cast the pipeline object to a char and then to an int. We again use the -f format operator to print the value in hex, but give it a width of 4 (placed right after the X) so values like 0x68 are printed as 0x0068.
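
As a quick sanity check of the surrogate-pair formula, here is the arithmetic for a single emoji (😀, U+1F600, chosen only for illustration):

$s    = [char]::ConvertFromUtf32(0x1F600)   # "😀" as two UTF-16 code units
$high = [int]$s[0]                          # 0xD83D
$low  = [int]$s[1]                          # 0xDE00
($high - 0xD800) * 0x400 + ($low - 0xDC00) + 0x10000 | % tostring x   # 1f600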

Desalinate answered 7/12, 2023 at 20:20 Comment(2)
Nice........... – Diathesis
@Diathesis Posted a slight revision: $_.Length -eq 2 is much simpler than [Char]::IsSurrogatePair, not sure why I didn't think of that before... though I suppose [Char]::IsSurrogatePair is clearer without needing a comment. However, I don't completely trust [Char] methods, as they tend to be kept out-of-date for backward-compatibility purposes. – Desalinate
