How to encode 32-bit Unicode characters in a PowerShell string literal?

This Stack Overflow question deals with 16-bit Unicode characters. I would like a similar solution that supports 32-bit characters. See this link for a listing of the various Unicode charts. For example, one range of 32-bit characters is the Musical Symbols block.

The answer in the question linked above doesn't work because it casts the System.Int32 value as a System.Char, which is a 16-bit type.
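
A minimal repro of the failure (the cast is range-checked, so anything above 0xFFFF throws):

[char]0x2122    # fine: U+2122 fits in a single UTF-16 code unit
[char]0x1D11E   # fails: 119070 is too large for System.Char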

Edit: Let me clarify that I don't particularly care about displaying the 32-bit Unicode character, I just want to store the character in a string variable.

Edit #2: I wrote a PowerShell snippet that uses the info in the marked answer and its comments. I would have put this in another comment, but comments can't be multi-line.

$inputValue = '1D11E'
$hexValue = [int]"0x$inputValue" - 0x10000                       # offset above the BMP (20 bits)
$highSurrogate = [int][math]::Floor($hexValue / 0x400) + 0xD800  # top 10 bits; Floor avoids [int]'s round-half-to-even
$lowSurrogate = $hexValue % 0x400 + 0xDC00                       # bottom 10 bits
$stringValue = [char]$highSurrogate + [char]$lowSurrogate        # char + char concatenates into a string
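
For U+1D11E this produces the same two code units as the built-in .NET helper, which makes a handy sanity check:

$stringValue -eq [char]::ConvertFromUtf32(0x1D11E)               # True
'{0:X4} {1:X4}' -f [int]$stringValue[0], [int]$stringValue[1]    # D834 DD1E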

Dour High Arch still deserves credit for the answer for helping me finally understand surrogate pairs.

Beckner answered 29/1, 2011 at 0:28 Comment(8)
Technically, there are no 32-bit Unicode code points, as Unicode is only a 21-bit code.Jarib
Seems nitpicky. Obviously U+1D11E doesn't use ALL 32 bits, but it is greater than 16 bits, thus why the question needed to be asked (since the linked question's answer only works for 16 bits). PowerShell and .NET have Int16 and Int32 types, is there one named Int21? Thus 32 is the next logical increment.Beckner
That ain't nitpicking. You're using Unicode terminology incorrectly, and being corrected. Unicode doesn't define characters as having the bitness property. It's incorrect to talk about "32-bit characters" or "16-bit characters", since Unicode defines neither concept. Character is an abstract writing symbol with various properties (like is it upper or lower case, is it RTL or LTR, &c). With how many bits it is encoded depends on the particular encoding used to encode the character into bytes. E.g. â is encoded to C3 A2 in UTF-8, and to E2 in ISO-8859-1 (aka Latin-1).Greenes
the other question already has updated answers for the full Unicode range. Try echo "`u{1F44D}" or echo ([char]::ConvertFromUtf32(0x1F44D))Gonta
Does this answer your question? How do I encode Unicode character codes in a PowerShell string literal?Gonta
@Gonta Are you kidding me? I quoted that link in the first four words of my question. Ten years ago.Beckner
@ChuckHeatherly did you read that question again? It has answers for UTF-32Gonta
Yeah it didn't have those comments 10 years ago.Beckner

Assuming PowerShell uses UTF-16, code points above U+FFFF are represented as surrogate pairs. For example, U+1D100 is represented as:

0xD834 0xDD00

That is, two 16-bit chars; hex D834 and DD00.
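
The mapping is mechanical; a sketch of the encoding formula (assuming PowerShell 3+ for the bitwise operators):

$cp = 0x1D100                     # any code point above U+FFFF
$v  = $cp - 0x10000               # 20-bit offset: 0xD100
$hi = 0xD800 + ($v -shr 10)       # top 10 bits    -> 0xD834
$lo = 0xDC00 + ($v -band 0x3FF)   # bottom 10 bits -> 0xDD00
'{0:X4} {1:X4}' -f $hi, $lo       # D834 DD00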

Good luck finding a font with surrogate chars.

Lannie answered 29/1, 2011 at 1:35 Comment(3)
How did you translate U+1D100 to those two surrogates? I see where the first one (D100) came from, but how did you come up with DC00 for the second?Beckner
Apologies, it was a typo. Fixed. The formula is given in the Wikipedia linkLannie
OK, I found a link to i18nguy.com/unicode/surrogatetable.html, which shows how to look up the high and low surrogates, given the 32-bit value you want to encode. So the value 1D11E is given by the surrogate pair D834 DD1E. And so I would cast each of those 16-bit values to a System.Char and then put both into a string variable. Thanks for helping me understand surrogate pairs finally!Beckner

IMHO, the most elegant way to use Unicode literals in PowerShell is

[char]::ConvertFromUtf32(0x1D11E)

See my blogpost for more details
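
A quick round-trip check (ConvertToUtf32 reads the surrogate pair back out of the string):

$s = [char]::ConvertFromUtf32(0x1D11E)    # U+1D11E MUSICAL SYMBOL G CLEF
$s.Length                                 # 2: stored as a surrogate pair
'{0:X}' -f [char]::ConvertToUtf32($s, 0)  # 1D11E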

Shererd answered 13/6, 2014 at 14:43 Comment(5)
U+1F4A9 is much more relevant, satisfying and fitting example. Especially for testing PowerShell.Greenes
So there's no way to turn a string of emojis into a character array? '😀😁😂😃😄😅😆' 1f600-1f606Paff
@js2010, I'm not strong in PowerShell frankly... There is literally .ToCharArray() method on strings — but it outputs garbage. You'd have to somehow work around those surrogate pairs shenanigans which Microsoft loves so much. Maybe try UTF-8 and a less broken Unicode library?..Greenes
@Greenes I worked out a way with utf32 but it seemed like a lot of work #62392165Paff
@Greenes I posted a way to do it with a string of emojis here: stackoverflow.com/a/77622888Ampersand
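
For the emoji-splitting question in the comments above, PowerShell 7 (.NET Core 3+) can enumerate whole code points via System.Text.Rune; a sketch, assuming PowerShell 7+:

$s = '😀😁😂'
$s.EnumerateRunes() | ForEach-Object { '{0:X}' -f $_.Value }   # 1F600, 1F601, 1F602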

FYI: If anyone wants to store surrogate pairs in a Case Sensitive HashTable, this seems to work:

$NCRs = new-object System.Collections.Hashtable               # unlike @{}, a raw Hashtable is case-sensitive by default
$NCRs['Yopf'] = [string]::new(([char]0xD835, [char]0xDD50))   # U+1D550 MATHEMATICAL DOUBLE-STRUCK CAPITAL Y
$NCRs['yopf'] = [string]::new(([char]0xD835, [char]0xDD6A))   # U+1D56A MATHEMATICAL DOUBLE-STRUCK SMALL Y
$NCRs['Yopf']
$NCRs['yopf']

Outputs:

𝕐
𝕪
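
The same entries can be built from code points instead of hand-computed surrogates; a variant using the ConvertFromUtf32 helper from the answer above:

$NCRs['Yopf'] = [char]::ConvertFromUtf32(0x1D550)   # same string as [char]0xD835 + [char]0xDD50
$NCRs['yopf'] = [char]::ConvertFromUtf32(0x1D56A)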
Jarvisjary answered 12/12, 2022 at 23:55 Comment(0)
