How to encode 32-bit Unicode characters in a PowerShell string literal?

This Stack Overflow question deals with 16-bit Unicode characters. I would like a similar solution that supports 32-bit characters. See this link for a listing of the various Unicode charts. For example, one range of 32-bit characters is the Musical Symbols block.

The answer in the question linked above doesn't work because it casts the System.Int32 value as a System.Char, which is a 16-bit type.
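
A minimal repro of the failure (the cast is range-checked, so anything above 0xFFFF throws):

[char]0x2122    # fine: U+2122 fits in a single UTF-16 code unit
[char]0x1D11E   # fails: 119070 is too large for System.Char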

Edit: Let me clarify that I don't particularly care about displaying the 32-bit Unicode character, I just want to store the character in a string variable.

Edit #2: I wrote a PowerShell snippet that uses the info in the marked answer and its comments. I would have put this in another comment, but comments can't be multi-line.

$inputValue = '1D11E'
$hexValue = [int]"0x$inputValue" - 0x10000                       # offset above the BMP (20 bits)
$highSurrogate = [int][math]::Floor($hexValue / 0x400) + 0xD800  # top 10 bits; Floor avoids [int]'s round-half-to-even
$lowSurrogate = $hexValue % 0x400 + 0xDC00                       # bottom 10 bits
$stringValue = [char]$highSurrogate + [char]$lowSurrogate        # char + char concatenates into a string
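
For U+1D11E this produces the same two code units as the built-in .NET helper, which makes a handy sanity check:

$stringValue -eq [char]::ConvertFromUtf32(0x1D11E)               # True
'{0:X4} {1:X4}' -f [int]$stringValue[0], [int]$stringValue[1]    # D834 DD1E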

Dour High Arch still deserves credit for the answer for helping me finally understand surrogate pairs.

Beckner answered 29/1, 2011 at 0:28 Comment(8)
Technically, there are no 32-bit Unicode code points, as Unicode is only a 21-bit code.Jarib
Seems nitpicky. Obviously U+1D11E doesn't use ALL 32 bits, but it is greater than 16 bits, thus why the question needed to be asked (since the linked question's answer only works for 16 bits). PowerShell and .NET have Int16 and Int32 types, is there one named Int21? Thus 32 is the next logical increment.Beckner
That ain't nitpicking. You're using Unicode terminology incorrectly, and being corrected. Unicode doesn't define characters as having the bitness property. It's incorrect to talk about "32-bit characters" or "16-bit characters", since Unicode defines neither concept. Character is an abstract writing symbol with various properties (like is it upper or lower case, is it RTL or LTR, &c). With how many bits it is encoded depends on the particular encoding used to encode the character into bytes. E.g. â is encoded to C3 A2 in UTF-8, and to E2 in ISO-8859-1 (aka Latin-1).Greenes
the other question already has updated answers for the full Unicode range. Try echo "`u{1F44D}" or echo ([char]::ConvertFromUtf32(0x1F44D))Gonta
Does this answer your question? How do I encode Unicode character codes in a PowerShell string literal?Gonta
@Gonta Are you kidding me? I quoted that link in the first four words of my question. Ten years ago.Beckner
@ChuckHeatherly did you read that question again? It has answers for UTF-32Gonta
Yeah it didn't have those comments 10 years ago.Beckner

Assuming PowerShell uses UTF-16, code points above U+FFFF are represented as surrogate pairs. For example, U+1D100 is represented as:

0xD834 0xDD00

That is, two 16-bit chars; hex D834 and DD00.
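
The mapping is mechanical; a sketch of the encoding formula (assuming PowerShell 3+ for the bitwise operators):

$cp = 0x1D100                     # any code point above U+FFFF
$v  = $cp - 0x10000               # 20-bit offset: 0xD100
$hi = 0xD800 + ($v -shr 10)       # top 10 bits    -> 0xD834
$lo = 0xDC00 + ($v -band 0x3FF)   # bottom 10 bits -> 0xDD00
'{0:X4} {1:X4}' -f $hi, $lo       # D834 DD00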

Good luck finding a font with surrogate chars.

Lannie answered 29/1, 2011 at 1:35 Comment(3)
How did you translate U+1D100 to those two surrogates? I see where the first one (D100) came from, but how did you come up with DC00 for the second?Beckner
Apologies, it was a typo. Fixed. The formula is given in the Wikipedia linkLannie
OK, I found a link to i18nguy.com/unicode/surrogatetable.html, which shows how to look up the high and low surrogates, given the 32-bit value you want to encode. So the value 1D11E is given by the surrogate pair D834 DD1E. And so I would cast each of those 16-bit values to a System.Char and then put both into a string variable. Thanks for helping me understand surrogate pairs finally!Beckner

IMHO, the most elegant way to use Unicode literals in PowerShell is

[char]::ConvertFromUtf32(0x1D11E)

See my blogpost for more details
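
A quick round-trip check (ConvertToUtf32 reads the surrogate pair back out of the string):

$s = [char]::ConvertFromUtf32(0x1D11E)    # U+1D11E MUSICAL SYMBOL G CLEF
$s.Length                                 # 2: stored as a surrogate pair
'{0:X}' -f [char]::ConvertToUtf32($s, 0)  # 1D11E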

Shererd answered 13/6, 2014 at 14:43 Comment(5)
U+1F4A9 is much more relevant, satisfying and fitting example. Especially for testing PowerShell.Greenes
So there's no way to turn a string of emojis into a character array? '😀😁😂😃😄😅😆' 1f600-1f606Paff
@js2010, I'm not strong in PowerShell frankly... There is literally .ToCharArray() method on strings — but it outputs garbage. You'd have to somehow work around those surrogate pairs shenanigans which Microsoft loves so much. Maybe try UTF-8 and a less broken Unicode library?..Greenes
@Greenes I worked out a way with utf32 but it seemed like a lot of work #62392165Paff
@Greenes I posted a way to do it with a string of emojis here: stackoverflow.com/a/77622888Ampersand
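
For the emoji-splitting question in the comments above, PowerShell 7 (.NET Core 3+) can enumerate whole code points via System.Text.Rune; a sketch, assuming PowerShell 7+:

$s = '😀😁😂'
$s.EnumerateRunes() | ForEach-Object { '{0:X}' -f $_.Value }   # 1F600, 1F601, 1F602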

FYI: If anyone wants to store surrogate pairs in a Case Sensitive HashTable, this seems to work:

$NCRs = new-object System.Collections.Hashtable               # unlike @{}, a raw Hashtable is case-sensitive by default
$NCRs['Yopf'] = [string]::new(([char]0xD835, [char]0xDD50))   # U+1D550 MATHEMATICAL DOUBLE-STRUCK CAPITAL Y
$NCRs['yopf'] = [string]::new(([char]0xD835, [char]0xDD6A))   # U+1D56A MATHEMATICAL DOUBLE-STRUCK SMALL Y
$NCRs['Yopf']
$NCRs['yopf']

Outputs:

𝕐
𝕪
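
The same entries can be built from code points instead of hand-computed surrogates; a variant using the ConvertFromUtf32 helper from the answer above:

$NCRs['Yopf'] = [char]::ConvertFromUtf32(0x1D550)   # same string as [char]0xD835 + [char]0xDD50
$NCRs['yopf'] = [char]::ConvertFromUtf32(0x1D56A)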
Jarvisjary answered 12/12, 2022 at 23:55 Comment(0)
