Perl ord and chr working with unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:

Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.

Indeed I can print a smiley using

perl -e 'print chr(0x263a)'

but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.

I've tried funny things like use utf8 and use encoding 'utf8'. I haven't tried funny things like use v5.12 and use feature 'unicode_strings', as they don't work with my version. I was fooling around with Encode::decode, only to find out that I need no decoding, as I have no byte array to decode. I've read much more documentation than ever before and found quite a few interesting things, but nothing helpful. It looks like a case of the Unicode Bug, but no usable solution is given there. Moreover, I don't care about the whole string semantics; all I need is a trivial function.

So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?


The first answer I've got explains pretty much everything about IO, but I still don't understand why

#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';

print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";

print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";

prints

ne1 - eq1
match1 - no_match2

It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).

Retire answered 5/9, 2012 at 23:48 Comment(5)
@D.Shawley: Linux 2.6.32-42-generic, x86_64 GNU/Linux, Ubuntu 10.4, so utf8 is nativeRetire
The UTF8 octet sequence for Á is C3 81; C1 is the ISO-8859-1 code point. My Perl-fu is a little weak or I would propose an answer.Chanda
Some of the docs are weak in this regard, but the UTF-8 implementation, even back in Perl 5.10.1, is pretty strong. I'd recommend reading perlunitut and perluniintro before you go too far in working with Unicode in Perl. In your case, chr is not the problem; it's that you are not encoding and decoding your strings for UTF-8. If you're going to output UTF-8 (or any other encoding), your strings of characters need to be converted to octets on the way out first.Moral
@ikegami: Removing it changes the output to eq1 - eq1; match1 - no_match2. So I have two equal strings with just one of them matching.Retire
Unicode regex support in anything prior to Perl 5.14 is broken. In Perl 5.14, your second regular expression is fixed, without use encoding, by appending the /u modifier. See Character set modifiers in perlreMoral

First,

perl -le'print chr(0x263A);'

is buggy. Perl even tells you as much:

Wide character in print at -e line 1.

That doesn't qualify as "working". So while they differ in how they fail to provide what you want, neither of the following gives you what you want:

perl -le'print chr(0x263A);'

perl -le'print chr(0x00C0);'

To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the code points using UTF-8.

$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺

$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À

Now on to the "why".

File handles can only transmit bytes, so unless you tell it otherwise, Perl file handles expect bytes. That means the string you provide to print cannot contain anything but bytes; in other words, it cannot contain characters above 255. The output is exactly what you provide:

$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004

This is useful. It is different from what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.

By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.

$ perl -e'
   use open ":std", ":encoding(UTF-8)";
   print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011

You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.

A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
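
As a minimal sketch of that view, the following builds a string from numbers and reads the numbers back; nothing here is encoding-specific:

use strict;
use warnings;

# Build a string from numbers and read the numbers back; only the
# numeric (code point) view of the string is used here.
my $s    = join '', map chr, 0x48, 0xE4, 0x263A;
my @nums = map ord, split //, $s;
printf "U+%04X\n", $_ for @nums;    # U+0048, U+00E4, U+263A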

Here are a few examples of operators that do assign meaning to the strings they receive as operands (a manual Encode-based sketch follows this list):

  • m// expects a string of Unicode code points.
  • connect expects a sequence of bytes that represent a sockaddr_in structure.
  • print with a handle without :encoding expects a sequence of bytes.
  • print with a handle with :encoding expects a sequence of Unicode code points.
  • etc
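
As a sketch of the last two items, the same text-to-bytes conversion can also be done by hand with the core Encode module instead of an :encoding layer on the handle (the comments below mention the same idea):

use strict;
use warnings;
use Encode qw( encode );

my $text  = chr(0x00C0) . chr(0x263A);   # characters (Unicode code points)
my $bytes = encode('UTF-8', $text);      # octets: C3 80 E2 98 BA
binmode STDOUT;                          # plain byte-oriented handle
print $bytes, "\n";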

So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?

chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
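
A small sketch of that two-character effect; "\xC3\x80" below spells out the two octets an editor typically saves for 'À' in a UTF-8 encoded source file:

use strict;
use warnings;

my $literal = "\xC3\x80";                         # what Perl sees for 'À' without use utf8;
print length($literal), "\n";                     # 2
print $literal eq chr(0xC0) ? "eq" : "ne", "\n";  # ne -- the "ne1" above
# With "use utf8;" the literal 'À' would be decoded to a single
# character and the comparison would print "eq".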


Regarding the question you've added:

There are problems with the encoding pragma. I recommend against using it. Instead, use

use open ':std', ':encoding(UTF-8)';

That'll fix one of the problems. The other problem you are encountering is with

chr(0x00C0) =~ /\w/

It's a known bug that's intentionally left unfixed for backwards-compatibility reasons. That is, unless you request a more recent version of the language as follows:

use 5.014;    # use 5.012; *might* suffice.

A workaround that works as far back as 5.8:

my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/
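
A short self-contained run of the workaround (a sketch; on perls where the bug is fixed, e.g. under use 5.014, the first line may already print "match"):

use strict;
use warnings;

my $x = chr(0x00C0);
print $x =~ /\w/ ? "match" : "no match", "\n";   # "no match" on affected perls
utf8::upgrade($x);                               # switch to the UTF-8 (text) representation
print $x =~ /\w/ ? "match" : "no match", "\n";   # "match"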
Dybbuk answered 6/9, 2012 at 0:15 Comment(5)
If you want a more manual solution, you can also use Encode qw( encode ); print encode(chr(0x00c0)) which does pretty much the same thing as ":encoding(UTF-8)" above without an I/O handle.Moral
@zostay, That should be encode('UTF-8', chr(0x00c0)) or encode_utf8(chr(0x00c0)), but yeah.Dybbuk
@ikegami: You're surely right about the IO, but I don't care about IO, as it was just a debugging output. I'm not gonna output anything but ASCII, that's why I ignored the warning. I still don't understand what's going on, see my edit.Retire
@maaartinus, Added answer to added question. Note that 5.14 is the oldest supported version of Perl.Dybbuk
@ikegami: That's it, with my version I can't do use 5.014, but utf8::upgrade helped!Retire

To my horror I've just found out that chr doesn't work with Unicode, although it does "something"

'chr' and 'ord' are documented in man perlfunc, with important related entities in man perlop. Let us get a little bit into a not-so-little mystery and say (*13): being happy before the end is hard, but maybe there is something interesting to tell first. Maybe the problem is pressure coming from outside of Perl.

My dream is to figure out the history and motivations of important Perl behaviour.

--- ABBREVIATIONS ---

eaka  :=  erroneously aka  :=  erroneously also known as 

--- THE POST ---

As of April 2024 GMT, the character encoding subsystems in Perl, GNU Emacs and other systems cannot be understood without reference to the history and current state of character encodings.

Does somebody think that this history is boring? It is not: for example, as of April 2024 there is an urban legend saying that cp1252 is the same as the latin-1 char encoding (*3) (*19).

Even worse: as of April 2024 GMT another urban legend says that UTF-8 and Unicode are the same thing.

As a result, when in GNU Emacs people type the command

C-x =      , aka      M-x what-cursor-position

and, as is right and just, the Unicode number X associated with the character C at point is displayed, some people are at risk of thinking that X is the UTF-8 encoding of the character C.

By the way, if X is expressed in hexadecimal form, then in GNU Emacs you can insert the character with Unicode number X with: C-q X , aka M-x quoted-insert X

Maybe this urban legend is also one of the reasons why, in Perl programs and in the official Perl documentation, by explicit and dramatic decision of the Perl designers (or the Perl documentation designers, probably a tough profession, at least as of ~ year 2005 GMT), we have, so to speak: "UTF-8" is not UTF-8, it is only a kind of 'UTF-8 or a competitor of it' (*1). In the present writing we will need to quote or refer to important texts written under this dramatic change of meaning but, striving for detachment, we will spend energy on keeping that change out of our own writing.




THE MISSION: given a non-negative integer number X, we want to try to

  1. transform X into something that exists also outside of our mind, i.e. (*13) into a finite string of byte values which can be turned into a character or symbol in some widely shared way (example: stdout to a virtual terminal on our computer (a PC or a Mac, etc.) under GNU/Linux);

  2. do all of this under a preliminary lookup into a version of: Unicode.

In this context, X is called a 'CODE POINT'.


The Perl programs

print "\x{41} " ; 

print chr 0x41 ;

do run, but (*13) they are not suited for our mission, since they explicitly risk not triggering any invocation of US-ASCII (and thus not of Unicode either), while the programs

print "\N{U+41} " ; 

print chr utf8::unicode_to_native 0x41 ; 

are theoretically good answers; but if we replace the string '41' with an innocent 'e4', then (*13) the resulting perl programs may not be assertive enough to have a flawless meaning, or at least to run flawlessly, in every reasonable context.



First of all, let us contradict an urban legend by recalling that Unicode and UTF-8 (aka utf-8, aka utf8, eaka Unicode) are different things. For example:

Numeric range: X <= dec 127 / Example: the code point X for 'A' in US-ASCII, in ISO-8859-1 (aka Latin1 or latin-1), and in Unicode is decimal 65, aka hex 41.

Numeric range: dec 128 <= X <= dec 255 / Example: the code point X for 'ä' in latin-1 and in Unicode is decimal 228, aka hex E4.

Numeric range: X >= dec 256 / Example: the code point X for '쎤' in Unicode, or at least in some relative improvement of Unicode, is decimal 50084, aka hex C3A4.

US-ASCII is virtually extended by ISO-8859-1 aka latin-1 , which in turn is virtually extended by Unicode. Thus, virtually we can think of Unicode as an extension of latin-1 (and hence also of US-ASCII).

On the contrary: the UTF-8 code for 'ä' is not the above 228; it is the byte value sequence decimal 195 164. Two byte values which have no numerically interesting relationship with ä's code point dec 228 above. UTF-8 is different from, and heavily incompatible with, latin-1 (for example, latin-1 text generally looks funny on UTF-8 terminals), and thus UTF-8 is different from Unicode.
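
A small check of this claim, using the core Encode module (a sketch; 'ä' is built with chr so the source stays pure US-ASCII):

use strict;
use warnings;
use Encode qw( encode );

my $ch    = chr(0xE4);                    # one character, Unicode/latin-1 code point 228
my $bytes = encode('UTF-8', $ch);         # its UTF-8 encoding
printf "code point  : %d\n", ord $ch;     # 228
printf "UTF-8 octets: %vd\n", $bytes;     # 195.164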




As for the above

 >   Now on to the "why".

, coming from year 2012 GMT, a long immersion in Perl's way of digital life and in its history may suggest interesting hypotheses. Let us start with the following Perl PSEUDO-program:


$foo1 = X ; print chr $foo1 ; # line 100

( this is pseudo-Perl, but it becomes real Perl if you replace X with a non-negative integer constant valid in Perl, for example 0xE4, which stands for decimal 228 )

To figure out how perl will behave in this case, solving a crucial interpretation problem may be helpful. Unfortunately, the 'practical, concrete translation to be run by hardware' of the strings

"Perl interprets this in the platform's native encoding"

"so the number is interpreted in the native character set encoding" 

, found in the awesome words of 'man perlop', subsection [8], in NOTE (*10), seems to be dramatically important, and it is not very clear to me. These documentation lines refer to a non-negative integer number less than or equal to dec 255, called X in the present writing. My hypothesis is that (*13) these strings probably mean 'perl acts binary, without thinking about text'; that is, perl simply translates the number X into byte value X. In important cases, despite the abbreviation 'chr' in 'chr X', and despite the alphabet-oriented flavour of double-quotish constructs like "\x{e4}" and "\o{344}" and "\344" in Perl, perl (except in the case of \N{...}!) simply falls back to the most immediate and universal "hardware portrait of X" available in the whole history of electronic information technology: packing X as one single byte value (*5).


Also, inspired by this hypothesis, I SUPPOSE that in some cases (*13), when running line 100 above, perl is going to follow something similar to the following interesting policy (a small demonstration follows below):

-- IF X >= decimal 256 (as of the years ~2005 GMT this was the new (simple) case (*13)): do lots of things, i.e. look up entity number X in whatever relatively improved Unicode is currently adopted by the given computer system and send to stdout the encoding (*9) of that entity (*12).

-- IF X <= decimal 255 : (*13) BRUTALLY SEND X TO STDOUT, absolutely unchanged and (*13) packed as one single byte value (*5).

As of the years ~2005 GMT, (*13) this was the difficult case, due also to the huge amount of pre-existing computers deeply relying on legacy perl subsystems. If X <= decimal 255, then (*13) the encoding problem may in any case be a can of worms (but of possibly VALUABLE, SPECIFIC worms, don't throw them away!), and the past specific human perl programmer was probably more aware than perl of the important implications for the specific context in which the given Perl code was and is run. So (*13) the rule of thumb reads: do what most current users of unmaintained Perl programs would like, i.e. do almost nothing; more precisely, be as simple and humble as possible by 1. merely passing X to stdout and 2. doing that as smoothly and simply as possible, that is, by sending 'byte value X' (*5).

This takes advantage of the historical fact that, as of more or less the years ~2005 GMT, many programmers and computer subsystems expected to receive only integer numbers 0 <= X <= dec 255 via this ultra-elementary encoding, maybe even more fundamental and elementary than US-ASCII: primitive, but valid, absolute and reliable no matter which concrete OS lies underneath (*13).

Maybe (but sometimes not) with a layer of textual meaning waiting for perl's stdout (and the general legacy perl instance not willing to get into that layer without a good reason), a textual meaning often (but not always) created by triggering the forces of "old" ("native" or "legacy", mostly truly-eight-bit (*13)) character encodings -- such as US-ASCII, or ISO-8859-1 to ISO-8859-16, or EBCDIC, or latin-1, or cp1252 eaka latin-1, or Mac OS Roman, etc. (*13) -- known to the given specific perl programmer but often not formally declared to the running perl instance; and with OSs like GNU-linux often thinking of "liquid" streams of raw byte values and often not willing to deal with that layer, at least not in that era.


Part of this for the sake of Perl's backwards compatibility.
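
A minimal demonstration of the two branches (a sketch, assuming an ASCII platform and no :encoding layer on STDOUT; the "Wide character" warning goes to stderr):

$ perl -we 'print chr(0xE4), chr(0x263A)' | od -t x1
Wide character in print at -e line 1.
0000000 e4 e2 98 ba
0000004

chr(0xE4) goes out as the single, untouched byte E4, while chr(0x263A) cannot fit into one byte and is sent as its UTF-8 encoding E2 98 BA, together with a warning.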

That said, let us discuss some little Perl programs for accomplishing the above mission, all written in pure US-ASCII (thus the 'use utf8 ;' pragma is not necessary and thus not used).


First of all, it is natural to also try Perl's string constructors.

In this case there is an important difference involving the "\N{U+...}" syntax to keep in mind; let us look at it with the numeric string '41' as an example. The "\N{U+41}" syntax, and more generally any Perl code region with Perl's "double-quotish" creation of strings, such as s/...\N{U+41}.../.../gx and the like, means (*13) two things:

A. we must think of hex 41, and not of dec 41, nor of oct 41, etc.

B. what comes out of the hexadecimal interpretation in step A is understood as Unicode entity no. X; hence, depending on the host system, a translation step from Unicode into some competitor of it (for example via utf8::unicode_to_native) is sometimes necessary. Otherwise no translation occurs.

(By the way, a subsequent additional step C of translation into some multi-byte char encoding, such as a version of UTF-8 or maybe something like UTF-EBCDIC, may run here, but this is a separate issue.)

All of this as opposed to "\x{41}", where by specification step A applies but step B does NOT. Since B is absent, it is the responsibility of the programmer to ensure that X is correct for the further processing (possibly 1. including the above step C; or 2. skipping it because a "legacy", "native" encoding such as a truly-8-bit EBCDIC or latin-1 or Mac Roman or ISO-8859-5 is invoked instead; or NOT EVEN that, but rather opening to individual or collective "madness" by 3. simply allowing "running" binary, non-textual interpretations of X, even if eccentrically invoked with \x{...} (or, even worse, with \N{...} ?!?) or so). As a result, under a change of context "\x{41}" is less stable than the relatively portable term "\N{U+...}", so sometimes we may prefer to translate \x{41} into a more portable \N{U+X} (with a suitable Unicode code point X, possibly obtained by translating hex 41 and thus possibly different from hex 41).
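
A sketch of the difference, assuming an ASCII/Latin-1 platform and a reasonably recent perl (on such a platform the two forms coincide; on EBCDIC they would differ):

use strict;
use warnings;

my $native  = "\x{41}";                            # code point 0x41 in the native character set
my $unicode = "\N{U+41}";                          # Unicode code point U+0041, i.e. 'A'
my $mapped  = chr utf8::unicode_to_native(0x41);   # explicit Unicode-to-native mapping

print $native eq $unicode ? "same" : "different", "\n";   # "same" on ASCII platforms only
print $mapped eq $unicode ? "same" : "different", "\n";   # "same" on ASCII and EBCDIC platforms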


That said, let us try to print our dear 'char hex 41' to terminal with the following five perl programs :

print "\N{U+41} "  ; # right  
print "\N{41} "    ; # wrong , this doesn't run 
print "\N{65} "    ; # wrong , this doesn't run 
print "\N{0x41} "  ; # wrong , this doesn't run 
print "\x{41} "    ; # NOT WRONG, per se. But perl could feel uneasy, will s/he understand me for the ethernity ? 

Let us now see what some Unicode faith embedded in our system thinks about code point hex C3A4, by running, at our utf8 eaka Unicode virtual terminal, the Perl program

print "\N{U+c3a4} "  ;

, we get:

$ perl -w a.pl 
Wide character in print at a.pl line 6.
쎤

So far so good, although Perl seems to be concerned ("Wide character in print"... )




Running

print "\N{U+e4} "  ;

, we get:

$ perl -w a.pl 
�

Thus we probably failed, because hex E4 should be the Unicode code point for the letter 'ä'. This time Perl is not sorry, but our Unicode virtual terminal (maybe the wrong name; maybe I should rather say: our utf-8 virtual terminal) feels confused and displays '�'. A single dec 65 byte value would be valid utf-8, but our single dec 228 byte value isn't. We get � because the 'literal compatibility' of utf-8 with latin-1 (and thus with Unicode) breaks for X > dec 127.

After meditation we might feel that here we are experiencing the above interesting policy, because it turns out that in this case the virtual terminal receives exactly one byte value, dec 228, and no Unicode -> utf-8 conversion steps are run by perl. But with further long meditation we can instead imagine that, due to the \N{U+...} above, that policy probably hasn't really been run here; due to the \N{U+...} invocation, perl here 'thinks textual, not binary', but -- just this time, next time who knows -- perl happens to decide to nevertheless do almost nothing, which in turn happens to 'look binary'. Here too perl happens to send to stdout exactly what the above (more ambiguous) program "print chr X;" with X = dec 228 would -- probably for rather incidental reasons and circumstances that could change in the future.

Things can look a bit unpredictable also without artificial intelligence.




This, or something like it, may be part of Perl's answer to the big problems arising from the growth (a growth possibly involving a certain amount of mess and turbulence) towards a new Unicode order. Above we wanted to be fairly perl-aware and thus said "\N{U+e4}", to say: 1. we mean hex, and 2. we refer to some Unicode or relatively improved Unicode, not to other multi-byte or "native"/"legacy" encodings, nor to binary uses (*5) of X = hex E4.

So maybe, theoretically, we have the right to see 'ä' at our UTF-8 eaka Unicode terminal without writing further, possibly cumbersome, directives. Perl is definitely aware of this, but maybe s/he has some problems:

  1. even if Perl thinks that we are aware and up to date about the very precise and textual meaning of "\N{U+...}" in Perl, and that we thus REALLY want to see ä, Perl doesn't know whether the Unicode char no. dec 228 will be delivered to a destination using some kind of UTF-8; for example, it could instead be a virtual terminal set to latin-1, a frequent case around year 2005 GMT, and in that latin-1 case the encoding of our beloved ä, aka 'Unicode character no. dec 228', would simply not be a UTF-8 bunch of up to 6 byte values but one single byte value, 228; yet this byte value can be an unknown or difficult event for virtual terminals set to some version of UTF-8, and this is probably why "\N{U+e4}" triggered a � at the virtual terminal above. The latin-1 encoding is very close to the Unicode planet; the UTF-8 cloud of encodings is less close.

  2. A less likely circumstance, yet not impossible: we could be a "legacy programmer" not aware of the textual meaning of \N{...}, or in any case a distressed or disgruntled (or simply eccentric) programmer who, in "latin-1 times", loosely wrote "\N{U+e4}" (an even riskier expression would be simply "ä") just to send byte value hex E4 to output for non-textual purposes, not because s/he wanted a letter ä displayed somewhere and hex E4 is the code for ä under latin-1 and under Unicode. Should this have been the case, then even the safe-looking "\N{U+e4}" would have been unsafe, because later maintenance could, for example, add 'use open ":std", ":encoding(UTF-8)";', turning the output of "\N{U+e4}" into a new byte value sequence encoding ä according to some UTF-EBCDIC or some UTF-8 or so, instead of simply into 'byte value hex e4'.


And GNU-linux with a Perl subsystem at its core is a de facto industry standard; thus, should the livelihood of dozens of people depend on the absence of improvements which could trigger fatal changes in the output of Perl code pre-dating a Perl evolution, then protecting human society from "earthquakes" triggered by "evolution" probably takes precedence over our above right.


This does not mean that perl will ignore our request to see ä; it just means that, due to the burden of the past, we must be more explicit and more assertive, for example by being more precise. Usually perl tries to guess the details (for example when we 'use utf8;' or 'use open ":std", ":encoding(UTF-8)";', or run line 11 below), but sometimes -- depending also on how many hints we wrote into our Perl program, and due also to the forces of the "viscosity of history" -- Perl could feel forced to run an interpretation different from what we reasonably expect of it.


So maybe in all of the above it is not only perl that runs; maybe what also runs is an old pattern in history and sociology which, in our context, reads more or less: sometimes in life, no matter what perl and you decide, you are always wrong. It's like the San Andreas Fault: an important piece pushes in one direction, another important piece in another.



My hypothesis is that the above interesting policy applies not only to perl commands explicitly invoking chr, such as "print chr X;", but also to X in Perl's double-quotish string constructors (except \N{...}!) listed in NOTE (*20); see also the awesome words of 'man perlop', subsection [8], in NOTE (*10). In other words, we are talking about "hex expressions" such as \x1b or \x{e4}, and "octal expressions" such as \377 aka \o{377} -- but not about \N{...}.




Thanks to maaartinus for brilliantly pointing out a serious problem and to ikegami for bringing an interesting solution, to which I would try to add more portability (for example EBCDIC portability) with the following example using the Perl function utf8::unicode_to_native :

use open ":std", ":encoding(UTF-8)" ; # line 0 / This (*13) is intended also to keep lines 1,4 below from sending to stdout the encoding of ä in  latin-1 or so 


print chr utf8::unicode_to_native 0xe4 ; # line 1 / Thanks to line 0, here (*13) we try to tell perl to send to stdout the sequence of bytes which stands for ä , for example in (*1) UTF-EBCDIC, or UTF-8 eaka Unicode , instead of latin-1 or some other "native" encoding (maybe cp1252 eaka latin-1, or Mac Roman ?) 


print ' ' ; # REM / As an example, the above line 1 happened to send to stdout bytes dec 195 164 , which sound pretty "UTF-8"ish and happened to draw my desired 'ä' at my advanced virtual terminal , finally and after long struggling ! 


print      chr                         0x41   ; # line 1.5 / due to line0, this is going to encode char no. hex 41 of some version of UTF-EBCDIC ; or of a UTF-8 eaka Unicode , in which case it should print 'A' 


print      chr utf8::unicode_to_native 0xe4   ; # line 2 / with or without the line 0 correction here we think to: char Unicode U+e4  i.e char 'ä', no US-ASCII available. As opposed to line 1.5 . Nevertheless , if we omit line 0 , then Perl could stdout the latin-1 (not A UTF ) encoding of ä 


print      chr utf8::unicode_to_native 0xc3a4 ; # line 3 / Anyway Unicode U+C3A4; and , due to the big code point X = hex C3A4 >= dec 256 , even without line 0 above Perl will skip "native"s and pass to some complicated UTF-EBCDIC  or a UTF-8 or so 


my $foo2 = chr utf8::unicode_to_native 0xe4   ; # This line is similar to line 2 above, but without the print. Here $foo2 holds a byte value sequence which is a perl internal representation; with line 4 we hope to convert it to the code of ä in latin-1 or in some UTF or so and send it to stdout 

print $foo2 ; # line 4 

Remember that if, in lines 1 and 4 as well, you really want the stdout to be determined by an initial interpretation stage based on some relative improvement of some kind of Unicode (as opposed to EBCDIC or some "native" char encoding culture), then the utf8::unicode_to_native function cannot simply be omitted.


The above pragma

use open ":std", ":encoding(UTF-8)" ;

also solves our above problem with

print "\N{U+e4} "  ; 

, whose output, under our (UTF-8 eaka Unicode) virtual terminal, now resolves into drawing the symbol "ä" on the screen; moreover, the earlier "Wide character in print" error messages are now absent as well.


By the way, the non-portable program

print "\x{e4} " ; 

seems to be similar to

print chr 0xe4 ;
print chr  228 ; # same meaning 

On the contrary, the

print "\N{U+e4} " ; # line 10 

seems to be similar to:

print chr utf8::unicode_to_native 0xe4 ; # line 11 
print chr utf8::unicode_to_native  228 ; # same number as in line 11 
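
A quick check of these claimed similarities (a sketch, on an ASCII/Latin-1 platform):

use strict;
use warnings;

print "\x{e4}"   eq chr(0xe4)                          ? "eq" : "ne", "\n";   # eq
print "\N{U+e4}" eq chr(utf8::unicode_to_native(0xe4)) ? "eq" : "ne", "\n";   # eq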

perl v5.38.2 , April 2024 GMT





NOTES

NOTE (*1): We saw that Unicode is not UTF-8. The time has come to make another dramatic discovery: on the Perl planet, sometimes "UTF-8" is not UTF-8. More precisely, despite the 'utf8' or 'UTF-8' in crucial lines like

print chr utf8::unicode_to_native 0x41 ; 

use open OUT => ':utf8' ;

use open ":std", ":encoding(UTF-8)" ; 

use utf8 ;

or in other Perl code pearls -- and depending on the actual computer system running our program -- even if we literally write utf8 or UTF-8 in our Perl program, perl may have to adopt (and hence SILENTLY, adaptively adopt :-) a complex encoding which is alien to the UTF-8 clan. More precisely (*13), if we write utf8 or so, perl could adopt a UTF-EBCDIC encoding from the IBM world rather than a UTF-8 eaka Unicode.

Similar warnings apply to the strings 'utf8' or 'UTF-8' in the official Perl documentation (although not everywhere); for example, and more precisely:

*! The 'UTF-8' section of the 'perlunicode' manpage (*13) reads also : << UTF-8 is a variable-length (1 to 4 bytes), byte-order independent encoding. In most of Perl's documentation, including elsewhere in this document, the term "UTF-8" means also "UTF-EBCDIC". But in this section, "UTF-8" refers only to the encoding used on ASCII platforms. It is a superset of 7-bit US-ASCII, so anything encoded in ASCII has the identical representation when encoded in UTF-8. >>

*! The 'perlunicode' manpage (*13) reads also : << Unless ASCII vs. EBCDIC issues are specifically being discussed, references to UTF-8 encoding in this document and elsewhere should be read as meaning UTF-EBCDIC on EBCDIC platforms. See "Unicode and UTF" in perlebcdic. Because UTF-EBCDIC is so similar to UTF-8, the differences are mostly hidden from you; "use utf8" (and NOT something like "use utfebcdic") declares the script is in the platform's "native" 8-bit encoding of Unicode. (Similarly for the ":utf8" layer.) >>

*! The 'perluniintro' manpage (*13) reads also : << Often, documentation will use the term "UTF-8" to mean UTF-EBCDIC as well. This is the case in this document. >>


NOTE (*3): this confusion is a powerful standard way of crafting funny random lumps of symbols on a computer screen or on paper, by toying around, for example with Perl or GNU Emacs.


NOTE (*5): based only on the bare mathematical meaning of X, simply pack the number X as one single byte value (*8) (*5b).


NOTE (*5b): more precisely: for X an integer number such that 0 <= X < dec 256 (these numbers X are probably older than most alphabets; in any case they are more fundamental), 'byte value X' is the hardware entity (for example, as translated and set/stored/temporarily established in a given specified sub-unit of persistent change in a telecom line, or in other hardware such as physical RAM, a hard disk, an optical disc, etc.) whose unique internationally recognized, primordially associated numerical value is X (*5c). All of this de facto refers to a PRIMORDIAL OCTAL ENCODING representation system for X, made of official or de facto industry standards (with subsystems and sub-subsystems, etc.), which is even more fundamental than its by far most widespread kind-of-competitor US-ASCII (which, however, is 7-bit, i.e. limited to 0 <= X <= dec 127 included).


NOTE (*5c): We assume that, at least as far as sequences of truly-eight-bit bytes are concerned, for any above value of X the representation of X called 'byte value X' is unambiguously defined at the hardware level; an important particular case is the entity 'byte value 0', aka 'byte value zero', aka 'null byte' (aka NUL or NULL in some programming languages and telecommunication systems).


NOTE (*8): outputting X as a byte value: 1. based only on the numerical meaning of X and 2. totally disregarding any notion, question, issue or meaning having to do with the notions of alphabet, letter, printable symbol, character, sign or signal, char encoding, etc. Even if the X in the current \x{...} (or similar Perl expression above) is dec 65, and this is a computer system using a truly-eight-bit EBCDIC dialect as its "native" char encoding, where (*13) the code point for 'A' is dec 193 instead of latin-1's dec 65: then so be it, don't translate, you don't know, switch to the primordial behaviour, THINK BINARY, NOT TEXT, and thus return dec 65, not dec 193. However, sometimes "afterthoughts" (involving alphabet- or character-oriented issues or so) may happen to be "run" later, externally to the latter expression or to the current process or computer subsystem; for example they could be run:

  1. directly by the currently running perl instance process , for example if we 'use open ":std" , ":encoding(UTF-8)"; ' or similar afterthoughts
  2. by external processes based on the output of that perl instance or, finally,
  3. only in the mind of a human being watching output of processes run or running under some computer OS

NOTE (*9): i.e. the translation into a -- more or less fixed and potentially multi-byte -- encoding (declared/selected in/by the current perl process, if not already in force in the overall computer environment/context/system perl happens to be running under); for example some advanced (or at least complicated) non-"native" encoding such as an improved version of UTF-EBCDIC or UTF-8 (*1).


NOTE (*10): The 'Quote and Quote-like Operators' section of the 'perlop' manpage (*13) reads also : << [...]

[7] The result is the character specified by the three-digit octal [...] or "\x{}" instead.

[8] Several constructs above specify a character by a number. That number gives the character's position in the character set encoding (indexed from 0). This is called synonymously its ordinal, code position, or code point. Perl works on platforms that have a native encoding currently of either ASCII/Latin1 or EBCDIC, each of which allow specification of 256 characters. In general, if the number is 255 (0xFF, 0377) or below, Perl interprets this in the platform's native encoding. If the number is 256 (0x100, 0400) or above, Perl interprets it as a Unicode code point and the result is the corresponding Unicode character. For example "\x{50}" and "\o{120}" both are the number 80 in decimal, which is less than 256, so the number is interpreted in the native character set encoding. In ASCII the character in the 80th position (indexed from 0) is the letter "P", and in EBCDIC it is the ampersand symbol "&". "\x{100}" and "\o{400}" are both 256 in decimal, so the number is interpreted as a Unicode code point no matter what the native encoding is. [...] >>


NOTE (*12) in other words, numbers X equal or greater than dec 256:

  • are not going to fit into a single byte value
  • they (*13) leave the critical, "native" char encoding battlefield X < dec 256 ,
  • and will probably undergo the blessing (or curse? :-) of preliminary attribution of Unicode meaning, i.e. probably of becoming Uni[code points] willing to be "translated" into the right bunch of byte values (*9)

NOTE (*13): case (*13b), with software which, in some sense mysterious for now, is set up to use some US-ASCII-compatible char encoding (maybe the latin-1 encoding, but this may be less important).


NOTE (*13b): at least under a GNU-Linux system , as of perl v5.38.2 and April 2024 GMT , installed on a PC-like hardware


NOTE (*19): An important web page - - as of April 16 2024 GMT - - reads: << the [...] encoding [...] mislabeled as [...] ISO-8859-1 [...] , see [...]-1252 >>


NOTE (*20): From the 'Quote and Quote-like Operators' section of the 'perlop' manpage (*13) : <<

   [...]

       \x{263A}     [1,8]  hex char          (example shown: SMILEY)
       \x{ 263A }          Same, but shows optional blanks inside and
                           adjoining the braces
       \x1b         [2,8]  restricted range hex char (example: ESC)
       \N{name}     [3]    named Unicode character or character sequence
       \N{U+263D}   [4,8]  Unicode character (example: FIRST QUARTER MOON)
       \c[          [5]    control char      (example: chr(27))
       \o{23072}    [6,8]  octal char        (example: SMILEY)
       \033         [7,8]  restricted range octal char  (example: ESC)

   [...]
Hilaire answered 14/4 at 0:1 Comment(4)
This takes the long route to get to what the accepted answer already says.Parody
"This takes the long route to get to what the accepted answer already says" . Thank you for giving me an occasion for bothering readers with self-advocacy. Sometimes context and details are important, caring of both can cost space-time. Maybe some posts of mine are pretty clear about some issue which could turn out to be crucial for some problems.Hilaire
Yeah, but most of what you show are things the OP is not doing and nobody else should do anyway.Parody
"Yeah, but most of what you show are things the OP is not doing and nobody else should do anyway." / / / / Sometimes the absence of negative rating could be a sign of lack of ethics. As of perl v5.38.2 and April 2024 GMT an interaction between maaartinus and ikegami took place in Sep 2012 GMT and lasts since then in stackoverflow.com , a sign that squeezing out the discussion about some aspects of Perl and of life may turn out to be usefulHilaire
