Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default.

I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or projects with big ambitions) should make their software UTF-8-proof from scratch. Still, I don't see it happening. For example, Moose enables strict and warnings, but not Unicode. Modern::Perl reduces boilerplate too, but no UTF-8 handling.

Why? Are there some reasons to avoid UTF-8 in modern Perl projects in the year 2011?


My reply to @tchrist's comments got too long, so I'm adding it here.

It seems that I did not make myself clear. Let me try to add some things.

tchrist and I see the situation pretty similarly, but our conclusions are at completely opposite ends. I agree, the situation with Unicode is complicated, but that is exactly why we (Perl users and coders) need some layer (or pragma) that makes UTF-8 handling as easy as it ought to be nowadays.

tchrist points to many aspects to cover; I will read and think about them for days or even weeks. Still, this is not my point. tchrist tries to prove that there is no one single way "to enable UTF-8". I don't have enough knowledge to argue with that, so I'll stick to live examples.

I played around with Rakudo, and UTF-8 was just there as I needed it. I didn't have any problems; it just worked. Maybe there are limitations somewhere deeper, but at the start, everything I tested worked as I expected.

Shouldn't that be a goal in modern Perl 5 too? I stress it once more: I'm not suggesting UTF-8 as the default character set for core Perl; I'm suggesting the possibility of triggering it in a snap for those who develop new projects.

Another example, but with a more negative tone. Frameworks should make development easier. Some years ago, I tried web frameworks, but just threw them away because "enabling UTF-8" was so obscure. I could not find how or where to hook in Unicode support. It was so time-consuming that I found it easier to go the old way. Now I see there was a bounty here to deal with the same problem with Mason 2: How to make Mason2 UTF-8 clean?. So it is a pretty new framework, but using it with UTF-8 requires deep knowledge of its internals. It is like a big red sign: STOP, don't use me!

I really like Perl. But dealing with Unicode is painful. I still find myself running into walls. In some ways tchrist is right and answers my question: new projects don't adopt UTF-8 because it is too complicated in Perl 5.

Sulfatize answered 28/5, 2011 at 15:12 Comment(16)
Hi folks - there are a few flags raised here on these comments. What I've done is taken a snapshot of the comments here and dropped them into this chat room, where you can carry on the discussion: chat.stackoverflow.com/rooms/846/…Scribe
I'm sorry, but I agree with @tchrist -- UTF-8 is extremely hard. There's no framework or tool that just "flips a switch" and then handles it correctly. It's something you have to think about directly when designing your application -- not something any kind of framework or language can handle for you. If Rakudo just happened to work for you, you were not adventurous enough with your test cases -- it will take several of the examples in @tchrist's answer and butcher them.Lemma
What exactly are you hoping Moose or Modern::Perl will do? Magically make randomly-encoded character data in files and databases into valid data again?Agnusago
@Billy ONeal: Looping over @tchrist's list, there is no one-and-only cure; I agree. Still, is there some common level of UTF-8 handling which can be plugged in just so and which helps a developer step into the game? I think the knowledge in the new module utf8::all is a very good beginning. If it (or similar functionality) were in core and perluniintro suggested it as a quick start, that would be much better.Sulfatize
@jrockway: What is the purpose of Modern::Perl? To reduce boilerplate and introduce the best practices available in today's Perl. Including UTF-8 handling suits that very well, IMHO. Similarly with Moose: it is a modern object system for Perl. So why not take another step and include UTF-8 as the default charset in Moose?Sulfatize
What does that mean? Moose has nothing to do with text manipulation. Why should it know about character encoding, much less choose a default one for you? (Anyway, the reason why the pragmas you list don't touch the encoding is because the convention is for Perl pragmas to affect lexical behavior. Assuming that the Entire World, other modules included, is UTF-8 is simply the Wrong Thing To Do. This isn't PHP or Ruby here.)Agnusago
(Also... "most Modern Perl apps" break on UTF-8? I've certainly never written an application, Perl or otherwise, that's not Unicode-clean.)Agnusago
NB: tchrist (Tom Christiansen) posted his materials for OSCON 2011 about Unicode (training.perl.com/OSCON2011/index.html). The talk titled "Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly" covers Unicode support in different programming languages. Only Google Go and Perl 5 have support for full Unicode, and only Google Go has it built in (no mention of Perl 6).Segment
Is your question specifically about any one operating system? The most voted answer seems to be Linux specific. Or at least specific to Unices other than MacOS X.Anaphylaxis
@hippietrail: I work mostly on Linux, but I have seen a lot of UTF-8-related Perl questions about Windows too. I know too little about Mac OS X, but as far as I understand, the same questions should apply on a Mac too. If not, I am glad about it, and I look forward to working with Perl on a Mac soon.Sulfatize
If I'm on a POSIX system and have $ENV{LC_ALL} set to e.g. "en_US.UTF-8", then that is an explicit statement of intent which Perl should honor by assuming that its standard input is encoded as UTF-8 and by encoding its standard output likewise. If my code breaks because it doesn't handle some of the many subtleties of Unicode, maybe I shouldn't run it in an environment that claims to be Unicode. I don't understand why Perl should ignore the locale settings in favor of whatever the heck its default is.Yorick
I've not looked into it much, but utf8::all seems to work for my basic needs. FWIW, I think the sort of (public) simplicity of utf-8 use in Java is something Perl could hugely benefit from.Beaton
I know this is a bit offtopic and trolly, but why not get rid of anachronistic languages like Perl and PHP and just use Python, and have Unicode be the default? To convert to a specific encoding, do 'string'.encode('utf-8') (you get b'string'), and to convert that binary string back to Unicode, do b'string'.decode('utf-8') (you get 'string'). Now you can stop thinking about it. That would be my way of getting things done in 2019. Being old usually means being stable, but it often also means not getting rid of ugly ways of doing things (this of course also affects Python).Maeda
@Nils Because if you have to worry about encoding and decoding binary bit patterns, you're doing it wrong. UTF-8 is nothing but an encoding, and you should never have to think about its individual, constituent byte-sized code units. At most you should be thinking of abstract code points — and not whether they're big- or little-endian, either. :) Encoding and decoding should virtually always happen at the very boundaries of the interface layering for interchange with external entities. Trust me, intra-converting code points w/bit patterns is the least of your worries when it comes to Unicode.Sheerlegs
@Sheerlegs I'm not sure I get your point. Python uses Unicode internally everywhere, and there is no need to worry about bits and bytes. len('aou') == len('äöü') == len('테스트'). If a module has no encoding declaration, Python assumes UTF-8 and decodes it to Unicode. The Windows filesystem and console encoding was changed to UTF-8 in v3.6. All relevant Python 3 libraries encode in UTF-8 and internally use Unicode. Only when open()ing files in text mode without the encoding parameter (which no library does) will Python still prefer locale.getpreferredencoding().Maeda
It will change in Perl 7.Epiglottis

𝙎𝙞𝙢𝙥𝙡𝙚𝙨𝙩 ℞ : 𝟕 𝘿𝙞𝙨𝙘𝙧𝙚𝙩𝙚 𝙍𝙚𝙘𝙤𝙢𝙢𝙚𝙣𝙙𝙖𝙩𝙞𝙤𝙣𝙨

  1. Set your PERL_UNICODE envariable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are global effects, not lexical ones.
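
  For example, in a POSIX-style shell (an assumption; adapt the syntax to your own shell):

    export PERL_UNICODE=AS    # same effect as invoking perl with -CAS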

  2. At the top of your source file (program, module, library, dohickey), prominently assert that you are running perl version 5.12 or better via:

    use v5.12;  # minimal for unicode string feature
    use v5.14;  # optimal for unicode string feature
    
  3. Enable warnings, since the previous declaration only enables strictures and features, not warnings. I also suggest promoting Unicode warnings into exceptions, so use both these lines, not just one of them. Note however that under v5.14, the utf8 warning class comprises three other subwarnings which can all be separately enabled: nonchar, surrogate, and non_unicode. These you may wish to exert greater control over.

    use warnings;
    use warnings qw( FATAL utf8 );
    
  4. Declare that this source unit is encoded as UTF‑8. Although once upon a time this pragma did other things, it now serves this one singular purpose alone and no other:

    use utf8;
    
  5. Declare that anything that opens a filehandle within this lexical scope but not elsewhere is to assume that that stream is encoded in UTF‑8 unless you tell it otherwise. That way you do not affect other modules’ or other programs’ code.

    use open qw( :encoding(UTF-8) :std );
    
  6. Enable named characters via \N{CHARNAME}.

    use charnames qw( :full :short );
    
  7. If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say:

    binmode(DATA, ":encoding(UTF-8)");
    

There is of course no end of other matters with which you may eventually find yourself concerned, but these will suffice to approximate the stated goal of “making everything just work with UTF‑8”, albeit for a somewhat weakened sense of those terms.
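
Pulled together, a minimal script following these seven recommendations might look like this (a sketch, assuming a UTF-8 terminal and perl 5.14 or newer):

    #!/usr/bin/env perl
    use v5.14;                             # strictures, say, unicode_strings
    use warnings;
    use warnings qw( FATAL utf8 );         # encoding errors are fatal
    use utf8;                              # this very file is UTF-8
    use open qw( :encoding(UTF-8) :std );  # default stream encoding, std handles included
    use charnames qw( :full );             # \N{CHARNAME} escapes

    my $word = "naïve";
    say length $word;                      # 5 (characters, not octets)
    say "\N{GREEK SMALL LETTER ALPHA}";    # α, encoded as UTF-8 on the way out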

One other pragma, although it is not Unicode related, is:

      use autodie;

It is strongly recommended.
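
For example, with autodie in effect a failing open throws a descriptive exception, so the usual "or die" boilerplate disappears (a sketch; the filename is hypothetical):

    use autodie;   # open, close, print, and friends now die on failure

    open my $fh, '<:encoding(UTF-8)', 'data.txt';   # no "or die ..." needed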

🌴 🐪🐫🐪 🌞 𝕲𝖔 𝕿𝖍𝖔𝖚 𝖆𝖓𝖉 𝕯𝖔 𝕷𝖎𝖐𝖊𝖜𝖎𝖘𝖊 🌞 🐪🐫🐪 🐁


🎁 🐪 𝕭𝖔𝖎𝖑𝖊𝖗⸗𝖕𝖑𝖆𝖙𝖊 𝖋𝖔𝖗 𝖀𝖓𝖎𝖈𝖔𝖉𝖊⸗𝕬𝖜𝖆𝖗𝖊 𝕮𝖔𝖉𝖊 🐪 🎁


My own boilerplate these days tends to look like this:

use 5.014;

use utf8;
use strict;
use autodie;
use warnings; 
use warnings    qw< FATAL  utf8     >;
use open        qw< :std  :utf8     >;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use File::Basename      qw< basename >;
use Carp                qw< carp croak confess cluck >;
use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

END { close STDOUT }

if (grep /\P{ASCII}/ => @ARGV) { 
   @ARGV = map { decode("UTF-8", $_) } @ARGV;
}

$0 = basename($0);  # shorter messages
$| = 1;

binmode(DATA, ":utf8");

# give a full stack dump on any untrapped exceptions
local $SIG{__DIE__} = sub {
    confess "Uncaught exception: @_" unless $^S;
};

# now promote run-time warnings into stack-dumped
#   exceptions *unless* we're in an try block, in
#   which case just cluck the stack dump instead
local $SIG{__WARN__} = sub {
    if ($^S) { cluck   "Trapped warning: @_" } 
    else     { confess "Deadly warning: @_"  }
};

while (<>)  {
    chomp;
    $_ = NFD($_);
    ...
} continue {
    say NFC($_);
}

__END__

🎅 𝕹 𝖔 𝕸 𝖆 𝖌 𝖎 𝖈 𝕭 𝖚 𝖑 𝖑 𝖊 𝖙 🎅


Saying that “Perl should [somehow!] enable Unicode by default” doesn’t even start to begin to think about getting around to saying enough to be even marginally useful in some sort of rare and isolated case. Unicode is much much more than just a larger character repertoire; it’s also how those characters all interact in many, many ways.

Even the simple-minded minimal measures that (some) people seem to think they want are guaranteed to miserably break millions of lines of code, code that has no chance to “upgrade” to your spiffy new Brave New World modernity.

It is way way way more complicated than people pretend. I’ve thought about this a huge, whole lot over the past few years. I would love to be shown that I am wrong. But I don’t think I am. Unicode is fundamentally more complex than the model that you would like to impose on it, and there is complexity here that you can never sweep under the carpet. If you try, you’ll break either your own code or somebody else’s. At some point, you simply have to break down and learn what Unicode is about. You cannot pretend it is something it is not.

🐪 goes out of its way to make Unicode easy, far more than anything else I’ve ever used. If you think this is bad, try something else for a while. Then come back to 🐪: either you will have returned to a better world, or else you will bring knowledge of the same with you so that we can make use of your new knowledge to make 🐪 better at these things.


💡 𝕴𝖉𝖊𝖆𝖘 𝖋𝖔𝖗 𝖆 𝖀𝖓𝖎𝖈𝖔𝖉𝖊 ⸗ 𝕬𝖜𝖆𝖗𝖊 🐪 𝕷𝖆𝖚𝖓𝖉𝖗𝖞 𝕷𝖎𝖘𝖙 💡


At a minimum, here are some things that would appear to be required for 🐪 to “enable Unicode by default”, as you put it:

  1. All 🐪 source code should be in UTF-8 by default. You can get that with use utf8 or export PERL5OPT=-Mutf8.

  2. The 🐪 DATA handle should be UTF-8. You will have to do this on a per-package basis, as in binmode(DATA, ":encoding(UTF-8)").

  3. Program arguments to 🐪 scripts should be understood to be UTF-8 by default. export PERL_UNICODE=A, or perl -CA, or export PERL5OPT=-CA.

  4. The standard input, output, and error streams should default to UTF-8. export PERL_UNICODE=S for all of them, or I, O, and/or E for just some of them. This is like perl -CS.

  5. Any other handles opened by 🐪 should be considered UTF-8 unless declared otherwise; export PERL_UNICODE=D, or with i and o for particular ones of these; export PERL5OPT=-CD would work. That makes -CSAD for all of them.

  6. Cover both bases plus all the streams you open with export PERL5OPT=-Mopen=:utf8,:std. See uniquote.

  7. You don’t want to miss UTF-8 encoding errors. Try export PERL5OPT=-Mwarnings=FATAL,utf8. And make sure your input streams are always binmoded to :encoding(UTF-8), not just to :utf8.

  8. Code points between 128 and 255 should be understood by 🐪 to be the corresponding Unicode code points, not just unpropertied binary values. use feature "unicode_strings" or export PERL5OPT=-Mfeature=unicode_strings. That will make uc("\xDF") eq "SS" and "\xE9" =~ /\w/. A simple export PERL5OPT=-Mv5.12 or better will also get that.

  9. Named Unicode characters are not by default enabled, so add export PERL5OPT=-Mcharnames=:full,:short,latin,greek or some such. See uninames and tcgrep.

  10. You almost always need access to the functions from the standard Unicode::Normalize module for various types of decompositions. export PERL5OPT=-MUnicode::Normalize=NFD,NFKD,NFC,NFKC, and then always run incoming stuff through NFD and outbound stuff through NFC. There’s no I/O layer for these yet that I’m aware of, but see nfc, nfd, nfkd, and nfkc.

  11. String comparisons in 🐪 using eq, ne, lc, cmp, sort, &c&cc are always wrong. So instead of @a = sort @b, you need @a = Unicode::Collate->new->sort(@b); see the sketch after this list. Might as well add that to your export PERL5OPT=-MUnicode::Collate. You can cache the key for binary comparisons.

  12. 🐪 built-ins like printf and write do the wrong thing with Unicode data. You need to use the Unicode::GCString module for the former, and both that and also the Unicode::LineBreak module as well for the latter. See uwc and unifmt.

  13. If you want them to count as integers, then you are going to have to run your \d+ captures through the Unicode::UCD::num function because 🐪’s built-in atoi(3) isn’t currently clever enough.

  14. You are going to have filesystem issues on 👽 filesystems. Some filesystems silently enforce a conversion to NFC; others silently enforce a conversion to NFD. And others do something else still. Some even ignore the matter altogether, which leads to even greater problems. So you have to do your own NFC/NFD handling to keep sane.

  15. All your 🐪 code involving a-z or A-Z and such MUST BE CHANGED, including m//, s///, and tr///. It should stand out as a screaming red flag that your code is broken. But it is not clear how it must change. Getting the right properties, and understanding their casefolds, is harder than you might think. I use unichars and uniprops every single day.

  16. Code that uses \p{Lu} is almost as wrong as code that uses [A-Za-z]. You need to use \p{Upper} instead, and know the reason why. Yes, \p{Lowercase} and \p{Lower} are different from \p{Ll} and \p{Lowercase_Letter}.

  17. Code that uses [a-zA-Z] is even worse. And it can’t use \pL or \p{Letter}; it needs to use \p{Alphabetic}. Not all alphabetics are letters, you know!

  18. If you are looking for 🐪 variables with /[\$\@\%]\w+/, then you have a problem. You need to look for /[\$\@\%]\p{IDS}\p{IDC}*/, and even that isn’t thinking about the punctuation variables or package variables.

  19. If you are checking for whitespace, then you should choose between \h and \v, depending. And you should never use \s, since it DOES NOT MEAN [\h\v], contrary to popular belief.

  20. If you are using \n for a line boundary, or even \r\n, then you are doing it wrong. You have to use \R, which is not the same!

  21. If you don’t know when and whether to call Unicode::Stringprep, then you had better learn.

  22. Case-insensitive comparisons need to check for whether two things are the same letters no matter their diacritics and such. The easiest way to do that is with the standard Unicode::Collate module: Unicode::Collate->new(level => 1)->cmp($a, $b). There are also eq methods and such, and you should probably learn about the match and substr methods, too. These have distinct advantages over the 🐪 built-ins. See the sketch after this list.

  23. Sometimes that’s still not enough, and you need the Unicode::Collate::Locale module, as in Unicode::Collate::Locale->new(locale => "de__phonebook", level => 1)->cmp($a, $b) instead. Consider that Unicode::Collate->new(level => 1)->eq("d", "ð") is true, but Unicode::Collate::Locale->new(locale => "is", level => 1)->eq("d", "ð") is false. Similarly, "ae" and "æ" are eq if you don’t use locales, or if you use the English one, but they are different in the Icelandic locale. Now what? It’s tough, I tell you. You can play with ucsort to test some of these things out.

  24. Consider how to match the pattern CVCV (consonant, vowel, consonant, vowel) in the string “niño”. Its NFD form — which you had darned well better have remembered to put it into — becomes “nin\x{303}o”. Now what are you going to do? Even pretending that a vowel is [aeiou] (which is wrong, by the way), you won’t be able to do something like (?=[aeiou])\X either, because even in NFD a code point like ‘ø’ does not decompose! However, it will test equal to an ‘o’ using the UCA comparison I just showed you. You can’t rely on NFD, you have to rely on UCA.
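
Here is the sketch promised in items 11 and 22 (assuming only the standard Unicode::Collate module; expected output in comments):

    use v5.14;
    use utf8;
    use open qw( :std :encoding(UTF-8) );
    use Unicode::Collate;

    # Item 22: a primary-strength collator ignores case and diacritics,
    # so "same letters" comparisons work across accents.
    my $coll = Unicode::Collate->new(level => 1);
    say $coll->eq("niño", "nino") ? "equal at level 1" : "different";  # equal at level 1

    # Item 11: UCA sorting instead of the code-point-order builtin sort.
    my @sorted = Unicode::Collate->new->sort(qw( zebra résumé apple resume ));
    say "@sorted";  # apple resume résumé zebra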


💩 𝔸 𝕤 𝕤 𝕦 𝕞 𝕖 𝔹 𝕣 𝕠 𝕜 𝕖 𝕟 𝕟 𝕖 𝕤 𝕤 💩


And that’s not all. There are a million broken assumptions that people make about Unicode. Until they understand these things, their 🐪 code will be broken.

  1. Code that assumes it can open a text file without specifying the encoding is broken.

  2. Code that assumes the default encoding is some sort of native platform encoding is broken.

  3. Code that assumes that web pages in Japanese or Chinese take up less space in UTF‑16 than in UTF‑8 is wrong.

  4. Code that assumes Perl uses UTF‑8 internally is wrong.

  5. Code that assumes that encoding errors will always raise an exception is wrong.

  6. Code that assumes Perl code points are limited to 0x10_FFFF is wrong.

  7. Code that assumes you can set $/ to something that will work with any valid line separator is wrong.

  8. Code that assumes roundtrip equality on casefolding, like lc(uc($s)) eq $s or uc(lc($s)) eq $s, is completely broken and wrong. Consider that uc("σ") and uc("ς") are both "Σ", but lc("Σ") cannot possibly return both of those. See the sketch after this list.

  9. Code that assumes every lowercase code point has a distinct uppercase one, or vice versa, is broken. For example, "ª" is a lowercase letter with no uppercase; whereas both "ᵃ" and "ᴬ" are letters, but they are not lowercase letters; however, they are both lowercase code points without corresponding uppercase versions. Got that? They are not \p{Lowercase_Letter}, despite being both \p{Letter} and \p{Lowercase}.

  10. Code that assumes changing the case doesn’t change the length of the string is broken.

  11. Code that assumes there are only two cases is broken. There’s also titlecase.

  12. Code that assumes only letters have case is broken. Beyond just letters, it turns out that numbers, symbols, and even marks have case. In fact, changing the case can even make something change its main general category, like a \p{Mark} turning into a \p{Letter}. It can also make it switch from one script to another.

  13. Code that assumes that case is never locale-dependent is broken.

  14. Code that assumes Unicode gives a fig about POSIX locales is broken.

  15. Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment.

  16. Code that assumes that diacritics \p{Diacritic} and marks \p{Mark} are the same thing is broken.

  17. Code that assumes \p{GC=Dash_Punctuation} covers as much as \p{Dash} is broken.

  18. Code that assumes dash, hyphens, and minuses are the same thing as each other, or that there is only one of each, is broken and wrong.

  19. Code that assumes every code point takes up no more than one print column is broken.

  20. Code that assumes that all \p{Mark} characters take up zero print columns is broken.

  21. Code that assumes that characters which look alike are alike is broken.

  22. Code that assumes that characters which do not look alike are not alike is broken.

  23. Code that assumes there is a limit to the number of code points in a row that just one \X can match is wrong.

  24. Code that assumes \X can never start with a \p{Mark} character is wrong.

  25. Code that assumes that \X can never hold two non-\p{Mark} characters is wrong.

  26. Code that assumes that it cannot use "\x{FFFF}" is wrong.

  27. Code that assumes a non-BMP code point that requires two UTF-16 (surrogate) code units will encode to two separate UTF-8 characters, one per code unit, is wrong. It doesn’t: it encodes to a single code point.

  28. Code that transcodes from UTF‐16 or UTF‐32 with leading BOMs into UTF‐8 is broken if it puts a BOM at the start of the resulting UTF-8. This is so stupid the engineer should have their eyelids removed.

  29. Code that assumes CESU-8 is a valid UTF encoding is wrong. Likewise, code that thinks encoding U+0000 as "\xC0\x80" is UTF-8 is broken and wrong. These guys also deserve the eyelid treatment.

  30. Code that assumes characters like > always point to the right and < always point to the left is wrong — because in fact they do not.

  31. Code that assumes if you first output character X and then character Y, that those will show up as XY is wrong. Sometimes they don’t.

  32. Code that assumes that ASCII is good enough for writing English properly is stupid, shortsighted, illiterate, broken, evil, and wrong. Off with their heads! If that seems too extreme, we can compromise: henceforth they may type only with their big toe from one foot. (The rest will be duct taped.)

  33. Code that assumes that all \p{Math} code points are visible characters is wrong.

  34. Code that assumes \w contains only letters, digits, and underscores is wrong.

  35. Code that assumes that ^ and ~ are punctuation marks is wrong.

  36. Code that assumes that ü has an umlaut is wrong.

  37. Code that believes things like ₨ contain any letters in them is wrong.

  38. Code that believes \p{InLatin} is the same as \p{Latin} is heinously broken.

  39. Code that believe that \p{InLatin} is almost ever useful is almost certainly wrong.

  40. Code that believes that given $FIRST_LETTER as the first letter in some alphabet and $LAST_LETTER as the last letter in that same alphabet, that [${FIRST_LETTER}-${LAST_LETTER}] has any meaning whatsoever is almost always completely broken and wrong and meaningless.

  41. Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.

  42. Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again. Period. I’m not even positive they should be allowed to see again, since it obviously hasn’t done them much good so far.

  43. Code that believes there’s some way to pretend textfile encodings don’t exist is broken and dangerous. Might as well poke the other eye out, too.

  44. Code that converts unknown characters to ? is broken, stupid, braindead, and runs contrary to the standard recommendation, which says NOT TO DO THAT! RTFM for why not.

  45. Code that believes it can reliably guess the encoding of an unmarked textfile is guilty of a fatal mélange of hubris and naïveté that only a lightning bolt from Zeus will fix.

  46. Code that believes you can use 🐪 printf widths to pad and justify Unicode data is broken and wrong.

  47. Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong. Stop being surprised by this!

  48. Code that believes UTF-16 is a fixed-width encoding is stupid, broken, and wrong. Revoke their programming licence.

  49. Code that treats code points from one plane one whit differently than those from any other plane is ipso facto broken and wrong. Go back to school.

  50. Code that believes that stuff like /s/i can only match "S" or "s" is broken and wrong. You’d be surprised.

  51. Code that uses \PM\pM* to find grapheme clusters instead of using \X is broken and wrong.

  52. People who want to go back to the ASCII world should be whole-heartedly encouraged to do so, and in honor of their glorious upgrade they should be provided gratis with a pre-electric manual typewriter for all their data-entry needs. Messages sent to them should be sent via an ᴀʟʟᴄᴀᴘs telegraph at 40 characters per line and hand-delivered by a courier. STOP.
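
Here is the sketch promised in item 8 of the list above (assuming perl 5.16 or later for fc(); expected output in comments):

    use v5.16;   # for fc(), full Unicode casefolding
    use utf8;
    use open qw( :std :encoding(UTF-8) );

    say uc("σ");   # Σ
    say uc("ς");   # Σ   (two lowercase sigmas, one uppercase)
    say lc("Σ");   # σ   (the final-sigma form cannot come back)

    # For caseless comparison, fold; do not lc() or uc():
    say "match" if fc("ΣΊΣΥΦΟΣ") eq fc("Σίσυφος");   # match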


😱 𝕾 𝖀 𝕸 𝕸 𝕬 𝕽 𝖄 😱


I don’t know how much more “default Unicode in 🐪” you can get than what I’ve written. Well, yes I do: you should be using Unicode::Collate and Unicode::LineBreak, too. And probably more.

As you see, there are far too many Unicode things that you really do have to worry about for there to ever exist any such thing as “default to Unicode”.

What you’re going to discover, just as we did back in 🐪 5.8, is that it is simply impossible to impose all these things on code that hasn’t been designed right from the beginning to account for them. Your well-meaning selfishness just broke the entire world.

And even once you do, there are still critical issues that require a great deal of thought to get right. There is no switch you can flip. Nothing but brain, and I mean real brain, will suffice here. There’s a heck of a lot of stuff you have to learn. Modulo the retreat to the manual typewriter, you simply cannot hope to sneak by in ignorance. This is the 21ˢᵗ century, and you cannot wish Unicode away by willful ignorance.

You have to learn it. Period. It will never be so easy that “everything just works,” because that will guarantee that a lot of things don’t work — which invalidates the assumption that there can ever be a way to “make it all work.”

You may be able to get a few reasonable defaults for a very few and very limited operations, but not without thinking about things a whole lot more than I think you have.

As just one example, canonical ordering is going to cause some real headaches. 😭"\x{F5}" ‘õ’, "o\x{303}" ‘õ’, "o\x{303}\x{304}" ‘ȭ’, and "o\x{304}\x{303}" ‘ō̃’ should all match ‘õ’, but how in the world are you going to do that? This is harder than it looks, but it’s something you need to account for. 💣
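
A sketch of the normalization half of that problem, using only the standard Unicode::Normalize module (the multi-mark ‘ȭ’ and ‘ō̃’ cases additionally need a UCA comparison such as the Unicode::Collate examples above):

    use v5.14;
    use utf8;
    use Unicode::Normalize qw( NFD );

    for my $s ("\x{F5}", "o\x{303}") {   # precomposed õ, and o + COMBINING TILDE
        say NFD($s) eq NFD("õ") ? "canonically equal" : "different";  # canonically equal
    }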

If there’s one thing I know about Perl, it is what its Unicode bits do and do not do, and this thing I promise you: “ ̲ᴛ̲ʜ̲ᴇ̲ʀ̲ᴇ̲ ̲ɪ̲s̲ ̲ɴ̲ᴏ̲ ̲U̲ɴ̲ɪ̲ᴄ̲ᴏ̲ᴅ̲ᴇ̲ ̲ᴍ̲ᴀ̲ɢ̲ɪ̲ᴄ̲ ̲ʙ̲ᴜ̲ʟ̲ʟ̲ᴇ̲ᴛ̲ ̲ ” 😞

You cannot just change some defaults and get smooth sailing. It’s true that I run 🐪 with PERL_UNICODE set to "SA", but that’s all, and even that is mostly for command-line stuff. For real work, I go through all the many steps outlined above, and I do it very, very carefully.


😈 ¡ƨdləɥ ƨᴉɥʇ ədoɥ puɐ ʻλɐp əɔᴉu ɐ əʌɐɥ ʻʞɔnl poo⅁ 😈

Sheerlegs answered 28/5, 2011 at 17:9 Comment(48)
As Sherm Pendley pointed out: "All!" If I write something new today, UTF-8 should be the easiest way to get things done. It is not. Your boilerplate proves it. Not everyone has the knowledge to turn so many tumblers to the right positions. I'm sorry, I had a long and hard day, so tomorrow I will comment more in the main entry, with examples.Sulfatize
@wk: So it’s cool that code like perl -i.bak -pe 's/foo/bar/' breaks? There’s a helluva lot of that in the world. What sort of comparison do you want for eq? A UCA3 compare? Does lc turn it into UCA1? How can you know? How’ll you match partial and/or discontiguous glyphs? Is it OK that all old code with 8-bit data in it now fails to compile? Is it ok that Perl no longer works on binary data? Is it ok to get different answers? Is it ok to diddle a-z out from under people without their consent? Is it ok to break up graphemes? Is a 100x slowdown in sort code acceptable? What about filesys?Sheerlegs
@tchrist: Why should it break some old code when we enable Unicode use in new projects? Let's forget legacy code and core Perl. For example, is there any reason to avoid UTF-8 in Moose-based projects? If not, I think Moose could enable UTF-8 support as widely as possible, just as it enables the warnings and strict pragmas. Now we are just wasting time, because there is already a lot of code written with Moose which may break ;)Sulfatize
One conclusion should be obvious from reading the list above: Don't case-fold. Just don't. Ever. Computationally expensive and with semantics that depend crucially on whatever it is that "locale" tries unsuccessfully to identify.Erasure
Am I the only one who finds it ironic that this post by tchrist renders so wildly differently on FF/Chrome/IE/Opera, sometimes to the point of illegibility?Disulfide
@wk: the obvious problem will be utf8 bleed: data travelling from Unicode-aware contexts to non-Unicode-aware contexts (i.e., the 3rd-party code you use in your code, your database, your host environment (OS, filesystem, etc.)), which can have dangerous consequences. If you're not using prepared/bound queries for your database interface, I have bad news for you.Excruciating
Curious, though: was it your intent to have many of the Unicode characters unreadable due to lack of font support?Dehydrate
@Kent: If they've not used prepared queries, they'll probably have some other surprises coming their way too, SQL injection being one of the script kiddies favorite attack methods these days. Moreover, their code will be slow…Juliusjullundur
While I generally like the post, and did upvote, one thing bugs the hell out of me. There is a lot of "code that ... is broken". While I don't argue with the statements, I think it would be good to show the brokenness. That would move (this part of the answer) from rant to education.Cluck
Though I don't fully agree with some implications of this answer (I feel there is indeed some problem with the Perl CULTURE in relation to Unicode), that is a matter for discussion: this is a great answer, and this is what makes SO so valuable. I agree especially with the 'assume brokenness' general motto (not only for Perl).Internment
@Dehydrate No, I didn’t use intentionally problematic code points; it’s a plot to get you to install George Douros’s super-awesome Symbola font, which covers Unicode 6.0. 😈 @depesz There isn’t room here to explain why each broken assumption is wrong. @Internment Lots and lots of this applies to Unicode in general, not just Perl. Some of this material may show up in 🐪 Programming Perl 🐪, 4th edition, due out in October. 🎃 I’ve one month left to ✍ work on it, and Unicode is ᴍᴇɢᴀ there; regexes, tooSheerlegs
@tchrist: Thank you for your great answer; it helped me see the big picture. Still, I believe that resolving all the problems you pointed out will take at least 10 years, and to resolve them effectively we need to use Unicode day by day. If we say "Unicode issues are too complex; let's resolve them first and only then use it", we can't move forward. And most growing software will not incorporate UTF-8 until this Great Day, even at a minimum level. Having some clear point to rely on is a must (like utf8::all, but I'd prefer it in core). You may call it naïveté.Sulfatize
Even with the Symbola font installed, MSIE 9 does not render the camels and other symbols. Firefox 3.6 on the same Windows 7 PC does render all characters.Jehovist
I installed Symbola, it doesn't fix it in Chrome. Wonder if I need to restart? Unicode is hard.Generable
@Smackfu: Symbola made it work just fine for me under Chrome, which is pretty much as good as Opera. Safari has the right glyphs, but seems to have non-scaling ideas of certain text blocks. Wonder why your Chrome isn’t good but mine is?Sheerlegs
@leonbloy: You said “I feel there is indeed some problem with the Perl CULTURE in relation with Unicode”, and I am 𝔼𝕏𝕋ℝ𝔼𝕄𝔼𝕃𝕐 interested in hearing more about your point of view here. I happen to agree with you, but I don’t want to “lead the witness” and put words in your mouth. If there isn’t enough room here to go into it, please don’t hesitate to send me mail about this at my standard address of 𝕿𝖔𝖒 𝕮𝖍𝖗𝖎𝖘𝖙𝖎𝖆𝖓𝖘𝖊𝖓 <𝖙𝖈𝖍𝖗𝖎𝖘𝖙@𝖕𝖊𝖗𝖑。𝖈𝖔𝖒> — 𝒔𝒊𝒄𝒖𝒕 𝒊𝒏 𝒑𝒓𝒊𝒏𝒄𝒊𝒑𝒊𝒐 𝒆𝒕 𝒏𝒖𝒏𝒄 𝒆𝒕 𝒔𝒆𝒎𝒑𝒆𝒓.Sheerlegs
The "use strict" in the boilerplate is superfluous, if you've said "use 5.14.0" then it's on by default.Homochromous
@Mark: No, it is not superfluous. I don't know who is going to decide that they don’t want to go all the way to 5.14. If they back down far enough, the strict goes away, and I never want that to happen. Therefore it is not superfluous. Plus it is declarative and therefore useful. Similarly, I like to make the unicode_strings feature explicit so that people realize it’s in effect. That's like how I often initialize things to 0 even when I don't have to: I like to signal my intent. I’m not fond of secret side-effects.Sheerlegs
Interestingly, after installing the Symbola fonts (Ubuntu/Chromium), some of the symbols show up, the others that are still boxes, if I highlight and right click, Chromium offers to search google for that character, which is shown perfectly in the context menu!Fann
Perfect answer. But the main point of the question still stands: in the 21st century, working with Unicode SHOULD be much, much easier and more intuitive. Yes, I understand that "no magic bullet is here". But the framework developers (like the above Mason 2) really SHOULD care about it. Yes, I understand that it is volunteer work, and when I don't like a framework, it is easy not to use it. But all the Unicode madness in Perl really HURTS Perl itself.Infection
@jm666 This much I grant you: that we should adopt a zero-tolerance policy vis-à-vis Unicode compatibility in all new code. Yes, you will have to make a distinction between binary files of bytes and text files with characters in them, but the sorely abused Pʀɪsᴏɴᴇʀs Oꜰ Bɪʟʟ have had to do this since time immemorial. As far as I am concerned, any new code that deals with text MUST ASSUME AND UNDERSTAND UNICODE. I give suggestions above for how one might selectively upgrade some existing programs through envariables. But everybody needs to know Unicode. This has 0 to do with Perl.Sheerlegs
@Sheerlegs - You're right, and I understand and agree with your answer. 1. (As you said) module developers: I love volunteer developers, but in the 21st century there should be zero tolerance for non-Unicode-ready CPAN submissions. Simply delete them, or (at least) FLAG them. Non-Unicode-ready modules HURT Perl! 2. Perl 6: I hope Perl 6 will have UTF-8 enabled by default (because it doesn't need to maintain backward compatibility - honestly, I know nothing about Perl 6 yet). 3. Something like uni::perl (which I'm using), or something like your boilerplate, should be in CORE - for easily enabling all the common UTF-8 features.Infection
@ysth: Considering that I personally have specific permission from Tim to use the 🐪 when discussing Perl for my website, writings, and business, and considering that I’m the primary author of the 4th edition of Programming Perl that I still haven’t finished the draft of, I find it highly unlikely that Tim would be annoyed. I certainly hope not. I’m sure I could spin this as an advert for him if I were hard-pressed to do so.Sheerlegs
@tchrist: it was a joke. but can the 4th edition s/referent/thingy/ ? I liked it soo much better.Cherriecherrita
@ysth: I dunno. In 2E we took a lot of flak for thingie. So in 3E Jon and I, and perhaps also Damian, prevailed upon Larry to go for referent. But I confess I have at times resorted to thingie again. But it seems a strange mix having invocant on one side and thingiemadongle on the other, eh? Larry has Right-of-Last-Edit on 4E, so we’ll see what he does when he gets to those chapters (which I’m already done with).Sheerlegs
Just found that the boilerplate does not work fully, because of a bug in autodie. When the open qw(:utf8 :std) pragma is in use, use autodie somewhat turns it off. So either open or autodie - not both... ;) (old Perl bug: #4959884)Infection
@jm666: Yes, that’s right. I forgot to mention that. I found it, too. It’s rather annoying. And technically, it should really be use open qw<:encoding(UTF-8) :std> because you should be using the strict version of utf8, not the loose one.Sheerlegs
Yes, :) but even so it is hard to convert the boilerplate into a package to make "use My::CorrectUtfPerl" possible (as you can see in https://mcmap.net/q/65686/-how-to-make-quot-use-my-defaults-quot-with-modern-perl-amp-utf8-defaults/632407).Infection
"Code that assumes that ü has an umlaut is wrong." - Why? I search and found 2 articles about that theme: en.wikipedia.org/wiki/%C3%9C -> en.wikipedia.org/wiki/Diaeresis_(diacritic). Quote from second article: "The two uses originated separately, with the diaeresis being considerably older. In modern computer systems using Unicode, the umlaut and diaeresis diacritics are identical: ‹ä› represents both a-umlaut and a-diaeresis." Is it true for Perl or not?Liege
@nordicdyno: When in NFC it doesn’t have COMBINING DIAERESIS. Also, there are lookalikes like NKO COMBINING DOUBLE DOT ABOVE. But yes, the name of the mark is diaeresis. The two functions are different linguistically: in the Spanish word Argüelles, for example, there is no umlaut happening, and similarly in French naïve. The point is that you often cannot judge something by its appearance.Sheerlegs
@Sheerlegs how can I tell if I'm seeing your post entirely correctly? I've installed the symbola font, which has improved things considerably, but there are still some white squares -- are there supposed to be? I need a unit test!!!Soniasonic
Re "Code that believes someone’s name can only contain certain characters is stupid, offensive, and wrong.", Did Unicode adopt the new name of the artist formerly known as Prince?Stereochemistry
Amazingly good answer, very useful even for us who don't spend much time in perl :) BTW, on 23. (locale-dependent collate), you've got " ð" where it should be "ð" (making it trivially non-equal =P)Chaffer
@Sheerlegs "Code that tries to reduce Unicode to ASCII is not merely wrong, its perpetrator should never be allowed to work in programming again." So you're firing the Stack Exchange team?Crumble
@tchrist: But seriously, +1 for the broken assumptions section. These need to be more widely known.Crumble
What does the (℞) after 𝙎𝙞𝙢𝙥𝙡𝙚𝙨𝙩 mean?Valuate
@J.F.Sebastian That's the PRESCRIPTION TAKE code point. It's the Rx symbol.Sheerlegs
training.perl.com is currently down, but utilities such as unifmt can be found on CPAN, e.g. search.cpan.org/perldoc?unifmtStringhalt
This is an amazingly great answer! But I must nitpick an important point. Perl runs happily on many platforms, but this answer seems to treat only mainstream Unix-style OSes. For instance, all the export FOO=BAR lines won't work on Windows; some of the stuff about "alien" filesystems will be wrong, as Windows uses UTF-16, and Mac OS X, though it uses UTF-8, enforces a specific normalization form which doesn't change as new Unicode editions emerge. Running on those OSes, they will be the native filesystems and Unix filesystems will be the alien ones.Anaphylaxis
"Code that believes once you successfully create a file by a given name, that when you run ls or readdir on its enclosing directory, you’ll actually find that file with the name you created it under is buggy, broken, and wrong." - I say any filesystem API that does that to it's user is the source of problem, and it - not user's code - should be fixed. Why would this kind of behaviour be considered correct? In your words, authors of such filesystem APIs should be... well, pick any horrific punishment yourself, you seem to be VERY good at it.Tillietillinger
(1) A lot of what you say is true, in theory, but in practice -- well, very few people are talented enough to simultaneously handle every stipulation you make and ever get anything useful done; certainly not in large enough numbers to service even a significant fraction of the companies who need Perl programmers. (2) Even if enabling the boilerplate you include by default is insanity, there is no reason that some future version of Perl can't have the corresponding use $version pragma enable it on demand. It's even crazier that ~50 lines of code are required just to 'enable Unicode' in Perl.Hannigan
A final note is that many of the stipulations made here imply that individuals (including native speakers) of any given language will have an absolutely correct understanding of their own language; this is never true in the general case, and so rarely true of specific individuals that it can be dismissed entirely as a possibility; it also assumes "correct" behavior is always desired! Programs exist to serve people, not to be perfect-- correctness is simply a side effect of performing the desired behavior consistently-- even when the desired behavior is "incorrect"!Hannigan
I think that as of perl 5.16 you can use the fc() builtin instead of Unicode::CaseFold.Proletarian
binmode(DATA, ":utf8"); is redundant with use utf8; since DATA is simply the file handle the Perl parser uses to read the file.Stereochemistry
@Stereochemistry You’re right. You only need it if you don’t have use utf8; in that compilation unit. Because a lot of times you don’t, it’s listed separately, as use open won’t catch it.Sheerlegs
You forgot to mention number 53: Code that assumes that characters have properties that can be trusted is broken. Every edition of the Unicode standard has things that are broken in it. You cite a perfect example: upper-case superscript "A" is lower-case. Under no circumstances can that be correct, so you can bet that that will change in the future, breaking all code you write today. And I know all about the "guarantees" that Unicode makes regarding future-proofing, they are almost as broken as Unicode is.Deodorant
It's 2015 and I've a fully patched OS/browser (Firefox on Windows 7) and parts of that answer still don't render correctly. What do I do to see this answer as it was supposed to be seen (or are all the bad control characters part of the point?)Referential
@Alex It’s always looked fine on Macs.Sheerlegs

There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".

tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.

This post aims to solve that first problem.

When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.

Marvelous.

Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like: perl -E 'while(<>){ chomp; say length }' and type in 文字化け and you get 12... not the correct answer, 4.

That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.

That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode (or use Encode qw(decode), of course). That function takes some string from the outside world (what we'll call "octets", a fancy way of saying "8-bit bytes") and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.

(There is also Encode::decode_utf8, which assumes UTF-8 for the encoding.)

If we rewrite our one-liner:

perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'

We type in 文字化け and get "4" as the result. Success.

That, right there, is the solution to 99% of Unicode problems in Perl.

The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.

The other half of the problem is getting data out of your program. That's easy too; you just say use Encode qw(encode), decide what encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data) instead of just outputting $data.

This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.

To summarize: encode all outputs and decode all inputs.
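
The whole round trip, then, looks like this (a sketch; the octets are hard-coded for illustration):

use Encode qw( decode encode );

my $octets = "\xE6\x96\x87";             # three octets: the UTF-8 encoding of 文
my $text   = decode("UTF-8", $octets);   # decode input: now ONE Perl character
my $result = $text x 3;                  # work on characters in the middle
print encode("UTF-8", $result);          # encode output: back to octets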

Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content, not content, which is just the octet stream that it got from the server.) Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.
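
With LWP, for example, that distinction looks like this (a sketch; the URL is a placeholder):

use LWP::UserAgent;

my $res    = LWP::UserAgent->new->get('http://example.com/');
my $text   = $res->decoded_content;   # characters, decoded per the response's declared charset
my $octets = $res->content;           # the raw octet stream, exactly as received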

Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode!

The second problem is UTF-8 encoded source code. Unless you say use utf8 at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ', you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.

The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æå­åã" (which is 12 characters, some of which are non-printing).

This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.

People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.

Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO to each line, and you print it out. You write it like:

while(<>){
    chomp;
    say "$_ 💩";
}

And then run on some UTF-8 encoded data, like:

perl poo.pl input-data.txt

It prints the UTF-8 data with a poo at the end of each line. Perfect, my program works!

But nope, you're just doing binary concatenation. You're reading octets from the file, removing a \n with chomp, and then tacking on the bytes in the UTF-8 representation of the PILE OF POO character. When you revise your program to decode the data from the file and encode the output, you'll notice that you get garbage ("ð©") instead of the poo. This will lead you to believe that decoding the input file is the wrong thing to do. It's not.

The problem is that the poo is being implicitly upgraded as Latin-1. If you use utf8 to make the literal text instead of binary, then it will work again!
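
The repaired program decodes its input, uses a character literal, and encodes its output (a sketch):

use utf8;                        # the 💩 literal is now characters, not octets
use Encode qw( decode encode );

while (<>) {
    $_ = decode("UTF-8", $_);            # octets from the file become characters
    chomp;
    print encode("UTF-8", "$_ 💩\n");    # characters become octets for the terminal
}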

(That's the number one problem I see when helping people with Unicode. They did part of it right, and that broke their program. That's what's sad about undefined results: you can have a working program for a long time, but when you start to repair it, it breaks. Don't worry; if you are adding encode/decode statements to your program and it breaks, it just means you have more work to do. Next time, when you design with Unicode in mind from the beginning, it will be much easier!)

That's really all you need to know about Perl and Unicode. If you tell Perl what your data is, it has the best Unicode support among all popular programming languages. If you assume it will magically know what sort of text you are feeding it, though, then you're going to trash your data irrevocably. Just because your program works today on your UTF-8 terminal doesn't mean it will work tomorrow on a UTF-16 encoded file. So make it safe now, and save yourself the headache of trashing your users' data!

The easy part of handling Unicode is encoding output and decoding input. The hard part is finding all your input and output, and determining which encoding it is. But that's why you get the big bucks :)

Agnusago answered 31/5, 2011 at 18:48 Comment(1)
The principle is explained well, but the practical approach for I/O is missing. Explicitly using the Encode module is tedious and error-prone, and it makes reading the code concerning I/O really painful. I/O layers provide a solution as they transparently encode and decode, where needed. open and binmode allow for their specification, and pragma open sets the defaults, as tchrist recommends in his answer.Claret

We're all in agreement that it is a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.

There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".

As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make solving common problems easier. That's the reason that we're programmers.

If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
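
For reference, the intended usage is a single line (a sketch; the precise set of effects depends on the module version):

use utf8::all;   # UTF-8 source, decoded @ARGV, and UTF-8 default I/O layers in one shot

print length("文字化け"), "\n";   # 4 (characters, not octets)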


Pilchard answered 29/5, 2011 at 18:59 Comment(3)
I see lots of room for improvement in the cited utf8::all module. It was written before the unicode_strings feature, which Fɪɴᴀʟʟʏ ᴀɴᴅ ᴀᴛ Lᴏɴɢ Lᴀsᴛ fixes regexes to have a /u on them. I’m not convinced it raises an exception on encoding errors, and that is something you truly must have. It doesn’t load in the use charnames ":full" pragma, which isn’t yet autoloaded. It doesn’t warn on [a-z] and such, printf string widths, using \n instead of \R and . instead of \X, but maybe those’re more a Perl::Critic matter. If it were I, I’d add 𝐍𝐅𝐃 in and 𝐍𝐅𝐂 out.Sheerlegs
@Sheerlegs The issue tracker for utf8::all is here. github.com/doherty/utf8-all/issues They'd love to hear your suggestions.Gilmer
@Schwern: ᴇɴᴏᴛᴜɪᴛs, but feel free to pilfer and pinch from the stuff I’ve written here. To be honest, I’m still feeling/learning what can be done vs what should be done, and where. Here’s a nice example of offloading sorting: unichars -gs '/(?=\P{Ll})\p{Lower}|(?=\P{Lu})\p{Upper}/x' | ucsort --upper | cat -n | less -r. Similarly, little preprocessing steps like ... | ucsort --upper --preprocess='s/(\d+)/sprintf "%#012d", $1/ge' can be really nice, too, and I wouldn’t want to make others’ decisions for them. I’m still building my Unicode toolbox.Sheerlegs

I think you misunderstand Unicode and its relationship to Perl. No matter which way you store data, Unicode, ISO-8859-1, or many other things, your program has to know how to interpret the bytes it gets as input (decoding) and how to represent the information it wants to output (encoding). Get that interpretation wrong and you garble the data. There isn't some magic default setup inside your program that's going to tell the stuff outside your program how to act.

You think it's hard, most likely, because you are used to everything being ASCII. Everything you should have been thinking about was simply ignored by the programming language and all of the things it had to interact with. If everything used nothing but UTF-8 and you had no choice, then UTF-8 would be just as easy. But not everything does use UTF-8. For instance, you don't want your input handle to think that it's getting UTF-8 octets unless it actually is, and you don't want your output handles to be UTF-8 if the thing reading from them can't handle UTF-8. Perl has no way to know those things. That's why you are the programmer.
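
In practice, that per-handle decision looks like this (a sketch; the filenames and encodings are hypothetical):

# Each handle is told what its data actually uses:
open my $in,  '<:encoding(ISO-8859-1)', 'legacy.txt' or die $!;
open my $out, '>:encoding(UTF-8)',      'modern.txt' or die $!;

while (my $line = <$in>) {
    print {$out} $line;   # transparently re-encoded from Latin-1 to UTF-8
}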

I don't think Unicode in Perl 5 is too complicated. I think it's scary and people avoid it. There's a difference. To that end, I've put Unicode in Learning Perl, 6th Edition, and there's a lot of Unicode stuff in Effective Perl Programming. You have to spend the time to learn and understand Unicode and how it works. You're not going to be able to use it effectively otherwise.

Stenopetalous answered 29/5, 2011 at 17:51 Comment(9)
I think you have a point: it is scary. Should it be? For me Unicode is a blessing; using it in Perl 5 is not (I don't assume anything is ASCII; my mother tongue needs at least ISO 8859-4). I installed Rakudo, and everything I tried with UTF-8 (in this limited sandbox) worked out of the box. Did I miss something? I stress it again: it is good to have fine-tuned Unicode support, but most of the time there is no need for that. One way to take the fear away from the topic is for everyone to read a lot and understand the internals. The other: we have a special pragma, so use utf8_everywhere makes people happy. Why not the last one?Sulfatize
I still think you're missing the point. What worked? You don't need to understand the internals. You need to understand the externals, and how you want to handle strings that have different encodings and different representations of the same characters. Read Tom's advice again. Most of what he says, I bet you'll find Rakudo doesn't handle for you.Stenopetalous
Maybe you are right and I miss the point; I don't want to argue. [And I certainly keep rereading Tom's answer.] But Randy Stauner pointed in his answer to the new module utf8::all. Is there something wrong with such a module? Shouldn't we have it (or something similar) in core Perl? From my point of view it makes using UTF-8 so much easier and the code cleaner. No fear at all.Sulfatize
@wk: Read Randy's answer again. He's already told you what the limitations are.Stenopetalous
@brian d foy: I think those limitations are fine; like tchrist says, there is no magic bullet for every aspect (I admit I had not seen most of them before asking this question here). So, when we cover lots of the basic stuff with something like utf8::all, there is no need for everyone to build their own huge boilerplate just to get the basics of UTF-8 handling to work. By "no fear at all" I mean: everyone can start their projects knowing that the basics are covered. Yes, you are right, there are still lots of problems. But when starting is easier, we will have more people involved in solving them. IMHOSulfatize
@wk - the only thing "wrong" with utf8::all or uni::perl is that they are not in the core, so everyone must install them from CPAN. And if you think that's not a big deal, please rethink: yes, it is easier to use UTF-8 with a helper module. Without one, core Perl still has Unicode support, but it is much, much more complicated. And that is wrong.Infection
@jm666: I am really confused; why did you address this comment to me? I'd like to have something like utf8::all in core, but that does not depend on my desire. The whole topic I raised here is one sentence: how to make UTF-8 handling as easy as possible? So your comment is a rephrasing of my whole problem. I don't understand what I should rethink.Sulfatize
Re "you don't want your output handles to be UTF-8 if the thing reading from them can handle UTF-8": Don't you mean "...from them can't handle UTF-8"?Epiglottis
@PeterMortensen yeah, that makes more sense :)Stenopetalous
S
29

While reading this thread, I often get the impression that people are using "UTF-8" as a synonym for "Unicode". Please make a distinction between Unicode's code points, which are an enlarged relative of the ASCII code, and Unicode's various encodings. There are several of those, of which UTF-8, UTF-16, and UTF-32 are the current ones; a few more are obsolete.

Please note: UTF-8 (as well as all other encodings) exists and has meaning in input and output only. Internally, since Perl 5.8.1, all strings are kept as Unicode code points. True, you have to enable some features, as admirably covered previously.
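
A minimal sketch of that distinction, using the euro sign as an arbitrary example character:

    use strict;
    use warnings;
    use Encode qw(encode);

    my $char = "\x{20AC}";    # EURO SIGN: one code point, U+20AC
    printf "code point: U+%04X\n", ord $char;

    # The same code point takes a different number of octets in each encoding.
    printf "UTF-8:    %d octets\n", length encode('UTF-8',    $char);   # 3
    printf "UTF-16BE: %d octets\n", length encode('UTF-16BE', $char);   # 2
    printf "UTF-32BE: %d octets\n", length encode('UTF-32BE', $char);   # 4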

Samos answered 30/5, 2011 at 9:41 Comment(3)
I agree people too often confuse Uɴɪᴄᴏᴅᴇ with UTF-8⧸16⧸32, but it’s fundamentally and critically not true that Uɴɪᴄᴏᴅᴇ is just some enlarged character set relative to ᴀsᴄɪɪ. At most, that’s nothing more than mere ɪsᴏ‑10646. Uɴɪᴄᴏᴅᴇ includes much more: rules for collation, casefolding, normalization forms, grapheme clusters, word- & line-breaking, scripts, numeric equivs, widths, bidirectionality, glyph variants, contextual behavior, locales, regexes, combining classes, 100s of properties, & much more‼Sheerlegs
@tchrist: The first step is to get data into your program and out to the outside world without trashing it. Then you can worry about collation, case folding, glyph variants, etc. Baby steps.Agnusago
I agree, getting Perl not to trash input or output must be the first priority. What I would like is a module or pragma that could embody the following fictitious conversation: "- Dear Perl. For this program, all input and output will be UTF-8 exclusively. Could you please not trash my data? - So only UTF-8, you say. Are you sure? - Yes. - Really, really sure? - Absolutely. - And you accept that I might behave strangely if I'm served non-UTF-8 data? - Yes, fine. - Ok then."Piercy
B
10

There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).
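
One cautious pattern is to keep the UTF-8 layers lexical rather than global, so that old modules which open their own filehandles are left alone; a sketch, with a hypothetical filename:

    use strict;
    use warnings;
    use utf8;                        # this source file itself is UTF-8
    use open qw(:encoding(UTF-8));   # lexical: applies only to handles opened in this scope

    # Handles opened here get the UTF-8 layer; filehandles opened inside
    # legacy CPAN modules are unaffected, unlike with a global -C switch.
    open my $fh, '<', 'notes.txt' or die "Can't open notes.txt: $!";
    while (my $line = <$fh>) {
        print length($line), " characters\n";   # characters, not octets
    }
    close $fh;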

Blowfly answered 28/5, 2011 at 15:19 Comment(6)
I meant using the -C option to make sure Perl is on the same page as I am Unicode-wise, because I keep having it decide to use ISO 8859/1 instead of Unicode even though I am explicitly setting $LANG and $LC_ALL properly. (This may actually reflect bugs in the platform locale libraries.) Whatever it is, it's been highly annoying that I can't use iTivo on programs with accents in them because the Perl scripts that do the work fall over with conversion errors.Blowfly
A lone -C without options is buggy and error-prone. You break the world. Set the PERL5OPT envariable to -C and you will see what I mean. We tried this way back in v5.8, and it was a disaster. You simply cannot and must not tell programs that aren’t expecting it that now they are dealing with Unicode whether they like it or not. There are also security issues. At the very least, anything that does print while <> will break if passed binary data. So too will all database code. This is a terrible idea.Sheerlegs
I was talking generically, actually, not specifically -C without options. The specific invocation I had been working with was -CSDA. That said, I was stuck with 5.8.x for a long time (hello MacPorts...), so maybe that was part of it.Blowfly
I run with PERL_UNICODE set to SA. You CANNOT set it to D.Sheerlegs
@tchrist: Some Perl varmint has been posting code showing the -CSDA and PERL_UNICODE=SDA usage. Please use your influence in the community. He must be stopped!Babe
@Sheerlegs "A lone -C without options is buggy and error-prone." perldoc perlrun is clear about what -C means. Do you advise against using it because it behaves differently in different versions of Perl? I tried setting PERL5OPT as suggested and saw no difference.Stupid
T
2

You should enable the unicode_strings feature, and it is enabled by default if you use v5.14;.
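
A minimal sketch of what the feature changes; with unicode_strings, string operations follow Unicode rules even for strings stored internally as single bytes:

    use v5.14;   # enables strict, say, and the unicode_strings feature

    my $s = "\xE9";                    # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    say sprintf "U+%04X", ord uc $s;   # prints U+00C9 under Unicode semantics;
                                       # legacy byte semantics left it unchanged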

You should not really use Unicode identifiers, especially for foreign code via UTF-8, as they are insecure in Perl 5; only cperl got that right. See e.g. http://perl11.github.io/blog/unicode-identifiers.html

Regarding UTF-8 for your filehandles/streams: you need to decide for yourself the encoding of your external data. A library cannot know that, and since not even libc supports UTF-8, proper UTF-8 data is rare. There is also a lot of WTF-8, the Windows aberration of UTF-8, around.
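
For example, a sketch of declaring that decision per handle; the filenames and encodings here are hypothetical stand-ins for whatever your external sources actually contain:

    use strict;
    use warnings;

    # The encoding is a property of each external data source, so it
    # must be declared per handle; no library can guess it for you.
    open my $cfg, '<:encoding(UTF-8)',  'app.conf'   or die "app.conf: $!";
    open my $log, '<:encoding(cp1252)', 'legacy.log' or die "legacy.log: $!";
    open my $out, '>:encoding(UTF-8)',  'report.txt' or die "report.txt: $!";

    my $first  = <$cfg>;                      # octets decoded to characters on read
    my $legacy = <$log>;                      # cp1252 octets decoded the same way
    print {$out} $first if defined $first;    # characters encoded back to octets on write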

BTW: Moose is not really "Modern Perl"; they just hijacked the name. Moose is perfect Larry Wall-style postmodern Perl mixed with Bjarne Stroustrup-style everything-goes, with an eclectic aberration of proper Perl 6 syntax, e.g., using strings for variable names, a horrible fields syntax, and a very immature, naive implementation that is 10x slower than a proper one.

cperl and Perl 6 are the true modern Perl implementations, where form follows function, and the implementation is reduced and optimized.

Tangential answered 14/5, 2018 at 11:59 Comment(2)
The perl11.org link is broken: "Unable to connect. An error occurred during a connection to perl11.org."Epiglottis
We lost the perl11.org domain. It's now at perl11.github.ioTangential
