When you don't have `use utf8;` but you are viewing the code with a UTF-8 text editor, you're not seeing it the way perl sees it. You think you have a single character in the left half of your `s///` and `tr///`, but because it's encoded as multiple bytes, perl sees it as multiple characters.
What you think perl sees:
my $str1 = "\xE8\xEE\xFC";
my $str2 = $str1;
$str1 =~ tr/\xEE/i/;
print "$str1\n";
$str2 =~ s/\xEE/i/;
print "$str2\n";
What perl actually sees:
my $str1 = "\xC3\xA8\xC3\xAE\xC3\xBC";
my $str2 = $str1;
$str1 =~ tr/\xC3\xAE/i/;
print "$str1\n";
$str2 =~ s/\xC3\xAE/i/;
print "$str2\n";
With `s///`, since none of the characters are regexp operators, you're just doing a substring search. You're searching for a multi-character substring. And you find it, because the same thing that happened in your `s///` is also happening in your string literals: the characters you think are in there really aren't, but the multi-character sequence is.
In `tr///`, on the other hand, multiple characters aren't treated as a sequence; they're treated as a set. Each character (byte) is handled separately when it is found. And that doesn't get you the results you want, because changing the individual bytes of a UTF-8 string is never what you want.
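To make the difference between the two operators concrete, here is a minimal sketch that runs both on the raw UTF-8 bytes and dumps the results in hex. The helper name `hex_bytes` is made up for this illustration; the strings are written as escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;

# The UTF-8 bytes of "èîü", written as escapes so the example
# does not depend on the encoding of this source file.
my $bytes = "\xC3\xA8\xC3\xAE\xC3\xBC";

# s/// matches the two-byte sequence C3 AE as a substring.
(my $subst = $bytes) =~ s/\xC3\xAE/i/;

# tr/// treats C3 and AE as a set of two separate characters.
# The replacement list is shorter than the search list, so its
# last character ("i") is reused for both.
(my $trans = $bytes) =~ tr/\xC3\xAE/i/;

# Hypothetical helper: render a string as space-separated hex bytes.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

print hex_bytes($subst), "\n";  # C3 A8 69 C3 BC
print hex_bytes($trans), "\n";  # 69 A8 69 69 69 BC
```

The `s///` result is still valid UTF-8 ("èiü"), while the `tr///` result has replaced every C3 lead byte and is no longer valid UTF-8.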
The fact that a simple ASCII-oriented substring search that knows nothing about UTF-8 still gets the correct result on a UTF-8 string is considered a good backward-compatibility feature of UTF-8, as opposed to other encodings like UCS-2/UTF-16 or UCS-4.
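As a rough illustration of that compatibility property, the sketch below runs the same byte-oriented search against one character in both encodings. The character U+4142 is picked only because its UTF-16BE bytes happen to be 0x41 0x42, the same as ASCII "AB"; it is an assumption of this example, not something from the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+4142 is a CJK ideograph chosen for illustration: in UTF-16BE its two
# bytes are 0x41 0x42, the same bytes as the ASCII string "AB".
my $char = "\x{4142}";

my $utf8  = encode('UTF-8',    $char);  # bytes E4 85 82
my $utf16 = encode('UTF-16BE', $char);  # bytes 41 42

# A byte-oriented search for ASCII "AB" in each encoding.
my $pos_utf8  = index($utf8,  'AB');
my $pos_utf16 = index($utf16, 'AB');

print "UTF-8:    $pos_utf8\n";   # -1 (no false match)
print "UTF-16BE: $pos_utf16\n";  # 0  (false match inside one character)
```

In UTF-8, every byte of a multi-byte character has its high bit set, so an ASCII search can never match in the middle of one.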
The solution is to tell perl the source is encoded using UTF-8 by adding `use utf8;`. You'll also need to encode your output to match what your terminal expects.
use utf8; # The source is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)'; # The terminal provides/expects UTF-8.
my $str1 = 'èîü';
my $str2 = $str1;
$str1 =~ tr/î/i/;
print "$str1\n";
$str2 =~ s/î/i/;
print "$str2\n";
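The effect of decoding can also be seen without relying on the file's encoding. The following is only a sketch: it uses `decode` from the core `Encode` module to stand in for what `use utf8;` does to string literals, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of "èîü", as they sit in a source file without `use utf8;`.
my $bytes   = "\xC3\xA8\xC3\xAE\xC3\xBC";
my $n_bytes = length $bytes;

# Decoding turns the six bytes into three characters, which is roughly
# what `use utf8;` does for string literals in the source.
my $chars   = decode('UTF-8', $bytes);
my $n_chars = length $chars;

print "$n_bytes bytes, $n_chars characters\n";  # 6 bytes, 3 characters

# On the decoded string, î is the single character U+00EE, so tr/// works.
(my $fixed = $chars) =~ tr/\x{EE}/i/;

# Encode again before comparing against (or printing as) raw bytes.
my $out = encode('UTF-8', $fixed);
print $out eq "\xC3\xA8i\xC3\xBC" ? "ok\n" : "not ok\n";  # ok
```

Once the string holds characters rather than bytes, `tr///` and `s///` agree, because the set and the sequence both contain exactly one character.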
`tr` seems to need these pragmas, while `s` does not? – Unbreathed