Perl: tr/// is not doing what I expect whereas s/// is

Asked 23/10, 2016 at 15:11 Answered 23/10, 2016 at 15:49

I want to remove diacritic signs in some strings. tr/// should do the job but fails (see below). I thought I had an encoding/decoding problem, but I noticed s/// works as I expect. Could somebody explain why?

Here is an example of results I get:

my $str1 = 'èîü';
my $str2 = $str1;
$str1 =~ tr/î/i/;
print "$str1\n"; # => i�iii�
$str2 =~ s/î/i/;
print "$str2\n"; # => èiü

Note that tr/// also modified the first and third characters of the string, not just the middle one.

Edit: I use Ubuntu 16.04 with Mate desktop environment.

Unbreathed answered 23/10, 2016 at 15:11 Comment(0)

When you don't have use utf8; but view the code in a UTF-8 text editor, you're not seeing it the way perl sees it. You think you have a single character in the left half of your s/// and tr///, but because that character is encoded as multiple bytes, perl sees it as multiple characters.

What you think perl sees:

my $str1 = "\xE8\xEE\xFC";
my $str2 = $str1;
$str1 =~ tr/\xEE/i/;
print "$str1\n";
$str2 =~ s/\xEE/i/;
print "$str2\n";

What perl actually sees:

my $str1 = "\xC3\xA8\xC3\xAE\xC3\xBC";
my $str2 = $str1;
$str1 =~ tr/\xC3\xAE/i/;
print "$str1\n";
$str2 =~ s/\xC3\xAE/i/;
print "$str2\n";

With s///, since none of the characters are regexp operators, you're just doing a substring search. You're searching for a multi-character substring. And you find it, because the same thing that happened in your s/// is also happening in your string literals: the characters you think are in there really aren't, but the multi-character sequence is.

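In tr///, on the other hand, multiple characters aren't treated as a sequence; they're treated as a set. Each character (byte) is handled separately wherever it is found. Here the set is the two bytes \xC3 and \xAE, and because the replacement list is shorter than the search list, its last character (i) is reused, so both bytes become i. Every character in your string starts with the byte \xC3, which is why the first and third characters were damaged too. That doesn't get you the results you want, because changing the individual bytes of a UTF-8 string is never what you want.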
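To make the set-versus-sequence difference visible, here is a minimal sketch that spells out the bytes explicitly and prints the results in hex. The helper hex_bytes is a hypothetical name introduced only for this illustration; it is not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): show each byte as hex.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

# The string as perl sees it without "use utf8": the UTF-8 bytes of 'èîü'.
my $bytes = "\xC3\xA8\xC3\xAE\xC3\xBC";

# tr/// treats \xC3 and \xAE as a set; both are mapped to "i".
(my $tr_result = $bytes) =~ tr/\xC3\xAE/i/;

# s/// treats \xC3\xAE as a two-byte sequence and replaces it once.
(my $s_result = $bytes) =~ s/\xC3\xAE/i/;

print "tr: ", hex_bytes($tr_result), "\n";   # 69 A8 69 69 69 BC
print "s : ", hex_bytes($s_result),  "\n";   # C3 A8 69 C3 BC
```

The tr/// output matches the i�iii� from the question: four i bytes plus two orphaned continuation bytes (\xA8 and \xBC) that the terminal cannot display.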

The fact that you can run simple ASCII-oriented substring search that knows nothing about utf8, and get the correct result on a utf8 string, is considered a good backward-compatibility feature of utf8, as opposed to other encodings like ucs2/utf16 or ucs4.
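As a rough illustration of that property, the sketch below runs a plain byte-level index() over the UTF-8 bytes of 'èîü'. The variable names are made up for the example. It also counts the bytes in the ASCII range, since every byte of a multi-byte UTF-8 character is 0x80 or above and so can never be mistaken for an ASCII character.

```perl
use strict;
use warnings;

# UTF-8 bytes of 'èîü' and of 'î', written out explicitly.
my $haystack = "\xC3\xA8\xC3\xAE\xC3\xBC";
my $needle   = "\xC3\xAE";

# A byte-oriented substring search that knows nothing about UTF-8.
my $offset = index($haystack, $needle);

# Count bytes below 0x80 (grep in scalar context returns the count).
my $ascii_bytes = grep { ord($_) < 0x80 } split(//, $haystack);

print "found at byte offset $offset\n";                  # 2
print "ASCII-range bytes in the string: $ascii_bytes\n"; # 0
```

Note that the offset is counted in bytes, not characters: 'î' is the second character but starts at byte offset 2.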

The solution is to tell perl the source is encoded using UTF-8 by adding use utf8;. You'll also need to encode your outputs to match what your terminal expects.

use utf8;                             # The source is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';  # The terminal provides/expects UTF-8.
my $str1 = 'èîü';
my $str2 = $str1;
$str1 =~ tr/î/i/;
print "$str1\n";
$str2 =~ s/î/i/;
print "$str2\n";
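To see what changes once perl works with characters instead of bytes, here is a sketch that does the decoding by hand with the core Encode module, rather than through use utf8 and use open. It assumes the input arrives as raw UTF-8 bytes; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes of 'èîü', as perl sees the literal without "use utf8".
my $bytes    = "\xC3\xA8\xC3\xAE\xC3\xBC";
my $byte_len = length $bytes;

# Decode the bytes into a string of three characters.
my $chars    = decode('UTF-8', $bytes);
my $char_len = length $chars;

# î is now the single character U+00EE, so tr/// touches only that one.
(my $fixed = $chars) =~ tr/\x{EE}/i/;

# Encode back to UTF-8 bytes for output.
my $out = encode('UTF-8', $fixed);

printf "%d bytes, %d characters\n", $byte_len, $char_len;   # 6 bytes, 3 characters
print join(' ', map { sprintf('%02X', ord($_)) } split(//, $out)), "\n";   # C3 A8 69 C3 BC
```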

Pepsin answered 23/10, 2016 at 15:49 Comment(0)

This works as expected for me:

use v5.10;
use utf8;
use open qw/:std :utf8/;

my $str1 = 'èîü';
my $str2 = $str1;
$str1 =~ tr/î/i/;
say $str1; # èiü
$str2 =~ s/î/i/;
say $str2; # èiü

The use utf8 pragma tells perl that the source code, including its string literals, is encoded in UTF-8. The use open pragma with :std switches the standard streams, including STDOUT, to UTF-8.

Guardroom answered 23/10, 2016 at 15:16 Comment(4)

It works for me too, thank you. Any idea why tr seems to need these pragmas, while s does not? – Unbreathed 23/10, 2016 at 15:48

I was just going to say something about character string vs. byte string semantics, but see @Wumpus’s answer, I think it explains the issue much better. – Guardroom 23/10, 2016 at 15:51

@zoul, I'm glad you didn't; this has nothing to do with the two internal storage formats. – Hazan 24/10, 2016 at 16:30

I don’t know about the internal storage, but the way I see it, the bug was caused by the programmer treating the string as a collection of UTF-8 characters and Perl (without the Unicode pragmas) seeing them as ASCII strings – or collection of bytes. That’s what I meant by character vs. byte string semantics. – Guardroom 24/10, 2016 at 17:35
