Decode UTF-8 URL in Perl

Asked 31/10, 2012 at 17:55 Answered 28/11, 2021 at 21:9

Consider:

use URI::Escape;
print uri_unescape("%C3%B3");

Output : Ã³

Decoding the same string with http://meyerweb.com/eric/tools/dencoder/ gives:

Output : ó

This is the expected one.

What Perl library should I use to get the correct output?

Carltoncarly answered 31/10, 2012 at 17:55 Comment(0)

If you know that the byte sequence is UTF-8, then use Encode::decode:

use Encode;
use URI::Escape;

my $in = "%C3%B3";
my $text = Encode::decode('utf8', uri_unescape($in));

print length($text);    # Should print 1

Toinette answered 31/10, 2012 at 18:6 Comment(3)

Hi, thanks for the reply. I am having a hard time grasping this: what is a byte sequence, and why do I need to apply that decode function? I mean, why is uri_unescape not enough? – Carltoncarly 31/10, 2012 at 18:13

@Carltoncarly The URI contains escaped bytes: 0xC3 and 0xB3. Bytes have no meaning until you assign a meaning to them. If each byte is treated as one character, you get these weird characters. If the two bytes combined represent one character, you get your ó. The URI::Escape module has no idea what meaning to assign to these bytes. That is the task of you, the programmer, or of a well-defined protocol (compare the ASCII headers in HTTP requests, where the Content-Type header carries the charset). All Unicode encodings have to be multi-byte encodings, because there is a vast pool of characters. – Woorali 31/10, 2012 at 19:3

And that's why I prefaced my answer with "If you know that the byte sequence is UTF-8, ...". It is also possible that those bytes are part of a UTF-16 stream, in which case you need to decode with 'utf-16' instead of 'utf-8'. To be sure, you need to ask the person who created those bytes how they should be interpreted. – Toinette 31/10, 2012 at 21:4
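The point made in these comments, that the same bytes mean different things under different encodings, can be checked directly. The following is only a small sketch; the variable names are made up for illustration, and ISO-8859-1 is used simply as an example of a one-byte-per-character interpretation.

```perl
use strict;
use warnings;
use Encode qw( decode );
use URI::Escape qw( uri_unescape );

# uri_unescape only turns the %XX escapes back into raw bytes.
my $bytes = uri_unescape("%C3%B3");
my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "%d bytes: %s\n", length($bytes), $hex;

# The same two bytes under two different interpretations.
my $as_latin1 = decode('ISO-8859-1', $bytes);   # one character per byte
my $as_utf8   = decode('UTF-8',      $bytes);   # both bytes form one character

printf "as Latin-1: %d characters\n", length($as_latin1);
printf "as UTF-8:   %d character(s), code point U+%04X\n",
    length($as_utf8), ord($as_utf8);
```

Read as Latin-1, the bytes give two characters (the Ã³ from the question); read as UTF-8, they give the single character U+00F3.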

The code Encode::decode('utf8', uri_unescape($in)) doesn't work for me, but the following code works well.

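use URI::Escape qw( uri_unescape );

# Unescape the string, then try to decode the bytes as UTF-8 in place.
# utf8::decode is always available (no "use utf8" needed) and returns
# false if the bytes are not valid UTF-8; then the raw bytes are returned.
sub smartdecode {
    my $x = my $y = uri_unescape($_[0]);
    return $x if utf8::decode($x);
    return $y;
}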

This code is from http://lwp.interglacial.com/ch05_02.htm
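To see what the fallback in smartdecode does, here is a short usage sketch. The sub is repeated so the example runs on its own, and the input "%F3" is an assumed example of non-UTF-8 data (a single Latin-1 byte), not something from the question.

```perl
use strict;
use warnings;
use URI::Escape qw( uri_unescape );

# Same fallback idea as above, repeated so this snippet is self-contained.
sub smartdecode {
    my $x = my $y = uri_unescape($_[0]);
    return $x if utf8::decode($x);   # valid UTF-8: return decoded characters
    return $y;                       # otherwise: return the raw bytes untouched
}

my $valid   = smartdecode("%C3%B3");  # well-formed UTF-8 for U+00F3
my $invalid = smartdecode("%F3");     # a lone 0xF3 byte is not well-formed UTF-8

printf "valid:   length %d, value 0x%02X\n", length($valid),   ord($valid);
printf "invalid: length %d, value 0x%02X\n", length($invalid), ord($invalid);
```

Both calls end up with a one-element string holding 0xF3: the first because the UTF-8 bytes were decoded, the second because the undecodable byte was passed through unchanged. Whether that pass-through is desirable depends on where the data comes from.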

Sizemore answered 4/1, 2014 at 17:58 Comment(0)

The Problem

To summarize the problem:

Input: "%C3%B3"
Expected Output: ó
Actual Output: Ã³

So, What Are These Data Formats?

Okay, so, let's analyze:

"%C3%B3": This is not UTF-8 encoding per se. This is URI/URL encoding (percent-encoding), as when a space is replaced with %20, so you may have seen URLs like example.com?file=That%20Thing%20I%20Sent%20You.
ó: This is what we want to decode to. With URL encoding, it is encoded as %C3%B3, i.e. its two UTF-8 bytes 0xC3 and 0xB3, each percent-escaped. Feel free to check by inputting it here: https://www.urlencoder.org/ ; or check its spec here: https://www.fileformat.info/info/unicode/char/00f3/index.htm
Ã³: What is this corruption? Ã is the Unicode character U+00C3 and ³ is the Unicode character U+00B3, which is what you see when the bytes 0xC3 and 0xB3 are each displayed as a separate character. (Source: In the links.)
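The relationship between ó and %C3%B3 described in the list above can be reproduced in the other direction. This is a minimal sketch with illustrative variable names: it encodes the character to UTF-8 bytes and then percent-escapes each byte, and also shows uri_escape_utf8 from URI::Escape, which does both steps in one call.

```perl
use strict;
use warnings;
use Encode qw( encode );
use URI::Escape qw( uri_escape uri_escape_utf8 );

my $char = "\x{F3}";                        # the character ó, U+00F3

my $utf8_bytes = encode('UTF-8', $char);    # two bytes: 0xC3 0xB3
my $escaped    = uri_escape($utf8_bytes);   # each byte becomes %XX

# uri_escape_utf8 encodes to UTF-8 and escapes in a single step.
my $escaped_direct = uri_escape_utf8($char);

print "$escaped\n";
print "$escaped_direct\n";
```

Both lines print %C3%B3, the input from the question.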

URI-Encoding - TLDR

Just unescape your string with uri_unescape...

use URI::Escape;

my $string = "%C3%B3";
print(uri_unescape($string));

Full Working Demo

No, you don't need a Package to use UTF-8 in Perl. Even in Perl5.

As you can tell from above, the problem is not from UTF-8 encodings, but from URI encodings.

To write a Unicode character in a string literal, simply use "\N{U+1234}", with 1234 being the character's code point in hex.

print ("\N{U+263A}");    # print a smiley face

Full Working Demo Online

Handling Latin-1 Extension Edge Cases

You'll notice that chr(243) (which is ó) normally prints as �, and so does "\N{U+00F3}". What's the deal? (Proof: IDEOne Demo.) This is explained in a note in the Perl docs:

Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.

How to fix? Easy: just declare that your source code is UTF-8 and that the standard streams should be encoded as UTF-8, like so...

use utf8;
use open qw( :std :encoding(UTF-8) );

print ("\N{U+00F3}");

Full Working Demo
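For the string from the question, the pieces from this thread can be combined roughly as follows. This is a sketch under the assumption that the percent-encoded input really is UTF-8; it reuses the `use open` line above for output and Encode::decode from the earlier answer for input.

```perl
use strict;
use warnings;
use open qw( :std :encoding(UTF-8) );   # encode output on the standard streams as UTF-8
use Encode qw( decode );
use URI::Escape qw( uri_unescape );

my $bytes = uri_unescape("%C3%B3");     # raw bytes 0xC3 0xB3
my $text  = decode('UTF-8', $bytes);    # assumed to be UTF-8: one character

print "$text\n";
```

On a terminal that expects UTF-8, this should display ó rather than Ã³, because the bytes are turned into a character once on input and encoded once on output.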

Purposely answered 28/11, 2021 at 21:9 Comment(0)
