Consider:
use URI::Escape;
print uri_unescape("%C3%B3");
Output : ó
Decode with this: http://meyerweb.com/eric/tools/dencoder/
Output : ó
This is the expected one.
What Perl library should I use to get the correct output?
If you know that the byte sequence is UTF-8, then use Encode::decode:
use Encode;
use URI::Escape;
my $in = "%C3%B3";
my $text = Encode::decode('utf8', uri_unescape($in));
print length($text); # Should print 1
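As a rough sketch of the whole round trip (assuming the escaped data really is UTF-8, and with variable names chosen only for this example), the steps are: undo the percent-encoding, decode the resulting bytes into characters, and encode again on output.

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_unescape);

my $escaped = "%C3%B3";

# Step 1: undo the percent-encoding. The result is a string of raw bytes.
my $bytes = uri_unescape($escaped);

# Step 2: interpret those bytes as UTF-8 (an assumption about the sender).
# The result is a string of characters.
my $text = decode('UTF-8', $bytes);

# Step 3: encode on the way out, so the terminal receives UTF-8 again.
binmode(STDOUT, ':encoding(UTF-8)');
print length($bytes), " byte(s), ", length($text), " character(s): $text\n";
```

Without step 3, Perl would write the single byte 0xF3 for this character, which a UTF-8 terminal cannot display correctly.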
The two bytes here are 0xC3 and 0xB3. Bytes have no meaning until you assign a meaning to them. If each byte contains one character, you get these weird characters. If these two bytes combined symbolize one character, you get your ó. The URI::Escape module has no idea what meaning to assign to these bytes. This is the task of you, the programmer, or a well-defined protocol (compare the ASCII headers in HTTP requests, where the Content-Type header carries the charset metadata). All Unicode encodings have to be multi-byte encodings, because there is a vast pool of characters. – Woorali
The code Encode::decode('utf8', uri_unescape($in)) doesn't work for me, but the following code works well.
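use URI::Escape qw( uri_unescape );

# Unescape a URI component, then decode it as UTF-8 if the bytes are valid UTF-8.
# The utf8:: functions are always available; no "use utf8" is needed for them.
sub smartdecode {
    my $x = my $y = uri_unescape($_[0]);
    return $x if utf8::decode($x);   # decodes $x in place, true on success
    return $y;                       # not valid UTF-8: return the raw bytes
}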
This code is from http://lwp.interglacial.com/ch05_02.htm
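A small usage sketch may make the fallback clearer. The two inputs below are made up for illustration: one is the UTF-8 form of ó, the other is its single-byte Latin-1 form, which is not valid UTF-8 on its own.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

sub smartdecode {
    my $x = my $y = uri_unescape($_[0]);
    return $x if utf8::decode($x);   # valid UTF-8: return decoded characters
    return $y;                       # otherwise: return the bytes unchanged
}

my $from_utf8   = smartdecode("%C3%B3");   # two bytes forming one UTF-8 sequence
my $from_latin1 = smartdecode("%F3");      # a lone 0xF3 byte, invalid as UTF-8

printf "%d character(s), code point U+%04X\n", length($from_utf8),   ord($from_utf8);
printf "%d character(s), code point U+%04X\n", length($from_latin1), ord($from_latin1);
```

Both calls end up with a one-character string holding code point 243. The trade-off of this heuristic is that Latin-1 input which happens to form valid UTF-8 will be decoded as UTF-8 anyway.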
To summarize the problem:
"%C3%B3" (the input)
ó (the expected output)
ó (the actual output)
Okay, so, let's analyze:
"%C3%B3": This is not UTF-8 encoding per se. This is URI/URL encoding (percent-encoding), i.e., like when a space is replaced with %20, so you may have seen URLs like example.com?file=That%20Thing%20I%20Sent%20You.
ó: This is what we want to decode. With URL encoding, its UTF-8 bytes C3 B3 are written as %C3%B3. Feel free to check by inputting it here: https://www.urlencoder.org/ ; or check its spec here: https://www.fileformat.info/info/unicode/char/00f3/index.htm
ó: What is this corruption? Ã is the Unicode character U+00C3 and ³ is the Unicode character U+00B3. (Source: in the links.) They show up when each of the two bytes is treated as a character of its own.
Just unescape your string with uri_unescape...
use URI::Escape;
my $string = "%C3%B3";
print(uri_unescape($string));
As you can tell from above, the problem is not from UTF-8 encodings, but from URI encodings.
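To see where the two stray characters in the analysis above come from, here is a minimal sketch that reads the same two bytes under two different assumptions. The helper code_points() is made up for this example and is not part of any module.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\xB3";   # the two bytes that "%C3%B3" stands for

# Reading each byte as its own character (ISO-8859-1) yields two characters.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Reading the pair as one UTF-8 sequence yields a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Hypothetical helper: list the code points of a string.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

print code_points($as_latin1), "\n";   # U+00C3 U+00B3
print code_points($as_utf8),   "\n";   # U+00F3
```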
To write a Unicode character in a string, simply use "\N{U+1234}", with 1234 being the character's hex code point.
print ("\N{U+263A}"); # print a smiley face
You'll notice that chr(243) (which is ó) normally gives �, which is also what "\N{U+00F3}" gives. What's the deal? (Proof: IDEOne demo.) This is explained in a note in the Perl docs:
Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
How to fix? Easy: declare that your source code is UTF-8 (use utf8) and put a UTF-8 encoding layer on the standard handles (use open), like so...
use utf8;
use open qw( :std :encoding(UTF-8) );
print ("\N{U+00F3}");
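The effect of the encoding layer can be checked without a terminal. The sketch below writes the same character to two in-memory filehandles, one plain and one with a UTF-8 layer, and lists the bytes that arrive. The names $plain, $layered and hex_bytes() are chosen only for this illustration.

```perl
use strict;
use warnings;

my $char = chr(243);   # the character ó, code point U+00F3

# Without an encoding layer, the character goes out as a single byte.
my $plain = '';
open(my $plain_fh, '>', \$plain) or die "open failed: $!";
print {$plain_fh} $char;
close($plain_fh);

# With a UTF-8 layer, the same character is encoded on output.
my $layered = '';
open(my $layered_fh, '>:encoding(UTF-8)', \$layered) or die "open failed: $!";
print {$layered_fh} $char;
close($layered_fh);

# Hypothetical helper: show a byte string as hex pairs.
sub hex_bytes {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

print "plain:   ", hex_bytes($plain),   "\n";
print "layered: ", hex_bytes($layered), "\n";
```

The single byte from the plain handle is not valid UTF-8 by itself, which is why a UTF-8 terminal shows the replacement character for it.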