How to escape Chinese Unicode characters in URL?

I have Chinese users of my PHP web application who enter products into our system. The information they're entering is, for example, a product title and price.

We would like to use the product title to generate a nice URL slug for those products. It seems we cannot just use Chinese characters in HREF attributes.

Does anyone know how to handle a title like “婴儿服饰” so that we can generate a clean URL like http://www.site.com/婴儿服饰?

Everything works fine for “normal” languages, but high UTF‐8 languages give us problems.

Also, when generating the clean URL, we want to keep SEO in mind, but I have no experience with Chinese in that matter.

Excaudate answered 27/5, 2011 at 13:0 Comment(1)
What the heck is a "high utf8" language, anyway??? I have no idea what the UTF-16 tag is there for, but you should have used the PHP tag if you weren’t looking for a general answer. Also, you have shown us no code at all, so it is impossible to say what is wrong. - Proserpina

If your string is already UTF-8, just use rawurlencode to encode the string properly:

$path = '婴儿服饰';
$url = 'http://example.com/'.rawurlencode($path);

UTF-8 is the preferred character encoding for non-ASCII characters in URIs. Only ASCII characters are allowed in a URI itself, which is why the UTF-8 bytes have to be percent-encoded. The result is the same as in tchrist’s example:

http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
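
For the slug use case in the question, the same call can be wrapped in a small helper. This is only a minimal sketch: the function name `titleToPathSegment` and the whitespace-to-hyphen rule are assumptions for illustration, not part of the answer above, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: turn a UTF-8 product title into a percent-encoded path segment.
function titleToPathSegment(string $title): string
{
    // Collapse runs of whitespace into single hyphens; the /u flag treats the input as UTF-8.
    // preg_replace() returns null on invalid UTF-8, so fall back to an empty string in that case.
    $slug = preg_replace('/\s+/u', '-', trim($title)) ?? '';

    // Percent-encode every byte except unreserved ASCII characters (letters, digits, - _ . ~).
    return rawurlencode($slug);
}

$segment = titleToPathSegment('婴儿服饰');
echo 'http://example.com/' . $segment, "\n";
// prints http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0

// On the receiving side, rawurldecode() recovers the original title.
echo rawurldecode($segment), "\n";
```

Most web servers and frameworks decode the path before routing, so the explicit `rawurldecode` call may only be needed when reading the raw request URI.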
Statolith answered 27/5, 2011 at 13:53 Comment(0)

This code, which uses the CPAN module URI::Escape:

#!/usr/bin/env perl

use v5.10;
use utf8;

use URI::Escape qw(uri_escape_utf8);

my $url  = "http://www.site.com/";
my $path = "婴儿服饰";

say $url, uri_escape_utf8($path);

when run, prints:

http://www.site.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0

Is that what you're looking for?

BTW, those four characters are:

CJK UNIFIED IDEOGRAPH-5A74
CJK UNIFIED IDEOGRAPH-513F
CJK UNIFIED IDEOGRAPH-670D
CJK UNIFIED IDEOGRAPH-9970

Which, according to the Unicode::Unihan database, seems to be yīng ér fú shì, or perhaps just ying er fú shi per Lingua::ZH::Romanize::Pinyin. And maybe even jing¹ jan⁴ fuk⁶ sik¹ or jing˥ jan˨˩ fuk˨ sik˥, using the Cantonese version from Unicode::Unihan.

Proserpina answered 27/5, 2011 at 13:16 Comment(0)

Use the percent-encoded URL as the href attribute of the link, and keep the original characters as the link text.

That way the URL is valid and the page still shows the readable Chinese title, which keeps it SEO-friendly.

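// Convert a URL such as "http://example.com/婴儿服饰" into a valid percent-encoded string:
// => http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0
// Key idea: a multibyte character occupies more than one byte in UTF-8, so only
// those characters are encoded; single-byte (ASCII) characters pass through unchanged.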
function autoEncodeMultibyteChars($url) {
    $encoding   = 'UTF-8';
    $mbLen      = mb_strlen($url, $encoding);
    $append     = '';
    for ($idx = 0; $idx < $mbLen; $idx++) {
        $char   = mb_substr($url, $idx, 1, $encoding);
        if (strlen($char) > 1) {    // multibyte char
            $append     .= rawurlencode($char);
        } else {
            $append     .= $char;
        }
    }
    return  $append;
}
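
The first point of this answer (encoded href, original characters as link text) could be sketched as follows. The helper name `productLink` is hypothetical, and the sketch assumes the title is a single path segment in valid UTF-8.

```php
<?php
// Hypothetical helper: build an anchor whose href is percent-encoded
// while the visible link text keeps the original characters.
function productLink(string $baseUrl, string $title): string
{
    // Encode only the title segment; the base URL is assumed to be valid already.
    $href = rtrim($baseUrl, '/') . '/' . rawurlencode($title);

    // htmlspecialchars() escapes &, <, > and quotes for HTML output;
    // it leaves the Chinese characters in the link text untouched.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
         . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</a>';
}

echo productLink('http://example.com/', '婴儿服饰'), "\n";
// prints <a href="http://example.com/%E5%A9%B4%E5%84%BF%E6%9C%8D%E9%A5%B0">婴儿服饰</a>
```

Unlike `autoEncodeMultibyteChars`, this encodes the whole title with `rawurlencode`, so spaces and reserved ASCII characters in the title are percent-encoded as well.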
Scudo answered 31/5, 2022 at 3:30 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. - Systemize