urlencode vs rawurlencode?
Asked Answered
R

11

423

If I want to create a URL using a variable I have two choices to encode the string. urlencode() and rawurlencode().

What exactly are the differences and which is preferred?

Remsen answered 15/6, 2009 at 13:33 Comment(2)
I'd really like to see some reasons for choosing one over the other (e.g. problems that might be encountered with one or the other), I (and I expect others) want to be able to just pick one and use it forever with the least fuss, so I've started a bounty on this question.Floodgate
@Tchalvak: If you want to choose just one, choose rawurlencode. You'll seldom run into a system that chokes when given spaces encoded as %20, while systems that choke on spaces encoded as + are more common.Dungeon
D
352

It will depend on your purpose. If interoperability with other systems is important then it seems rawurlencode is the way to go. The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode).

rawurlencode follows RFC 1738 prior to PHP 5.3.0 and RFC 3986 afterwards (see https://www.php.net/manual/en/function.rawurlencode.php)

Returns a string in which all non-alphanumeric characters except -_.~ have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 3986 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).

Note on RFC 3986 vs 1738. rawurlencode prior to php 5.3 encoded the tilde character (~) according to RFC 1738. As of PHP 5.3, however, rawurlencode follows RFC 3986 which does not require encoding tilde characters.

urlencode encodes spaces as plus signs (not as %20 as done in rawurlencode)(see https://www.php.net/manual/en/function.urlencode.php)

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 3986 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

This corresponds to the definition for application/x-www-form-urlencoded in RFC 1866.

Additional Reading:

You may also want to see the discussion at http://bytes.com/groups/php/5624-urlencode-vs-rawurlencode.

Also, RFC 2396 is worth a look. RFC 2396 defines valid URI syntax. The main part we're interested in is from 3.4 Query Component:

Within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved.

As you can see, the + is a reserved character in the query string and thus would need to be encoded as per RFC 3986 (as in rawurlencode).

Dehydrogenate answered 15/6, 2009 at 13:38 Comment(7)
rawurlencode. go with the standard in this case. urlencode is only kept for legacy useDehydrogenate
Great thanks, thats what i thought, i just wanted a second opinion before i start updating lots of code.Remsen
it also seems I was incorrect in my initial analysis that urlencode was the legacy option. see my edit for more infoDehydrogenate
I think it's rawurlencode that does not encode spaces as plus signs but as %20sAloft
@Jonathan Fingland Hi Jonathan, I just noticed that this answer is high in a search for urlencode on google. Certainly it's technically correct, but it's kinda hard to read, do you think you'd be willing to reformat it for clarity, to make it a more useful resource for php programmers coming to this page? I'd also be willing to reformat it for clarity myself if you give me the goahead.Floodgate
In my opinion is the answer vague: "spaces encoded as + instead of %20 (in which case you need urlencode)", and further on: "urlencode does not encode spaces as plus signs."Dip
@Pindatjuh: The part you quoted The one exception is legacy systems which expect the query string to follow form-encoding style of spaces encoded as + instead of %20 (in which case you need urlencode) means that while rawurlencode is right for most situation, some systems expect spaces to be encoded as a + (plus sign). For such systems, urlencode is the better choice.Dehydrogenate
T
226

Proof is in the source code of PHP.

I'll take you through a quick process of how to find out this sort of thing on your own in the future any time you want. Bear with me, there'll be a lot of C source code you can skim over (I explain it). If you want to brush up on some C, a good place to start is our SO wiki.

Download the source (or use https://heap.space/ to browse it online), grep all the files for the function name, you'll find something such as this:

PHP 5.3.6 (most recent at time of writing) describes the two functions in their native C code in the file url.c.

RawUrlEncode()

PHP_FUNCTION(rawurlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }

    out_str = php_raw_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}

UrlEncode()

PHP_FUNCTION(urlencode)
{
    char *in_str, *out_str;
    int in_str_len, out_str_len;

    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &in_str,
                              &in_str_len) == FAILURE) {
        return;
    }

    out_str = php_url_encode(in_str, in_str_len, &out_str_len);
    RETURN_STRINGL(out_str, out_str_len, 0);
}

Okay, so what's different here?

They both are in essence calling two different internal functions respectively: php_raw_url_encode and php_url_encode

So go look for those functions!

Lets look at php_raw_url_encode

PHPAPI char *php_raw_url_encode(char const *s, int len, int *new_length)
{
    register int x, y;
    unsigned char *str;

    str = (unsigned char *) safe_emalloc(3, len, 1);
    for (x = 0, y = 0; len--; x++, y++) {
        str[y] = (unsigned char) s[x];
#ifndef CHARSET_EBCDIC
        if ((str[y] < '0' && str[y] != '-' && str[y] != '.') ||
            (str[y] < 'A' && str[y] > '9') ||
            (str[y] > 'Z' && str[y] < 'a' && str[y] != '_') ||
            (str[y] > 'z' && str[y] != '~')) {
            str[y++] = '%';
            str[y++] = hexchars[(unsigned char) s[x] >> 4];
            str[y] = hexchars[(unsigned char) s[x] & 15];
#else /*CHARSET_EBCDIC*/
        if (!isalnum(str[y]) && strchr("_-.~", str[y]) != NULL) {
            str[y++] = '%';
            str[y++] = hexchars[os_toascii[(unsigned char) s[x]] >> 4];
            str[y] = hexchars[os_toascii[(unsigned char) s[x]] & 15];
#endif /*CHARSET_EBCDIC*/
        }
    }
    str[y] = '\0';
    if (new_length) {
        *new_length = y;
    }
    return ((char *) str);
}

And of course, php_url_encode:

PHPAPI char *php_url_encode(char const *s, int len, int *new_length)
{
    register unsigned char c;
    unsigned char *to, *start;
    unsigned char const *from, *end;
    
    from = (unsigned char *)s;
    end = (unsigned char *)s + len;
    start = to = (unsigned char *) safe_emalloc(3, len, 1);

    while (from < end) {
        c = *from++;

        if (c == ' ') {
            *to++ = '+';
#ifndef CHARSET_EBCDIC
        } else if ((c < '0' && c != '-' && c != '.') ||
                   (c < 'A' && c > '9') ||
                   (c > 'Z' && c < 'a' && c != '_') ||
                   (c > 'z')) {
            to[0] = '%';
            to[1] = hexchars[c >> 4];
            to[2] = hexchars[c & 15];
            to += 3;
#else /*CHARSET_EBCDIC*/
        } else if (!isalnum(c) && strchr("_-.", c) == NULL) {
            /* Allow only alphanumeric chars and '_', '-', '.'; escape the rest */
            to[0] = '%';
            to[1] = hexchars[os_toascii[c] >> 4];
            to[2] = hexchars[os_toascii[c] & 15];
            to += 3;
#endif /*CHARSET_EBCDIC*/
        } else {
            *to++ = c;
        }
    }
    *to = 0;
    if (new_length) {
        *new_length = to - start;
    }
    return (char *) start;
}

One quick bit of knowledge before I move forward, EBCDIC is another character set, similar to ASCII, but a total competitor. PHP attempts to deal with both. But basically, this means byte EBCDIC 0x4c byte isn't the L in ASCII, it's actually a <. I'm sure you see the confusion here.

Both of these functions manage EBCDIC if the web server has defined it.

Also, they both use an array of chars (think string type) hexchars look-up to get some values, the array is described as such:

/* rfc1738:

   ...The characters ";",
   "/", "?", ":", "@", "=" and "&" are the characters which may be
   reserved for special meaning within a scheme...

   ...Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
   reserved characters used for their reserved purposes may be used
   unencoded within a URL...

   For added safety, we only leave -_. unencoded.
 */

static unsigned char hexchars[] = "0123456789ABCDEF";

Beyond that, the functions are really different, and I'm going to explain them in ASCII and EBCDIC.

Differences in ASCII:

URLENCODE:

  • Calculates a start/end length of the input string, allocates memory
  • Walks through a while-loop, increments until we reach the end of the string
  • Grabs the present character
  • If the character is equal to ASCII Char 0x20 (ie, a "space"), add a + sign to the output string.
  • If it's not a space, and it's also not alphanumeric (isalnum(c)), and also isn't and _, -, or . character, then we , output a % sign to array position 0, do an array look up to the hexchars array for a lookup for os_toascii array (an array from Apache that translates char to hex code) for the key of c (the present character), we then bitwise shift right by 4, assign that value to the character 1, and to position 2 we assign the same lookup, except we preform a logical and to see if the value is 15 (0xF), and return a 1 in that case, or a 0 otherwise. At the end, you'll end up with something encoded.
  • If it ends up it's not a space, it's alphanumeric or one of the _-. chars, it outputs exactly what it is.

RAWURLENCODE:

  • Allocates memory for the string
  • Iterates over it based on length provided in function call (not calculated in function as with URLENCODE).

Note: Many programmers have probably never seen a for loop iterate this way, it's somewhat hackish and not the standard convention used with most for-loops, pay attention, it assigns x and y, checks for exit on len reaching 0, and increments both x and y. I know, it's not what you'd expect, but it's valid code.

  • Assigns the present character to a matching character position in str.
  • It checks if the present character is alphanumeric, or one of the _-. chars, and if it isn't, we do almost the same assignment as with URLENCODE where it preforms lookups, however, we increment differently, using y++ rather than to[1], this is because the strings are being built in different ways, but reach the same goal at the end anyway.
  • When the loop's done and the length's gone, It actually terminates the string, assigning the \0 byte.
  • It returns the encoded string.

Differences:

  • UrlEncode checks for space, assigns a + sign, RawURLEncode does not.
  • UrlEncode does not assign a \0 byte to the string, RawUrlEncode does (this may be a moot point)
  • They iterate differntly, one may be prone to overflow with malformed strings, I'm merely suggesting this and I haven't actually investigated.

They basically iterate differently, one assigns a + sign in the event of ASCII 20.

Differences in EBCDIC:

URLENCODE:

  • Same iteration setup as with ASCII
  • Still translating the "space" character to a + sign. Note-- I think this needs to be compiled in EBCDIC or you'll end up with a bug? Can someone edit and confirm this?
  • It checks if the present char is a char before 0, with the exception of being a . or -, OR less than A but greater than char 9, OR greater than Z and less than a but not a _. OR greater than z (yeah, EBCDIC is kinda messed up to work with). If it matches any of those, do a similar lookup as found in the ASCII version (it just doesn't require a lookup in os_toascii).

RAWURLENCODE:

  • Same iteration setup as with ASCII
  • Same check as described in the EBCDIC version of URL Encode, with the exception that if it's greater than z, it excludes ~ from the URL encode.
  • Same assignment as the ASCII RawUrlEncode
  • Still appending the \0 byte to the string before return.

Grand Summary

  • Both use the same hexchars lookup table
  • URIEncode doesn't terminate a string with \0, raw does.
  • If you're working in EBCDIC I'd suggest using RawUrlEncode, as it manages the ~ that UrlEncode does not (this is a reported issue). It's worth noting that ASCII and EBCDIC 0x20 are both spaces.
  • They iterate differently, one may be faster, one may be prone to memory or string based exploits.
  • URIEncode makes a space into +, RawUrlEncode makes a space into %20 via array lookups.

Disclaimer: I haven't touched C in years, and I haven't looked at EBCDIC in a really really long time. If I'm wrong somewhere, let me know.

Suggested implementations

Based on all of this, rawurlencode is the way to go most of the time. As you see in Jonathan Fingland's answer, stick with it in most cases. It deals with the modern scheme for URI components, where as urlencode does things the old school way, where + meant "space."

If you're trying to convert between the old format and new formats, be sure that your code doesn't goof up and turn something that's a decoded + sign into a space by accidentally double-encoding, or similar "oops" scenarios around this space/20%/+ issue.

If you're working on an older system with older software that doesn't prefer the new format, stick with urlencode, however, I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred. Give it a shot if you're up for playing around, let us know how it worked out for you.

Basically, you should stick with raw, unless your EBCDIC system really hates you. Most programmers will never run into EBCDIC on any system made after the year 2000, maybe even 1990 (that's pushing, but still likely in my opinion).

Thermionics answered 9/8, 2011 at 14:57 Comment(3)
I've never had to worry about double encoding after all I should know what I've encoded since it's me doing the encoding I would think. Since I decode everything I receive with the a compatibility mode which knows how to treat + for space I have equally never come across the problems you try to warn about here. I can understand looking at the source if we don't know what something does, but what exactly did we learn here that we didn't know already from simply executing both functions. I know I am biased but I can't help but think this went way overboard. Kudos on the effort though! =)Toilet
+1, for this part: "I believe %20 will actually be backwards compatible, as under the old standard %20 worked, just wasn't preferred"Vicar
"UrlEncode does not assign a \0 byte to the string" This is incorrect. It simply does it differently. See *to = 0;. That can be read as assign the value zero to the location to which to points. And at that time, to is pointing to the place where the null byte should be. Also, 0 and '\0' are equal, just different ways of saying the same thing.Vastitude
S
39
echo rawurlencode('http://www.google.com/index.html?id=asd asd');

yields

http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd%20asd

while

echo urlencode('http://www.google.com/index.html?id=asd asd');

yields

http%3A%2F%2Fwww.google.com%2Findex.html%3Fid%3Dasd+asd

The difference being the asd%20asd vs asd+asd

urlencode differs from RFC 1738 by encoding spaces as + instead of %20

Scripture answered 15/6, 2009 at 13:44 Comment(0)
K
29

One practical reason to choose one over the other is if you're going to use the result in another environment, for example JavaScript.

In PHP urlencode('test 1') returns 'test+1' while rawurlencode('test 1') returns 'test%201' as result.

But if you need to "decode" this in JavaScript using decodeURI() function then decodeURI("test+1") will give you "test+1" while decodeURI("test%201") will give you "test 1" as result.

In other words the space (" ") encoded by urlencode to plus ("+") in PHP will not be properly decoded by decodeURI in JavaScript.

In such cases the rawurlencode PHP function should be used.

Klan answered 21/12, 2011 at 14:4 Comment(1)
It is a nice example, though I prefer json_encode and JSON.parse for that purpose.Erastianism
N
22

I believe spaces must be encoded as:

  • %20 when used inside URL path component
  • + when used inside URL query string component or form data (see 17.13.4 Form content types)

The following example shows the correct use of rawurlencode and urlencode:

echo "http://example.com"
    . "/category/" . rawurlencode("latest songs")
    . "/search?q=" . urlencode("lady gaga");

Output:

http://example.com/category/latest%20songs/search?q=lady+gaga

What happens if you encode path and query string components the other way round? For the following example:

http://example.com/category/latest+songs/search?q=lady%20gaga
  • The webserver will look for the directory latest+songs instead of latest songs
  • The query string parameter q will contain lady gaga
Northcutt answered 23/9, 2012 at 19:17 Comment(2)
"The query string parameter q will contain lady gaga" What else would it contain otherwise? The query parameter q seems to have the same value passed to the $_GET array regardless of using rawurlencode or urlencode in PHP 5.2+. Though, urlencode encodes in the application/x-www-form-urlencoded format which is default for GET requests so I'm going with your approach. +1Erastianism
I wanted to clarify that both + and %20 are decoded as space when used in query strings.Northcutt
T
6

1. What exactly are the differences and

The only difference is in the way spaces are treated:

urlencode - based on legacy implementation converts spaces to +

rawurlencode - based on RFC 1738 translates spaces to %20

The reason for the difference is because + is reserved and valid (unencoded) in urls.

2. which is preferred?

I'd really like to see some reasons for choosing one over the other ... I want to be able to just pick one and use it forever with the least fuss.

Fair enough, I have a simple strategy that I follow when making these decisions which I will share with you in the hope that it may help.

I think it was the HTTP/1.1 specification RFC 2616 which called for "Tolerant applications"

Clients SHOULD be tolerant in parsing the Status-Line and servers tolerant when parsing the Request-Line.

When faced with questions like these the best strategy is always to consume as much as possible and produce what is standards compliant.

So my advice is to use rawurlencode to produce standards compliant RFC 1738 encoded strings and use urldecode to be backward compatible and accomodate anything you may come across to consume.

Now you could just take my word for it but lets prove it shall we...

php > $url = <<<'EOD'
<<< > "Which, % of Alice's tasks saw $s @ earnings?"
<<< > EOD;
php > echo $url, PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > echo urlencode($url), PHP_EOL;
%22Which%2C+%25+of+Alice%27s+tasks+saw+%24s+%40+earnings%3F%22
php > echo rawurlencode($url), PHP_EOL;
%22Which%2C%20%25%20of%20Alice%27s%20tasks%20saw%20%24s%20%40%20earnings%3F%22
php > echo rawurldecode(urlencode($url)), PHP_EOL;
"Which,+%+of+Alice's+tasks+saw+$s+@+earnings?"
php > // oops that's not right???
php > echo urldecode(rawurlencode($url)), PHP_EOL;
"Which, % of Alice's tasks saw $s @ earnings?"
php > // now that's more like it

It would appear that PHP had exactly this in mind, even though I've never come across anyone refusing either of the two formats, I cant think of a better strategy to adopt as your defacto strategy, can you?

nJoy!

Toilet answered 28/11, 2012 at 3:54 Comment(0)
I
6

Spaces encoded as %20 vs. +

The biggest reason I've seen to use rawurlencode() in most cases is because urlencode encodes text spaces as + (plus signs) where rawurlencode encodes them as the commonly-seen %20:

echo urlencode("red shirt");
// red+shirt

echo rawurlencode("red shirt");
// red%20shirt

I have specifically seen certain API endpoints that accept encoded text queries expect to see %20 for a space and as a result, fail if a plus sign is used instead. Obviously this is going to differ between API implementations and your mileage may vary.

Induction answered 27/7, 2016 at 21:21 Comment(0)
D
5

The difference is in the return values, i.e:

urlencode():

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type. This differs from the » RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

rawurlencode():

Returns a string in which all non-alphanumeric characters except -_. have been replaced with a percent (%) sign followed by two hex digits. This is the encoding described in » RFC 1738 for protecting literal characters from being interpreted as special URL delimiters, and for protecting URLs from being mangled by transmission media with character conversions (like some email systems).

The two are very similar, but the latter (rawurlencode) will replace spaces with a '%' and two hex digits, which is suitable for encoding passwords or such, where a '+' is not e.g.:

echo '<a href="ftp://user:', rawurlencode('foo @+%/'),
     '@ftp.example.com/x.txt">';
//Outputs <a href="ftp://user:foo%20%40%2B%25%[email protected]/x.txt">
Douala answered 15/6, 2009 at 13:46 Comment(1)
The OP asks how to know which to use, and when. Knowing what each does with spaces doesn't help the OP to make a decision if he does not know the importance of the different return values.Up
U
4

urlencode: This differs from the » RFC 1738 encoding (see rawurlencode()) in that for historical reasons, spaces are encoded as plus (+) signs.

Ulotrichous answered 15/6, 2009 at 13:38 Comment(0)
P
1

I believe urlencode is for query parameters, whereas the rawurlencode is for the path segments. This is mainly due to %20 for path segments vs + for query parameters. See this answer which talks about the spaces: When should space be encoded to plus (+) or %20?

However %20 now works in query parameters as well, which is why rawurlencode is always safer. However the plus sign tends to be used where user experience of editing and readability of query parameters matter.

Note that this means rawurldecode does not decode + into spaces (https://www.php.net/manual/en/function.rawurldecode.php). This is why the $_GET is always automatically passed through urldecode, which means that + and %20 are both decoded into spaces.

If you want the encoding and decoding to be consistent between inputs and outputs and you have selected to always use + and not %20 for query parameters, then urlencode is fine for query parameters (key and value).

The conclusion is:

Path Segments - always use rawurlencode/rawurldecode

Query Parameters - for decoding always use urldecode (done automatically), for encoding, both rawurlencode or urlencode is fine, just choose one to be consistent, especially when comparing URLs.

Perforation answered 7/2, 2014 at 13:9 Comment(0)
K
1

simple * rawurlencode the path - path is the part before the "?" - spaces must be encoded as %20 * urlencode the query string - Query string is the part after the "?" -spaces are better encoded as "+" = rawurlencode is more compatible generally

Kunstlied answered 14/3, 2017 at 11:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.