When is a space in a URL encoded to +
, and when is it encoded to %20
?
From Wikipedia (emphasis and link added):
When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20". The MIME type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined (still in a very outdated manner) in the HTML and XForms specifications.
So, the real percent encoding uses %20
while form data in URLs is in a modified form that uses +
. So you're most likely to only see +
in URLs in the query string after an ?
.
multipart/form-data
uses MIME encoding; application/x-www-form-urlencoded
uses +
and properly encoded URIs use %20
. –
Labiodental http://www.bing.com/search?q=hello+world
and a resource with space in the name http://camera.phor.net/cameralife/folders/2012/2012-06%20Pool%20party/
–
Loosetongued +
is used. –
Castro mailto:[email protected]?subject=I%20need%20help
. If you tried that with +, the email will open with +es instead of spaces. –
Westmorland This confusion is because URLs are still 'broken' to this day.
From a blog post:
Take "http://www.google.com" for instance. This is a URL. A URL is a Uniform Resource Locator and is really a pointer to a web page (in most cases). URLs actually have a very well-defined structure since the first specification in 1994.
We can extract detailed information about the "http://www.google.com" URL:
+---------------+-------------------+ | Part | Data | +---------------+-------------------+ | Scheme | http | | Host | www.google.com | +---------------+-------------------+
If we look at a more complex URL such as:
"https://bob:[email protected]:8080/file;p=1?q=2#third"
we can extract the following information:
+-------------------+---------------------+ | Part | Data | +-------------------+---------------------+ | Scheme | https | | User | bob | | Password | bobby | | Host | www.lunatech.com | | Port | 8080 | | Path | /file;p=1 | | Path parameter | p=1 | | Query | q=2 | | Fragment | third | +-------------------+---------------------+ https://bob:[email protected]:8080/file;p=1?q=2#third \___/ \_/ \___/ \______________/ \__/\_______/ \_/ \___/ | | | | | | \_/ | | Scheme User Password Host Port Path | | Fragment \_____________________________/ | Query | Path parameter Authority
The reserved characters are different for each part.
For HTTP URLs, a space in a path fragment part has to be encoded to "%20" (not, absolutely not "+"), while the "+" character in the path fragment part can be left unencoded.
Now in the query part, spaces may be encoded to either "+" (for backwards compatibility: do not try to search for it in the URI standard) or "%20" while the "+" character (as a result of this ambiguity) has to be escaped to "%2B".
This means that the "blue+light blue" string has to be encoded differently in the path and query parts:
"http://example.com/blue+light%20blue?blue%2Blight+blue".
From there you can deduce that encoding a fully constructed URL is impossible without a syntactical awareness of the URL structure.
This boils down to:
You should have %20
before the ?
and +
after.
key1=value1&key1=value2
where keys and values are encoded with whatever rules encodeURIComponent
follow but AFAIK the contents of the query part is entirely 100% up to the app. Other then it only goes to the first #
there's no official encoding. –
Spotted ?
, but after the ?
it is simply a matter of taste. For the love of God, people, just always use the percent sign-based encoding and clear out some brain space for more important stuff. –
Singhalese +
as space that's fine, the URL standard doesn't care. –
Dinghy /
is valid in query
but both rfc2396 and rfc1738 show it as a reserved character. –
Connors A space may only be encoded to "+" in the "application/x-www-form-urlencoded" content-type key-value pairs query part of an URL. In my opinion, this is a may, not a must. In the rest of URLs, it is encoded as %20.
In my opinion, it's better to always encode spaces as %20, not as "+", even in the query part of an URL, because it is the HTML specification (RFC 1866) that specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" content-type key-value pairs (see paragraph 8.2.1. subparagraph 1.)
This way of encoding form data is also given in later HTML specifications. For example, look for relevant paragraphs about application/x-www-form-urlencoded in HTML 4.01 Specification, and so on.
Here is a sample string in a URL where the HTML specification allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after "?", spaces can be replaced by pluses. In other cases, spaces should be encoded to %20. But since it's hard to determine the context correctly, it's the best practice to never encode spaces as "+".
I would recommend to percent-encode all character except "unreserved" defined in RFC 3986, p.2.3
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The implementation depends on the programming language that you chose.
If your URL contains national characters, first encode them to UTF-8 and then percent-encode the result.
I would recommend %20
.
Are you hard-coding them?
This is not very consistent across languages, though.
If I'm not mistaken, in PHP urlencode()
treats spaces as +
whereas Python's urlencode()
treats them as %20
.
EDIT:
It seems I'm mistaken. Python's urlencode()
(at least in 2.7.2) uses quote_plus()
instead of quote()
and thus encodes spaces as "+".
It seems also that the W3C recommendation is the "+" as per here: http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
And in fact, you can follow this interesting debate on Python's own issue tracker about what to use to encode spaces: http://bugs.python.org/issue13866.
EDIT #2:
I understand that the most common way of encoding " " is as "+", but just a note, it may be just me, but I find this a bit confusing:
import urllib
print(urllib.urlencode({' ' : '+ '})
>>> '+=%2B+'
URLEncoder.encode()
method in Java converts it in +
as well. –
Lycurgus To summarize the (somewhat conflicting) answers here, I think it can be boiled down to:
| standard | + | %20 |
|---------------+-----+-----|
| URL | no | yes |
| query string | yes | yes |
| form params | yes | no |
| mailto query | no | yes |
So historically I think what happened is:
- The RFC specified a pretty clear standard about the form of URLs and how they are encoded. In this context the query is just a "string", there is no specification how key/value pairs should be encoded
- The HTTP guys put out a standard of how key/value pairs are to be encoded in form params, and borrowed from the URL encoding standard, except that spaces should be encoded as
+
. - The web guys said: cool we have a way to encode key/value pairs let's put that into the URL query string
Result: We end up with two different ways how to encode spaces in a URL depending on which part you're talking about. But it doesn't even violate the URL standard. From URL perspective the "query" is just a blackbox. If you want to use other encodings there besides percent encoding: knock yourself out.
But as the email example shows it can be problematic to borrow from the form-params implementation for an URL query string. So ultimately using %20 is safer but there might not be out of the box library support for it.
As I was surprised that nobody did cite the actual RFC 3986 on "percent encoding", I'm adding my own answer here:
As the aforementioned RFC does not include any reference of encoding spaces as +
, I guess using %20
is the way to go today.
For example, "%20" is the percent-encoding for the binary octet "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space character (SP).
© 2022 - 2024 — McMap. All rights reserved.