Do I encode ampersands in <a href...>?

4

186

I'm writing code that automatically generates HTML, and I want it to encode things properly.

Say I'm generating a link to the following URL:

http://www.google.com/search?rls=en&q=stack+overflow

I'm assuming that all attribute values should be HTML-encoded. (Please correct me if I'm wrong.) So that means if I'm putting the above URL into an anchor tag, I should encode the ampersand as &amp;, like this:

<a href="http://www.google.com/search?rls=en&amp;q=stack+overflow">

Is that correct?

Eclair answered 14/9, 2010 at 1:36 Comment(4)
possible duplicate of Which characters make a URL invalid?Notary
@CiroSantilli: that's about actual URL strings; this is about how they're encoded when they appear in HTML attributes.Eclair
as i see, encoding ampersands is not always required in html5, and answers are outdated.Uppity
question for html5: #19442250Uppity
B
195

Yes, it is. HTML entities are parsed inside HTML attributes, and a stray & would create an ambiguity. That's why you should always write &amp; instead of just & inside all HTML attributes.

That said, only & and quotes need to be encoded. If you have special characters like é in your attribute, you don't need to encode those to satisfy the HTML parser.
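As a minimal sketch of what this looks like in generated markup, the snippet below uses Python's standard html.escape on the URL from the question. The helper name anchor_tag is made up for illustration, and Python is only one possible choice of language here.

```python
from html import escape


def anchor_tag(url: str) -> str:
    """Return an opening <a> tag with the URL escaped for a double-quoted attribute."""
    # quote=True (the default) also escapes " and ', so the value cannot
    # close the attribute early. Non-ASCII characters such as é are left alone.
    return '<a href="{}">'.format(escape(url, quote=True))


tag = anchor_tag("http://www.google.com/search?rls=en&q=stack+overflow")
print(tag)
```

Only the HTML layer is handled here; the URL itself is assumed to be valid already.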

It used to be the case that URLs containing non-ASCII characters, like é, needed special treatment. You had to encode those characters using percent-escapes (é becomes %C3%A9 in UTF-8), because URLs were defined by RFC 1738. However, RFC 1738 has been superseded by RFC 3986 (URIs, Uniform Resource Identifiers) and RFC 3987 (IRIs, Internationalized Resource Identifiers), on which the WHATWG based its work to define how browsers should behave, since HTML5, when they see a URL with non-ASCII characters in it. It's therefore now safe to include non-ASCII characters in URLs, percent-encoded or not.
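For generators that still prefer the percent-encoded form, a short sketch using Python's urllib.parse.quote (UTF-8 by default) shows the %C3%A9 escape mentioned above. The path is the Wikipedia example from the comments; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# A path containing a non-ASCII character.
path = "/wiki/Allégorie_de_la_caverne"

# quote() percent-encodes the UTF-8 bytes of é and leaves "/" untouched by default.
encoded = quote(path)
print(encoded)

# unquote() reverses the transformation.
print(unquote(encoded) == path)
```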

Boiardo answered 14/9, 2010 at 1:39 Comment(10)
You can also encode spaces as "+" rather than %20 - which makes the URL easier to read.Weinberger
+ isn't respected in mailto links in the native iPhone mail client currently, for what it's worth.Intelligence
é still needs encoding: https://mcmap.net/q/18591/-unicode-characters-in-urlsGeognosy
@lulalala, I'd be curious to hear your case on that. I'm a Francophone and have been using French characters (including é) in a handful of URLs and I haven't had issues with it, assuming the web page had the correct encoding, of course. Here's one, and you can check the source to verify that Stack Overflow hasn't encoded it: fr.wikipedia.org/wiki/Allégorie_de_la_caverneBoiardo
@Boiardo When I see the source though it is encoded as &#233. What I want to get across though, is that like &, Unicode characters should be encoded in URLs too.Geognosy
@lulala, I believe that we are both mistaken then. Stack Exchange changed the é to &#233, as you said, but HTML entities are resolved at parse time. I was using the DOM inspector, this is why it showed an é instead. As you correctly note, URLs can only contain US-ASCII characters; however, using an HTML entity in this case instead of a percent encode will still result in a technically invalid URL. This is because of the URL definition and not because of the HTML parser. I'll edit the answer accordingly.Boiardo
@Boiardo Actually you are right (that HTML entities are resolved at parse time). Thanks for the very informative edit!Geognosy
@lulalala, clever people I know noted that the current HTML standard uses URL to mean either URI or IRI. IRIs exist solely to allow for UTF-8 sequences.Boiardo
I would add (as I just fell into this mistake) that if you are relying on a template engine, you should check whether it automatically takes care of escaping HTML entities. In my case Twig was doing that, and I was wrongly double-escaping by writing &amp; in the tag attribute instead of using & directly.Perrins
That leaves open the question of space. Does it need to be percent encoded or not? Can you address that in your answer? (But without "Edit:", "Update:", or similar - the answer should appear as if it was written today.)Olethea
A
28

By current official HTML recommendations, the ampersand must be escaped e.g. as &amp; in contexts like this. However, browsers do not require it, and the HTML5 CR proposes to make this a rule, so that special rules apply in attribute values. Current HTML5 validators are outdated in this respect (see bug report with comments).

It will remain possible to escape ampersands in attribute values, but apart from validation with current tools, there is no practical need to escape them in href values (and there is a small risk of making mistakes if you start escaping them).
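A rough way to see that both spellings yield the same link is to run them through an HTML parser. The sketch below uses Python's html.parser, which is not a browser and is only assumed here to be a reasonable stand-in for this particular URL; HrefCollector and parse_hrefs are hypothetical helper names.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the decoded href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")


def parse_hrefs(markup):
    parser = HrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


raw = '<a href="http://www.google.com/search?rls=en&q=stack+overflow">search</a>'
escaped = '<a href="http://www.google.com/search?rls=en&amp;q=stack+overflow">search</a>'

# Both forms decode to the same attribute value for this URL.
print(parse_hrefs(raw) == parse_hrefs(escaped))
```

This only covers a URL in which no "&name" sequence happens to look like a character reference.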

Astigmatic answered 9/5, 2013 at 5:29 Comment(6)
XHTML (real XHTML sent as application/xhtml+xml) will most likely always require it, though.Boiardo
One caveat to this change, which is still being discussed, debated, and misunderstood, is that the & is supposed to be ok now, so long as it is "unambiguous". One obvious way to make the ampersand ambiguous is to follow it first with non-space characters and then a semicolon. That ampersand is now ambiguous, and will cause a parse error.Sagunto
As Jukka said, there is certainly a risk to encoding all the ampersands, so consider how likely it is that one of your href urls contains a semicolon. Rather unlikely, as I'm not sure I've ever seen a url with a semicolon. Not that it can't be done. So practically speaking, I don't think it's likely that our use of & will be ambiguous. Therefore, we continue to use it unencoded in href attributes.Sagunto
The whole reason the escaping is necessary is precisely because of the possibility of an ambiguity. This particular issue might not be introducing XSS attack vectors, bad rendering, or any effect at all 99.99% of the time, but that isn't a reason not to bother. Doing escaping correctly is hard and there's always the possibility of making mistakes.Mention
Perhaps it is time for an update? (But without "Edit:", "Update:", or similar - the answer should appear as if it was written today.)Olethea
@Phil, Doing escaping properly is easy if you escape correctly: as, when and how needed. Unfortunately, libraries like WordPress escape at the wrong time, and have become a mess due to incorrect fixes for security issues which weren't even their responsibility. Hence the double-escaping problem, which shouldn't arise with accurate programming.Rufous
A
8

You have two standards concerning URLs in links (<a href).

The first standard is RFC 1866 (HTML 2.0) where in "3.2.1. Data Characters" you can read the characters which need to be escaped when used as the value for an HTML attribute. (Attributes themselves do not allow special characters at all, e.g. <a hr&ef="http://... is not allowed, nor is <a hr&amp;ef="http://....)

This later carried over into the HTML 4 standard. The characters you need to escape are:

<   to   &lt;
>   to   &gt;
&   to   &amp;
"   to   &quot;
'   to   &apos;   (defined by XML and HTML5 but not by HTML 4; &#39; works everywhere)

The other standard is RFC 3986 "Generic URI standard", where URLs are handled (this happens when the browser is about to follow a link because the user clicked on the HTML element).

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

It is important to escape those characters so the client knows whether they represent data or a delimiter.

Example unescaped:

https://example.com/?user=test&password&te&st&goto=https://google.com

Example of a fully legitimate URL:

https://example.com/?user=test&password&te%26st&goto=https%3A%2F%2Fgoogle.com

Example of a fully legitimate URL in the value of an HTML attribute:

https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com
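The two layers can be applied in order: percent-encode each URL component first, then HTML-escape the finished URL for the attribute. The following is a sketch in Python that reproduces the examples above; build_query is a hypothetical helper written for this illustration (it allows keys without values, as in the example), not a library function.

```python
from html import escape
from urllib.parse import quote


def build_query(pairs):
    """Join (key, value) pairs into a query string; a value of None means 'key only'."""
    parts = []
    for key, value in pairs:
        # safe="" forces reserved characters such as & : / to be percent-encoded.
        part = quote(key, safe="")
        if value is not None:
            part += "=" + quote(value, safe="")
        parts.append(part)
    return "&".join(parts)


pairs = [
    ("user", "test"),
    ("password", None),
    ("te&st", None),
    ("goto", "https://google.com"),
]

# Layer 1: URI encoding (RFC 3986).
url = "https://example.com/?" + build_query(pairs)
print(url)

# Layer 2: HTML encoding, applied once, when the URL is placed in an attribute.
attr_value = escape(url, quote=True)
print('<a href="{}">'.format(attr_value))
```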

Also important scenarios:

  • JavaScript code as a value:

    <a href="..." onclick="window.location.href = &quot;https://example.com/?user=test&amp;password&amp;te%26st&amp;goto=https%3A%2F%2Fgoogle.com&quot;;">...</a> (Yes, ;; is correct.)

  • JSON as a value:

    <a href="..." data-analytics="{&quot;event&quot;: &quot;click&quot;}">...</a>

  • Escaped things inside escaped things, double encoding, URL inside URL inside parameter, etc,...

    http://x.com/?passwordUrl=http%3A%2F%2Fy.com%2F%3Fuser%3Dtest&amp;password=&quot;&quot;123
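The JSON scenario follows the same pattern: serialize first, then HTML-escape the result for the attribute. Here is a small Python sketch, assuming the data-analytics attribute from the example above; the variable names are illustrative.

```python
import json
from html import escape, unescape

payload = {"event": "click"}

# Serialize to JSON, then escape the JSON text for a double-quoted attribute.
attr_value = escape(json.dumps(payload), quote=True)
tag = '<a href="..." data-analytics="{}">...</a>'.format(attr_value)
print(tag)

# Reading it back reverses the layers in the opposite order:
# HTML-unescape first, then parse the JSON.
restored = json.loads(unescape(attr_value))
print(restored == payload)
```

Encoding in one order and decoding in the reverse order is what keeps nested formats from being double-encoded.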

I am posting a new answer because I find zneak's answer does not have enough examples, does not show HTML and URI handling as different aspects and standards and has some minor things missing.

Anamorphosis answered 28/12, 2018 at 16:55 Comment(0)
L
3

Yes, you should convert & to &amp;.

This HTML validator tool by W3C is helpful for questions like this. It will tell you the errors and warnings for a particular page.

Laurilaurianne answered 28/4, 2015 at 19:0 Comment(2)
I'm not sure that the W3C validator detects this (unescaped & in a href) as an error.Mcalpin
Currently, the W3C validator accepts unescaped & as valid. Does it mean that the standard has changed and encoding is no longer required? (making most answers here outdated)? If so, does this apply only to href or any attribute?Lecturer
