Hyperlink href incorrectly quoted in innerHTML?

Take this very simple example HTML:

<html>
    <body>This is okay &amp; fine, but the encoding of <a href="http://example.com?a=1&b=2">this link</a> seems wrong.</body>
</html>

On examining document.body.innerHTML (e.g. in the browser's JS console, in JS itself, etc.), this is the value I see:

This is okay &amp; fine, but the encoding of <a href="http://example.com?a=1&amp;b=2">this link</a> seems wrong.

This behaviour is the same across browsers, but I can't understand it; it seems wrong.

Specifically, the link in the original document is to http://example.com?a=1&b=2, whereas if the value of innerHTML is treated as HTML then it links to http://example.com?a=1&amp;b=2, which is NOT the same (e.g. if I created a new document which actually had innerHTML as its inner HTML, and I clicked on the link, then the browser would be sent to a materially different URL as far as I can see).

(EDIT #3: I'm wrong about the above. Firstly, yes, those two URLs are different; but secondly, the innerHTML which I thought was wrong is right, and it correctly represents the first URL, not the second! See the end of my own answer below.)
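
To check that the two URLs really are different when each is taken literally, both can be parsed with the standard URL API (available in browsers and in Node.js). This is only a sketch; the helper name paramNames is made up for illustration.

```javascript
// The URL the original link points to.
const intended = new URL("http://example.com?a=1&b=2");
// The same characters as the innerHTML text, but treated as a literal URL
// (i.e. WITHOUT HTML decoding, which is not what a browser does with markup).
const literal = new URL("http://example.com?a=1&amp;b=2");

// Hypothetical helper: list the query parameter names of a URL.
function paramNames(url) {
  return [...url.searchParams.keys()];
}

console.log(paramNames(intended).join(", ")); // a, b
console.log(paramNames(literal).join(", "));  // a, amp;b
```

The second URL only arises if the text &amp; reaches the URL parser undecoded. When the same text appears inside an href in HTML, the parser decodes it back to & first.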

This is different from the issue discussed in question innerHTML gives me & as &amp; !. In my case (which is the opposite to the case in that question) the original HTML is correct and it looks to me as if it is the innerHTML which is wrong (i.e. because it is HTML which does not represent what the original HTML represented).

(EDIT #2: I was wrong about this, too: it's not really different. But I think it is not widely known that &amp; is the correct way to represent & inside an href, not just within body text. Once you realise that, then you can see that these are the same issue really.)

Can anyone explain this?

(EDIT #1+4: This only occurred to me a bit late, after writing my original question, but: "is &amp; actually correct within the href text, and & technically incorrect?" As I said when I first wrote those words, that "seems very unlikely! I've certainly never seen HTML written that way." But however 'unlikely', or not, that is the case, and is the main part of what I wasn't understanding!)

Also related and would be useful, can anyone explain how to cleanly get HTML which does correctly represent the target of document links? You definitely can't just un-encode all HTML character references within innerHTML, because (as shown in the example I've used, and also as discussed in innerHTML gives me & as &amp; !) the ones in the main run of text should be encoded, and just un-encoding everything would make these wrong.

I originally thought this was not a duplicate of innerHTML gives me & as &amp; ! (as discussed above; and in a way it still isn't, if it's agreed that it's not as obvious or widely known that the same issues apply inside href as in body text). It's still definitely not a duplicate of A href in innerHTML (which somewhat unclearly asks about how to set innerHTML using JS).

Iodism answered 25/6, 2020 at 14:23 Comment(3)
Instead of innerHTML look in document.getElementsByTagName("a")[0].href (== "example.com/?a=1&b=2"). This is specifically for the first <a> tag - modify as needed.Bucher
I'm trying to take the whole of innerHTML from an HTML editing component to use as a template elsewhere, so I think it would be hard to stick those back in the right place. But still that's a useful alternative to know about - thanks!Iodism
Using innerText will not encode the text, and gets the literal text within the HTML (it will also strip any other content when you have nested elements inside a non-a element).Lunette

Most browser tools don't show the actual HTML because it wouldn't be of much help:

  • HTML is often generated dynamically after page load with the help of CSS and JavaScript.
  • HTML is often broken and the browser needs to repair it in order to generate the memory representation needed for rendering and other stuff.

So the HTML you see is not the actual source but it's generated on the fly from the current status of the document, which of course includes all the fixes applied (in your case, the invalid HTML entities).
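
The idea that markup is regenerated from the in-memory tree can be sketched without a browser. The node shape ({ tag, attrs, children }) and the function names below are invented for this illustration, and the escaping is much simpler than what the HTML Standard specifies for real serialization.

```javascript
// Simplified attribute escaping: only & and " are handled here.
function escapeAttribute(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Simplified text escaping: only &, < and > are handled here.
function escapeText(value) {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical serializer: children are either strings or nested nodes.
function serialize(node) {
  if (typeof node === "string") return escapeText(node);
  const attrs = Object.entries(node.attrs)
    .map(([name, value]) => ` ${name}="${escapeAttribute(value)}"`)
    .join("");
  const inner = node.children.map(serialize).join("");
  return `<${node.tag}${attrs}>${inner}</${node.tag}>`;
}

const link = {
  tag: "a",
  attrs: { href: "http://example.com?a=1&b=2" },
  children: ["this link"],
};

console.log(link.attrs.href); // the stored value keeps the bare &
console.log(serialize(link)); // the generated markup writes it as &amp;
```

The stored attribute value never changes; only its textual representation as markup carries the escape.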

The following example hopefully illustrates all the combinations:

const section = document.querySelector("section");
const invalid = document.createElement("p");
invalid.innerHTML = '<a href="http://example.com/?a=1&b=2">Invalid HTML (dynamic)</a>';
const valid = document.createElement("p");
valid.innerHTML = '<a href="http://example.com/?a=1&amp;b=2">Valid HTML (dynamic)</a>';
section.appendChild(valid);
section.appendChild(invalid);
const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}
<section>
  <p><a href="http://example.com/?a=1&b=2">Invalid HTML (static)</a></p>
  <p><a href="http://example.com/?a=1&amp;b=2">Valid HTML (static)</a></p>
</section>

Is &amp; actually correct within the href text, and & technically incorrect? It seems very unlikely! I've certainly never seen HTML written that way.

There's no such thing as "technically correct", let alone today when HTML is pretty well standardised. (Well, yes, there're two competing standards bodies and specs are continuously evolving, but the basics were set up long ago.)

The & symbol starts a character entity and &b is an invalid character entity. Period.

[Image: W3C Validator screenshot]

But it works! Doesn't that mean it's technically correct?

It works because browsers are explicitly designed to deal with completely broken markup, what's known as tag soup, because it was thought that it would ease usage:

<p><strong>Hello, World!</u>
<body><br itspartytime="yeah">
  <pink>It works!!!</red>

But HTML entities are just an encoding artefact. That doesn't mean that URLs are not allowed to contain literal ampersands, it just means that —when in HTML context— they need to be represented as &amp;. It's the same as when you type a backslash in a JavaScript string to escape some quotes: the backslash does not become part of your data.
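
The backslash comparison can be checked directly. This is a small illustration with an arbitrary example string:

```javascript
// The backslashes exist only in the source code, not in the resulting data.
const quoted = "She said \"hi\"";

console.log(quoted);                // She said "hi"
console.log(quoted.length);         // 13
console.log(quoted.includes("\\")); // false
```

In the same way, the amp; in an href is part of how the value is written in HTML, not part of the URL itself.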

Improvise answered 25/6, 2020 at 16:13 Comment(9)
Yes, but I think that info is largely irrelevant here (unfortunately), because this is a difference between what the browser tools show (<a href="http://example.com/?a=1&b=2">) and what comes back in innerHTML (<a href="http://example.com/?a=1&amp;b=2">).Iodism
innerHTML isn't the actual HTML source sent by the server either, it's also generated from the current DOM tree. The rationale is the same.Ardeb
That rationale might be the same, but the results aren't; innerHTML contains a modified href, but the DOM tree does not. (If React created a hyperlink with an href of "example.com/?a=1&b=2" then the DOM href would be href="example.com/?a=1&b=2", so to imply that this is because of what browsers do when dealing with dynamically generated source is misleading: browsers actually DON'T do this - don't make this change - when dealing with dynamically generated source, like React, but always do do this when producing innerHTML, even when not from dynamically generated source!)Iodism
Then I either didn't understand your question or my browser doesn't exhibit the same behaviour, sorry.Ardeb
Okay, thanks for adding the additional info. So, if you set the original value using innerHTML then it is already modified immediately. I see. But it is still true that there are lots of other ways to set the original value (including, for instance, loading my very short sample HTML into your browser), and as you can see in that case the href is NOT modified in the DOM. So (I think) you could still see this as primarily an issue about how innerHTML works (okay, on both setting and getting), not primarily about how HTML is modified before it is ever stored in the DOM. Thank you for clarifying.Iodism
I have never written and do not think "But it works! Doesn't that mean it's technically correct?"!!! I agree I misunderstood something, obviously, but if most of your post 'explains' misunderstandings that were not present in my OP then I will find most of it irrelevant (and so, perhaps, will someone else with the same issue). I misunderstood the applicability of entities within href. I knew about them outside that; I thought the contexts were different. And I thought and still think that the difference in behaviour in going to/from HTML to DOM and to/from innerHTML to DOM is surprising.Iodism
I agree your investigation has added something, namely that writing to innerHTML is also surprising. But your much shorter first answer was that this was all about me not understanding that browsers change what they're given when producing the DOM, and I do and did understand that. If you'd tried my example, you'd have seen that the browser doesn't change what is given there, but sets href to "http://example.com/?a=1&b=2", so your original point that this is about changes when raw HTML is converted to DOM cannot be the key answer to my OP (since that doesn't happen with my OP code).Iodism
I didn't say you wrote that. It's a figure of speech, Socratic dialogue if I recall correctly. As I said, I don't really understand your question. I've added the information in benefit of whoever comes here in the future. You don't need to mark it as accepted since it doesn't answer the question.Ardeb
Okay, I've done some more playing around and I'm starting to see what you're saying. It took me a while. :-/ I'd appreciate if you have two apparent direct quotes in your answer if you could add something like "As the OP says:" and "As someone might then say:". (To me, if I was reading your answer, I'd assume that the OP had said the second quote at some point, and then maybe edited it out of their answer.)Iodism

Having thought up a possible (but I thought 'unlikely') explanation - which I put in as an edit in the original question - I've realised that it is the answer:

  • Using & to represent & inside an href is technically incorrect, and &amp; is technically correct

I gathered this initially from this SO answer https://mcmap.net/q/1925021/-spec-for-handling-of-html-entities-in-a-href, and I think it is relevant that (as it also says in that answer) the idea that &amp; is the correct way to represent & in an href is not as widely understood as the idea that &amp; is the correct way to represent & in body text.

Once you do understand this, it makes sense that what the browser is doing is right, and that the innerHTML value which comes back represents the link correctly.

EDIT:

@ÁlvaroGonzález gives a much longer answer, and it took me a while to see how everything he says applies, so I thought I'd try to explain what I didn't understand starting from where I started from, in case it helps someone else!

If you start with raw HTML with <a href="http://example.com/?a=1&b=1"> and then you inspect the DOM in the browser, or look at the value of the href attribute in JS then you see "http://example.com/?a=1&b=1" everywhere. So it looks as if nothing has changed, and nothing was wrong. What I didn't understand is that actually the browser has parsed a technically incorrect href (with invalid entities) to be able to display this to you! (Yes, LOTS of people use this 'broken' format!)

To see this first hand, load this longer HTML example into your browser:

<html>
    <body style="font-family: sans-serif">
        <p>Now & then <a href="http://example.com/?a=1&b=2">http://example.com/?a=1&b=2</a></p>
        <p>Now &amp; then <a href="http://example.com/?a=1&amp;b=2">http://example.com/?a=1&amp;b=2</a></p>
        <p>Now &amp;amp; then <a href="http://example.com/?a=1&amp;amp;b=2">http://example.com/?a=1&amp;amp;b=2</a></p>
    </body>
</html>

then in your JavaScript console try running this code taken from @ÁlvaroGonzález's answer:

const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}

Also try clicking on the links to see where they go.

Once you've made sense of everything that you see there, it is no longer surprising how innerHTML works!
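
For anyone without a browser to hand, the round trip for the three hrefs can be approximated in plain JavaScript. The functions decodeAmp and encodeAmp are hypothetical stand-ins that handle only the ampersand; a real HTML parser and serializer deal with many more character references, so treat this as a rough model rather than what the browser actually runs.

```javascript
// Hypothetical, deliberately tiny decoder: it understands only &amp; and
// leaves any other ampersand (such as the one in "&b=2") untouched.
function decodeAmp(raw) {
  return raw.replace(/&amp;/g, "&");
}

// Hypothetical encoder: every ampersand in the stored value is written as &amp;.
function encodeAmp(value) {
  return value.replace(/&/g, "&amp;");
}

// The three href values as written in the source of the longer example.
const rawHrefs = [
  "http://example.com/?a=1&b=2",
  "http://example.com/?a=1&amp;b=2",
  "http://example.com/?a=1&amp;amp;b=2",
];

for (const raw of rawHrefs) {
  const stored = decodeAmp(raw);        // roughly what getAttribute("href") reports
  const serialized = encodeAmp(stored); // roughly what innerHTML shows
  console.log({ raw, stored, serialized });
}
```

Under this model the first two hrefs end up as the same stored value, and the third is stored with a literal &amp; in the URL, so it is the odd one out.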

Iodism answered 25/6, 2020 at 15:29 Comment(1)
You could also use innerText; it will do the same, except that it will not encode any ampersands etc.Lunette
