innerHTML unencodes < in attributes

I have an HTML document that might have &lt; and &gt; in some of its attributes. I am trying to extract this markup and run it through an XSLT, but the XSLT engine fails with an error telling me that < is not valid inside an attribute.

I did some digging, and found that it is properly escaped in the source document, but when this is loaded into the DOM via innerHTML, the DOM is unencoding the attributes. Strangely, it does this for &lt; and &gt;, but not some others like &amp;.

Here is a simple example:

var div = document.createElement('DIV');
div.innerHTML = '<div asdf="&lt;50" fdsa="&amp;50"></div>';
console.log(div.innerHTML);
// Observed: <div asdf="<50" fdsa="&amp;50"></div>

I'm assuming that the DOM implementation decided that HTML attributes can be less strict than XML attributes, and that this is "working as intended". My question is, can I work around this without writing some horrible regex replacement?

Judkins answered 6/10, 2015 at 15:31 Comment(4)
@Abel I am using jQuery's .html(), I just attempted to reduce down to where I think the "problem" is occurring. The source document is XML, which I run through a browser XSLT before inserting with .html(). Later I take it through the inverse process to get the XML back out. I just find it strange that the DOM is unescaping this character (and not others). - Judkins
I can't modify the source XML, and need to preserve the same content in the output at the end. I could run whatever transforms are necessary in the middle, but am looking for a way to do it better than some regex replace. Especially considering the character is <, which the document is full of. - Judkins
@Abel my only goal is to get it back out of the DOM the same way it went in (as &lt;). I'm putting it in with .text(string) and getting it out with .text(). The problem I have with this round-trip is that the input doesn't equal the output (only in this case). - Judkins
Ah, sorry. Well, that is probably only possible with other DOM methods, not with innerHTML. I.e., this works: div.firstChild.attributes['title']. But this requires a whole lot extra machinery to "mimic" innerHTML. - Salesgirl
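The "extra machinery" mentioned in the last comment could look roughly like the minimal sketch below. It reads the already-decoded values from an element's attributes collection and re-escapes them for XML, instead of trusting innerHTML. The helper names escapeXmlAttr and serializeOpenTag are hypothetical, and the sketch only builds the opening tag of a single element; child nodes, namespaces, and whitespace normalization are not handled.

```javascript
// Hypothetical helper: escape a decoded attribute value so it is safe
// inside a double-quoted XML attribute.
function escapeXmlAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')   // must run first so the entities below are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: build an element's opening tag from tagName and
// the attributes collection (each entry exposes .name and .value).
function serializeOpenTag(el) {
  var out = '<' + el.tagName.toLowerCase();
  for (var i = 0; i < el.attributes.length; i++) {
    var attr = el.attributes[i];
    out += ' ' + attr.name + '="' + escapeXmlAttr(attr.value) + '"';
  }
  return out + '>';
}
```

In a browser this would be called as serializeOpenTag(div.firstChild). Because it only touches tagName and attributes, it can also be exercised with a plain object that has the same shape.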
J
0

What ended up working best for me was to double-escape these using an XSLT on the incoming document (and reverse this on the outgoing doc).

So &lt; in an attribute becomes &amp;lt;. Thanks to @Abel for the suggestion.

Here is the XSLT I added, in case others find it helpful:

First is a template for doing string replacements in XSLT 1.0. If you can use XSLT 2.0, you can use the built-in replace() function instead.

<xsl:template name="string-replace-all">
    <xsl:param name="text"/>
    <xsl:param name="replace"/>
    <xsl:param name="by"/>
    <xsl:choose>
        <xsl:when test="contains($text, $replace)">
            <xsl:value-of select="substring-before($text,$replace)"/>
            <xsl:value-of select="$by"/>
            <xsl:call-template name="string-replace-all">
                <xsl:with-param name="text" select="substring-after($text,$replace)"/>
                <xsl:with-param name="replace" select="$replace"/>
                <xsl:with-param name="by" select="$by"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

Next are the templates that do the specific replacements that I need:

<!-- xml -> html -->
<xsl:template name="replace-html-codes">
    <xsl:param name="text"/>
    <xsl:variable name="lt">
        <xsl:call-template name="string-replace-all">
            <xsl:with-param name="text" select="$text"/>
            <xsl:with-param name="replace" select="'&lt;'"/>
            <xsl:with-param name="by" select="'&amp;lt;'"/>
        </xsl:call-template>
    </xsl:variable>
    <xsl:variable name="gt">
        <xsl:call-template name="string-replace-all">
            <xsl:with-param name="text" select="$lt"/>
            <xsl:with-param name="replace" select="'&gt;'"/>
            <xsl:with-param name="by" select="'&amp;gt;'"/>
        </xsl:call-template>
    </xsl:variable>
    <xsl:value-of select="$gt"/>
</xsl:template>

<!-- html -> xml -->
<xsl:template name="restore-html-codes">
    <xsl:param name="text"/>
    <xsl:variable name="lt">
        <xsl:call-template name="string-replace-all">
            <xsl:with-param name="text" select="$text"/>
            <xsl:with-param name="replace" select="'&amp;lt;'"/>
            <xsl:with-param name="by" select="'&lt;'"/>
        </xsl:call-template>
    </xsl:variable>
    <xsl:variable name="gt">
        <xsl:call-template name="string-replace-all">
            <xsl:with-param name="text" select="$lt"/>
            <xsl:with-param name="replace" select="'&amp;gt;'"/>
            <xsl:with-param name="by" select="'&gt;'"/>
        </xsl:call-template>
    </xsl:variable>
    <xsl:value-of select="$gt"/>
</xsl:template>

The XSLT is mostly a pass-through. I just call the appropriate template when copying attributes:

<xsl:template match="@*">
    <xsl:attribute name="data-{local-name()}">
        <xsl:call-template name="replace-html-codes">
            <xsl:with-param name="text" select="."/>
        </xsl:call-template>
    </xsl:attribute>
</xsl:template>

<!-- copy all nodes -->
<xsl:template match="node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>
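
For readers who want to see the same round trip outside XSLT, here is a minimal JavaScript sketch of what the two named templates do to a single (already decoded) attribute value. It is an illustration only, not part of the stylesheet above; the function names replaceHtmlCodes and restoreHtmlCodes are hypothetical and simply mirror the template names.

```javascript
// xml -> html: turn literal "<" and ">" characters into the entity text
// "&lt;" / "&gt;", so that they survive a trip through innerHTML.
function replaceHtmlCodes(text) {
  return text.replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

// html -> xml: turn the entity text back into the literal characters.
function restoreHtmlCodes(text) {
  return text.replace(/&lt;/g, '<').replace(/&gt;/g, '>');
}

var original = 'a < b && b > c';
var escaped = replaceHtmlCodes(original);
var restored = restoreHtmlCodes(escaped);
console.log(escaped);
console.log(restored === original);
```

One limitation applies to this sketch and to the templates alike: a value that already contained the literal text &lt; before escaping cannot be told apart from an escaped < afterwards, so it comes back as < on the return trip.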
Judkins answered 6/10, 2015 at 23:23 Comment(0)
A
2

Try XMLSerializer:

var div = document.getElementById('d1');

var pre = document.createElement('pre');
pre.textContent = div.outerHTML;
document.body.appendChild(pre);

pre = document.createElement('pre');
pre.textContent = new XMLSerializer().serializeToString(div);
document.body.appendChild(pre);

<div id="d1" data-foo="a &lt; b &amp;&amp; b &gt; c">This is a test</div>

You might need to adapt the XSLT to take account of the XHTML namespace XMLSerializer inserts (at least here in a test with Firefox).

Allnight answered 6/10, 2015 at 16:30 Comment(5)
This is closer to what I want, but it doesn't work in all browsers (IE8 doesn't have XMLSerializer). - Judkins
@murrayju, see this question on XMLSerializer. If you have to support (old) browsers with < 3% user share, you can, and in that case just use .xml. I think this solution by Martin Honnen is excellent :). - Salesgirl
@Abel, I don't think an xml property is implemented in IE or elsewhere for HTML DOM nodes, it only exists for MSXML DOM nodes. - Allnight
Yes, that's my point, you will have to make an exception for browsers that do not support XMLSerializer (how to do that is shown in the linked answer). - Salesgirl
While I didn't use this to directly solve my problem, it is probably a good idea to implement this as well, since it seems like it would be more reliable than innerHTML. Thanks for the help! - Judkins
H
0

I am not sure if this is what you are looking for, but do have a look.

var div1 = document.createElement('DIV');
var div2  = document.createElement('DIV');
div1.setAttribute('asdf','&lt;50');
div1.setAttribute('fdsa','&amp;50');
div2.appendChild(div1);
console.log(div2.innerHTML.replace(/&amp;/g, '&'));
Haffner answered 6/10, 2015 at 16:17 Comment(5)
I fail to see how this answers the question with escaped less-than characters inside attributes... And you probably don't want every ampersand replaced... - Salesgirl
Actually it converts &lt; and &amp; to &amp;lt; and &amp;amp; respectively. The replace function changes it back to its original format. - Haffner
Precisely my point. &amp; should not be replaced, &lt; should only be replaced if it is part of a value of a property as if the string were interpreted as XML. It should not replace other occurrences (text nodes, comment nodes, processing instructions, cdata sections, though some of these are rare in HTML). - Salesgirl
What I find frustrating about this is that setAttribute behaves differently than innerHTML for the same literal &lt;. I'm sure this is what @Salesgirl means when he says it is being interpreted "as HTML" in one case but not the other. - Judkins
@murrayju, yes, innerHTML is quite an unfortunate part of DOM. Almost all other DOM properties work on the DOM and as XML, but innerHTML does not. It is convenient in some cases, esp. as a setter, but it does not return XML (as you have found out the hard way). - Salesgirl
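
The difference between the two code paths discussed in these comments can be modeled without a DOM. The sketch below is a deliberately simplified model, not browser code: decodeAttrSource stands in for what the HTML parser does to attribute source text (only four named entities are handled), and serializeHtmlAttr stands in for how innerHTML wrote attribute values back out in the browsers discussed in this thread (escaping &, non-breaking space, and the double quote, but not < or >). Both function names are hypothetical, and serialization rules in newer browsers may differ.

```javascript
// Simplified model of the HTML parser decoding entities in attribute source text.
function decodeAttrSource(src) {
  var named = { lt: '<', gt: '>', amp: '&', quot: '"' };
  return src.replace(/&(lt|gt|amp|quot);/g, function (match, name) {
    return named[name];
  });
}

// Simplified model of innerHTML serializing an attribute value:
// "<" and ">" are left alone, which is the behavior the question describes.
function serializeHtmlAttr(value) {
  return value
    .replace(/&/g, '&amp;')
    .replace(/\u00A0/g, '&nbsp;')
    .replace(/"/g, '&quot;');
}

// innerHTML setter path: the source text is parsed (decoded) first.
var viaInnerHtml = serializeHtmlAttr(decodeAttrSource('&lt;50'));

// setAttribute path: the string is stored as-is, with no decoding.
var viaSetAttribute = serializeHtmlAttr('&lt;50');

console.log(viaInnerHtml);
console.log(viaSetAttribute);
```

Under this model the innerHTML path loses the escaping of < because the value is decoded on the way in and not re-escaped on the way out, while the setAttribute path gains an extra &amp; because the literal text &lt; is treated as ordinary characters.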
S
0

Several things worth mentioning that might help someone:

  • Make sure that your HTML is truly valid, e.g. I was accidentally using \ when I should have had / and it caused this problem.
  • As the OP pointed out in the question, you can use &amp;, so you might try e.g. &amp;lt; and &amp;gt;.
  • There are alternatives to < and > that look similar.
  • There is an alternate way to express < and >: &#60; and &#62;.
Scheers answered 10/11, 2020 at 3:47 Comment(0)