How do I prevent PHP's DOMDocument from encoding HTML entities?

I have a function that replaces the href attribute of anchors in a string using PHP's DOMDocument. Here's a snippet:

$doc        = new DOMDocument('1.0', 'UTF-8');
$doc->loadHTML($text);
$anchors    = $doc->getElementsByTagName('a');

foreach($anchors as $a) {
    $a->setAttribute('href', 'http://google.com');
}

return $doc->saveHTML();

The problem is that loadHTML($text) wraps $text in doctype, html, body, and other tags. I tried to work around this by using the following instead of loadHTML():

$doc        = new DOMDocument('1.0', 'UTF-8');
$node       = $doc->createTextNode($text);
$doc->appendChild($node);
...

Unfortunately, this escapes all the markup as entities (the anchor tags included). Does anyone know how to turn this off? I've already looked thoroughly through the docs and tried hacking it, but can't figure it out.

Thanks! :)

Scorekeeper answered 27/4, 2009 at 5:38 Comment(2)
For everyone visiting this in 2022: since PHP 5.4 (with libxml 2.7.8 or later), loadHTML supports options that prevent it from adding the surrounding HTML tags: $contentDom->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); - Mediaeval
See my answer here - https://mcmap.net/q/1471469/-how-can-i-prevent-html-entities-with-php-a-domdocument-savehtml - Stogy

"$text is a translated string with place-holder anchor tags"

If these placeholders have a strict, well-defined format, a simple preg_replace or preg_replace_callback might do the trick. I do not recommend manipulating HTML documents with regular expressions in general, but for a small, well-defined subset they are suitable.
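
As a rough sketch of that approach, assuming the placeholders are bare <a> tags with no attributes (the real placeholder format is not shown in the question), preg_replace_callback can fill in one href per placeholder. The function name fillAnchorPlaceholders and the sample URLs are made up for illustration.

```php
<?php
// Hypothetical helper. Assumption: every placeholder is exactly "<a>".
function fillAnchorPlaceholders(string $text, array $urls): string
{
    $urls = array_values($urls); // make sure the list is indexed 0..n-1
    $i = 0;

    return preg_replace_callback(
        '/<a>/',
        function (array $match) use (&$i, $urls): string {
            // Reuse the last URL when there are more placeholders than URLs,
            // and fall back to "#" when no URL was supplied at all.
            $url = $urls[min($i, count($urls) - 1)] ?? '#';
            $i++;

            // Escape the value so it is safe inside a double-quoted attribute.
            return '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">';
        },
        $text
    );
}

echo fillAnchorPlaceholders(
    'You can <a>click here</a> to visit Google.',
    ['http://google.com']
);
```

Because the pattern only matches the exact placeholder, anything else in the translated string is left untouched. A looser placeholder format would need a correspondingly stricter pattern.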

Jelly answered 27/4, 2009 at 19:18 Comment(0)

XML has only a few predefined entities. All your HTML entities are defined somewhere else. When you use loadHTML(), these entity definitions are loaded automatically; with loadXML() (or no load() at all) they are not.
createTextNode() does exactly what the name suggests. Everything you pass as the value is treated as text content, not as markup. That is, if you pass something that has a special meaning in markup (<, >, ...), it is encoded so that a parser can distinguish the text from actual markup (&lt;, &gt;, ...).
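
A small sketch of the difference, using a made-up sample string and hypothetical variable names. The same string is once added as a text node and once parsed as markup:

```php
<?php
$markup = '<b>bold</b>'; // sample input, chosen only for illustration

// 1) As text: createTextNode() stores the string as character data,
//    so the angle brackets are escaped when the node is serialized.
$asText = new DOMDocument('1.0', 'UTF-8');
$p = $asText->createElement('p');
$p->appendChild($asText->createTextNode($markup));
$asText->appendChild($p);
$escaped = $asText->saveHTML($p);

// 2) As markup: loadHTML() parses the string into element nodes.
$asMarkup = new DOMDocument('1.0', 'UTF-8');
$asMarkup->loadHTML('<p>' . $markup . '</p>');
$boldCount = $asMarkup->getElementsByTagName('b')->length;

echo $escaped, "\n";   // the <b> tag shows up as &lt;b&gt;
echo $boldCount, "\n"; // number of real <b> elements after parsing
```

In the first document there is no b element at all, only text that looks like one. That is why the anchors in the question came out escaped.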

Where does $text come from? Can't you do the replacement within the actual html document?

Jelly answered 27/4, 2009 at 9:35 Comment(1)
loadHTML, no entity translation occurs. I ended up hacking around the problem in a tenuous way by running mb_substr($text, 122, -19); on the result from $doc->saveHTML(). Yikes! :) $text is a translated string with place-holder anchor tags, so the replacement has to be done at runtime. I'd rather not parse the entire document as it would be difficult to parse only the translated links. Good idea though. - Scorekeeper

Here is a slightly less hacky solution to this issue, and it works perfectly.

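// Temporary attribute name that is unlikely to appear anywhere else in the markup
$TempAttributeName = 'gewrbamsbgadg';

// $dom  - the DOMDocument that was loaded
// $node - the <a> element (DOMElement) to modify

// Store the placeholder under the temporary name instead of href,
// since saveHTML() may URI-escape characters such as { and } in href values
$newAttr = $dom->createAttribute($TempAttributeName);
$newAttr->value = "{{your_placeholder_or_whatever}}";
$node->setAttributeNode($newAttr);
$node->removeAttribute('href');

// Serialize, then rename the temporary attribute back to href
$finalHTMLString = $dom->saveHTML();
$finalHTMLString = str_replace($TempAttributeName, 'href', $finalHTMLString);

echo $finalHTMLString;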
Stogy answered 18/6, 2023 at 12:0 Comment(0)

I ended up hacking this in a tenuous way, changing:

return $doc->saveHTML();

into:

$text       = $doc->saveHTML();
return mb_substr($text, 122, -19);

This cuts off the wrapper markup that saveHTML() adds (the first 122 and the last 19 characters), changing this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd"> <html><body><p>
You can <a href="http://www.google.com">click here</a> to visit Google.</p>
</body></html> 

into this:

You can <a href="http://www.google.com">click here</a> to visit Google.

Can anyone figure out something better?
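
One possibly less brittle alternative, sketched here under the assumption that PHP 5.3.6 or later is available (saveHTML() accepts a node argument from that version): wrap the fragment in a container element before loading it, then serialize only the container's children. The helper name replaceHrefs and the wrapping <div> are illustrative choices, not part of the original code.

```php
<?php
// Hypothetical helper: sets every anchor's href and returns the fragment
// without the doctype/html/body wrapper that loadHTML() adds.
function replaceHrefs($text, $href)
{
    $doc = new DOMDocument('1.0', 'UTF-8');

    // Wrap the fragment so its nodes can be found again after parsing.
    // Note: loadHTML() assumes ISO-8859-1 unless the markup declares a charset.
    $doc->loadHTML('<div>' . $text . '</div>');

    foreach ($doc->getElementsByTagName('a') as $a) {
        $a->setAttribute('href', $href);
    }

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Serialize the wrapper's children one by one, skipping the wrapper itself.
    $html = '';
    foreach ($wrapper->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }

    return $html;
}

echo replaceHrefs(
    'You can <a href="#">click here</a> to visit Google.',
    'http://google.com'
);
```

This avoids hard-coded offsets, so it should keep working if the length of the generated doctype or wrapper tags changes.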

Scorekeeper answered 27/4, 2009 at 17:54 Comment(0)

OK, here's the final solution I ended up with. Decided to go with VolkerK's suggestion.

public static function ReplaceAnchors($text, array $attributeSets)
{
    $expression = '/(<a)([\s\w\d:\/=_&\[\]\+%".?])*(>)/';

    if (empty($attributeSets) || !is_array($attributeSets)) {
        // no attributes to set. Set href="#".
        return preg_replace($expression, '$1 href="#"$3', $text);
    }

    $attributeStrs  = array();
    foreach ($attributeSets as $attributeKeyVal) {
        // loop thru attributes and set the anchor
        $attributePairs = array();
        foreach ($attributeKeyVal as $name => $value) {
            if (!is_string($value) && !is_int($value)) {
                continue; // skip
            }

            $name               = htmlspecialchars($name);
            $value              = htmlspecialchars($value);
            $attributePairs[]   = "$name=\"$value\"";
        }
        $attributeStrs[]    = implode(' ', $attributePairs);
    }

    $i      = -1;
    $pieces = preg_split($expression, $text);
    foreach ($pieces as &$piece) {
        if ($i === -1) {
            // skip the first token
            ++$i;
            continue;
        }

        // figure out which attribute string to use
        if (isset($attributeStrs[$i])) {
            // pick the parallel attribute string
            $attributeStr   = $attributeStrs[$i];
        } else {
            // pick the last attribute string if we don't have enough
            $attributeStr   = $attributeStrs[count($attributeStrs) - 1];
        }

        // build a new opening anchor tag for this piece.
        $piece  = '<a '.$attributeStr.'>'.preg_replace($expression, '$1 href="#"$3', $piece);
        ++$i;
    }

    unset($piece); // break the reference left over from the foreach loop

    return implode('', $pieces);
}

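This lets you call the function with a different set of attributes for each anchor. For example, passing array(array('href' => 'http://google.com')) as the second argument gives the first anchor that href, and any later anchors reuse the last set.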

Scorekeeper answered 27/4, 2009 at 21:52 Comment(0)
