Convert HTML with PHPWord: error - DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity
I am trying to convert HTML-formatted text into a Word document with PHPWord.

I created an HTML form with Summernote, which lets the user format text. The text is saved to the database with its HTML tags.

Next, I would like to use PHPWord to output the captured information into a Word document. Please see the code below:

$rational = DB::table('rationals')->where('qualificationheader_id', $qualId)->value('rational');

$wordTest = new \PhpOffice\PhpWord\PhpWord();
$newSection = $wordTest->addSection();
$newSection->getStyle()->setPageNumberingStart(1);

\PhpOffice\PhpWord\Shared\Html::addHtml($newSection, $rational);
$footer = $newSection->addFooter();
$footer->addText($curriculum->curriculum_code.'-'.$curriculum->curriculum_title);

$objectWriter = \PhpOffice\PhpWord\IOFactory::createWriter($wordTest, 'Word2007');
try {
    $objectWriter->save(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));
} catch (Exception $e) {
}

return response()->download(storage_path($curriculum->curriculum_code.'-'.$curriculum->curriculum_title.'.docx'));

Text saved in the database looks like this:

<p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial;"><span style="font-family: Arial;">The want for this qualification originated from the energy crisis in
South Africa in 2008 together with the fact that no existing qualifications
currently focuses on energy efficiency as one of the primary solutions.  </span><span style="font-family: Arial;">The fact that energy supply remains under
severe pressure demands the development of skills sets that can deliver the
necessary solutions.</span><span style="font-family: Arial;">  </span><o:p></o:p></span></p><p class="MsoNormal"><span lang="EN-GB" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; font-family: Arial;">This qualification addresses the need from Industry to acquire credible
and certified professionals with specialised skill sets in the energy
efficiency field. The need for this skill set has been confirmed as a global
requirement in few of the International commitment to the reduction of carbon

I get the error below:

ErrorException (E_WARNING) DOMDocument::loadXML(): Namespace prefix o on p is not defined in Entity, line: 1

Bunnie answered 24/9, 2018 at 16:6 Comment(0)

Problem

The parser is complaining that your text contains element tags with a namespace prefix that has never been declared, specifically the tag <o:p> (where o: is the prefix). These tags are formatting markup left behind by Microsoft Word.

Reproducing the problem

To reproduce this problem I had to dig a bit because it wasn't PHPWord that was throwing the exception, but DOMDocument that PHPWord is using. The code below is using the same parsing method that PHPWord is using and should output all warnings and notices about the code.

# Make sure to display all errors
ini_set("display_errors", "1");
error_reporting(E_ALL);

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';

# Set up and parse the code
$doc = new DOMDocument();
$doc->loadXML($html); # This is the line that's causing the warning.
# Print it back
echo $doc->saveXML();
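
If you would rather inspect the messages than have them raised as PHP warnings (which a framework may convert into an ErrorException, as in the question), libxml's internal error buffer can be used. The following is only a minimal sketch; collectXmlErrors is a hypothetical helper name and is not part of PHPWord.

```php
<?php
// Hypothetical helper (not part of PHPWord): collect libxml's messages
// instead of letting them surface as PHP warnings.
function collectXmlErrors(string $xml): array
{
    // Buffer libxml errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message) . ' (line ' . $error->line . ')';
    }

    // Restore the previous state so other code is not affected.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Print whatever libxml reports for the prefixed fragment.
foreach (collectXmlErrors('<o:p>Foo <o:b>Bar</o:b></o:p>') as $message) {
    echo $message, PHP_EOL;
}
```

A fragment without prefixes, such as '<p>Foo</p>', should come back with an empty list.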

Analysis

In a complete, well-formed document it's possible to declare the namespaces on the root element and thus tell the parser what these prefixes actually are. But the text being parsed here is only a fragment of HTML with no root element to carry such a declaration, so that isn't an option.

It could be possible to register the namespace with the DOMXPath so that PHPWord can use it. Unfortunately, the DOMXPath instance isn't exposed in PHPWord's public API, so that isn't possible either.

Instead, it appears the best approach is to strip the prefixes from the tags, and make the warning go away.

Edit 2018-10-04: I've since discovered a way to keep the prefix in the tags and still make the error go away, however the execution isn't optimal. If anyone can make a better solution, feel free to edit my post or leave a comment.

Solution

Based on the analysis, the solution is to remove the prefixes, which means we must pre-parse the code ourselves. Since PHPWord is using DOMDocument, we can use it too and be sure that we don't need to install any extra dependencies.

PHPWord parses the HTML with loadXML, which is the function that complains about the formatting. Our own pre-parsing step runs into the same complaint, so the error messages have to be suppressed there in both of the solutions below. This is done by passing an additional options parameter to loadXML or loadHTML.

Solution 1: Pre-parse as XML and remove the prefixes

The first approach will parse the html code as XML and recursively go through the tree and remove any occurrences of the prefix on the tag name.

I've created a class that should solve this problem.

class TagPrefixFixer {

    /**
      * @desc Removes all prefixes from tags
      * @param string $xml The XML code to replace against.
      * @return string The XML code with no prefixes in the tags.
    */
    public static function Clean(string $xml) {
        $doc = new DOMDocument();
        /* Load the XML */
        $doc->loadXML($xml,
            LIBXML_NOERROR |        # Suppress any errors
            LIBXML_NOWARNING        # or warnings about prefixes.
        );
        /* Run the code */
        self::removeTagPrefixes($doc);
        /* Return only the XML */
        return $doc->saveXML();
    }

    private static function removeTagPrefixes(DOMNode $domNode) {
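        /* Iterate over a snapshot of the children, because renaming
         * replaces nodes in the live childNodes list */
        foreach (iterator_to_array($domNode->childNodes) as $node) {
            /* Only element nodes have a tag name that can be renamed */
            if ($node->nodeType === XML_ELEMENT_NODE) {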
                /* Iterate recursively over the children.
                 * This is done before the renaming on purpose.
                 * If we rename this element, then the children, the element
                 * would need to be moved a lot more times due to how 
                 * renameNode works. */
                if($node->hasChildNodes()) {
                    self::removeTagPrefixes($node);
                }
                /* Check if the tag contains a ':' */
                if (strpos($node->tagName, ':') !== false) {
                    /* Get the last part of the tag name */
                    $parts = explode(':', $node->tagName);
                    $newTagName = end($parts);
                    /* Change the name of the tag */
                    self::renameNode($node, $newTagName);
                }
            }
        }
    }

    private static function renameNode($node, $newName) {
        /* Create a new node with the new name */
        $newNode = $node->ownerDocument->createElement($newName);
        /* Copy over every attribute from the old node to the new one */
        foreach ($node->attributes as $attribute) {
            $newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
        }
        /* Copy over every child node to the new node */
        while ($node->firstChild) {
            $newNode->appendChild($node->firstChild);
        }
        /* Replace the old node with the new one */
        $node->parentNode->replaceChild($newNode, $node);
    }
}

To use the code, just call the TagPrefixFixer::Clean function.

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print TagPrefixFixer::Clean($xml);

Output

<?xml version="1.0"?>
<p>Foo <b>Bar</b></p>

Solution 2: Pre-parse as HTML

I've noticed that if you use loadHTML instead of loadXML (which is what PHPWord uses), the parser removes the prefixes by itself when it loads the HTML.

This code is significantly shorter.

function cleanHTML($html) {
    $doc = new DOMDocument();
    /* Load the HTML */
    $doc->loadHTML($html,
            LIBXML_HTML_NOIMPLIED | # Make sure no extra BODY
            LIBXML_HTML_NODEFDTD |  # or DOCTYPE is created
            LIBXML_NOERROR |        # Suppress any errors
            LIBXML_NOWARNING        # or warnings about prefixes.
    );
    /* Immediately save the HTML and return it. */
    return $doc->saveHTML();
}

And to use this code, just call the cleanHTML function

$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print cleanHTML($html);

Output

<p>Foo <b>Bar</b></p>

Solution 3: Keep the prefixes and add the namespaces

I've tried wrapping the code in a root element that declares the Microsoft Office namespaces before feeding the data into the parser, and that also fixes the issue. Ironically, I haven't found a way to add the namespaces with the DOMDocument parser without raising the original warning, so the wrapper is built with plain string concatenation. The execution of this solution is a bit hacky, and I'd recommend building your own rather than using it as is. But you get the idea:

function addNamespaces($xml) {
    $root = '<w:wordDocument
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
        xmlns:o="urn:schemas-microsoft-com:office:office">';
    $root .= $xml;
    $root .= '</w:wordDocument>';
    return $root;
}

And to use this code, just call the addNamespaces function

$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print addNamespaces($xml);

Output

<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:o="urn:schemas-microsoft-com:office:office">
    <o:p>Foo <o:b>Bar</o:b></o:p>
</w:wordDocument>

This code can then be fed to the PHPWord function addHtml without causing any warnings.
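
To check the idea independently of PHPWord, one can compare how loadXML reacts to the bare fragment and to the same fragment inside a wrapper that declares the o prefix. This is a sketch under assumptions: loadsCleanly is a hypothetical helper, and the wrapper element name body is only an example chosen for illustration.

```php
<?php
// Hypothetical helper: true when loadXML parses the string
// without reporting any libxml warning or error.
function loadsCleanly(string $xml): bool
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);
    $errors = libxml_get_errors();

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $loaded && count($errors) === 0;
}

$fragment = '<o:p>Foo <o:b>Bar</o:b></o:p>';

// Assumed wrapper: any root element works as long as it declares xmlns:o.
$wrapped = '<body xmlns:o="urn:schemas-microsoft-com:office:office">'
    . $fragment
    . '</body>';

var_dump(loadsCleanly($fragment)); // bool(false): the prefix is undeclared
var_dump(loadsCleanly($wrapped));  // bool(true): the prefix is declared
```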

Optional solutions (deprecated)

In a previous version of this answer these were presented as (optional) solutions, and I'm leaving them here for completeness. Bear in mind that none of them are recommended; use them with caution.

Turn off warnings

Since it's "just" a warning and not a fatal halting exception, you could turn the warnings off. You can do this by including this code at the top of the script. This will however still slow down your application, and the best approach is always to make sure there are no warnings or errors.

// Show the default reporting except from warnings
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED & ~E_WARNING);

The settings are derived from the default reporting level.

Using regex

It is (probably) possible to get rid of most of the namespace prefixes with a regex on your text, either before saving it in the database or after fetching it for use in this function. Since the text is already stored in the database, it would be better to apply the code below after fetching it. However, the regex can miss some occurrences or, in the worst case, mess up the HTML.

The regex:

$text_after = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text_before);

Example:

$text = '<o:p>Foo <o:b>Bar</o:b></o:p>';
$text = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text);
echo $text; // Outputs '<p>Foo <b>Bar</b></p>'
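
To illustrate how the regex can miss occurrences: the pattern only matches a tag name that is immediately followed by = or >, so an opening tag with attributes keeps its prefix while its closing tag loses it. A small sketch, where stripPrefixesWithRegex is a hypothetical wrapper name around the regex above:

```php
<?php
// Hypothetical wrapper around the regex shown above.
function stripPrefixesWithRegex(string $text): string
{
    return preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text);
}

// The opening tag is followed by a space, so it is not matched.
echo stripPrefixesWithRegex('<o:p class="MsoNormal">Foo</o:p>'), PHP_EOL;
// Prints: <o:p class="MsoNormal">Foo</p>
```

The result has mismatched opening and closing tags, which is one more reason to prefer the parser-based solutions.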
Evette answered 1/10, 2018 at 19:22 Comment(4)
Regex on HTML? No! #1732848 - Cauca
You're right @delboy1978uk. I have remade the entire solution with another approach that should be more sustainable. - Evette
For anyone who has read to the bottom: I am going to test adding a namespace tag before parsing the data as well, to see whether that solves the issue without the need to suppress any warnings, but I won't have the time to do it until later today. - Evette
Follow-up on my earlier comment: It is possible to keep the prefixes and still parse the code without any warnings/errors. I've added the result as my third solution, even though the execution of it isn't optimal. - Evette
