Problem
The parser is complaining that your text uses namespace prefixes in its element tags, specifically the prefix on the tag <o:p> (where o: is the prefix). It appears to be formatting markup produced by Word.
Reproducing the problem
To reproduce this problem I had to dig a bit, because the exception wasn't thrown by PHPWord itself but by DOMDocument, which PHPWord uses internally. The code below uses the same parsing method as PHPWord and should output all warnings and notices about the code.
# Make sure to display all errors
ini_set("display_errors", "1");
error_reporting(E_ALL);
$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';
# Set up and parse the code
$doc = new DOMDocument();
$doc->loadXML($html); # This is the line that's causing the warning.
# Print it back
echo $doc->saveXML();
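If the warning is hard to spot in the output, libxml can also be asked to collect its errors so they can be inspected directly. This is a minimal sketch using the same sample markup as above; the $errors variable name is only for illustration.

```php
<?php
// Collect libxml errors in memory instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadXML('<o:p>Foo <o:b>Bar</o:b></o:p>');

// Each entry is a LibXMLError object with level, code, message and line.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    echo trim($error->message), PHP_EOL;
}
```

Each message should name the undeclared prefix and the tag it was found on.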
Analysis
For a complete, well-formed document it's possible to declare the namespaces on the root element and thus tell the parser what these prefixes actually mean. But since the input appears to be only a fragment of HTML, there is no such declaration to rely on.
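To illustrate the first point, here is a sketch in which the root element itself declares the o prefix. The namespace URI is the Office one that also appears in Solution 3 further down; the rest is the sample markup from the reproduction.

```php
<?php
libxml_use_internal_errors(true);

// The root element declares the "o" prefix, so the parser can resolve it.
$xml = '<o:p xmlns:o="urn:schemas-microsoft-com:office:office">Foo <o:b>Bar</o:b></o:p>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$errors = libxml_get_errors();
libxml_clear_errors();

echo count($errors), PHP_EOL;                      // 0
echo $doc->documentElement->namespaceURI, PHP_EOL; // urn:schemas-microsoft-com:office:office
```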
It might be possible to register the namespace with the DOMXPath object so that PHPWord can use it. Unfortunately, the DOMXPath instance isn't exposed in PHPWord's public API, so this isn't an option.
Instead, it appears the best approach is to strip the prefixes from the tags, and make the warning go away.
Edit 2018-10-04: I've since discovered a way to keep the prefix in the tags and still make the error go away, however the execution isn't optimal. If anyone can make a better solution, feel free to edit my post or leave a comment.
Solution
Based on the analysis, the solution is to remove the prefixes, which means we must pre-parse the code. Since PHPWord is using DOMDocument, we can use it too and be sure that we don't need to install any extra dependencies.
PHPWord parses the HTML with loadXML, which is the function that complains about the formatting. That method can be told to suppress the error messages, which we will have to do in both of the solutions. This is done by passing an additional options parameter to the loadXML and loadHTML functions.
Solution 1: Pre-parse as XML and remove the prefixes
The first approach parses the HTML code as XML, walks the tree recursively, and removes the prefix from every tag name that has one.
I've created a class that should solve this problem.
class TagPrefixFixer {
/**
* @desc Removes all prefixes from tags
* @param string $xml The XML code to replace against.
* @return string The XML code with no prefixes in the tags.
*/
public static function Clean(string $xml) {
$doc = new DOMDocument();
/* Load the XML. Only the generic libxml flags are passed;
 * the LIBXML_HTML_* flags are meant for loadHTML. */
$doc->loadXML($xml,
LIBXML_NOERROR | # Suppress any errors
LIBXML_NOWARNING # or warnings about prefixes.
);
/* Run the code */
self::removeTagPrefixes($doc);
/* Return only the XML */
return $doc->saveXML();
}
private static function removeTagPrefixes(DOMNode $domNode) {
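/* Iterate over a snapshot of the children, because renaming
 * replaces nodes in the live childNodes list */
foreach (iterator_to_array($domNode->childNodes) as $node) {
/* Only element nodes have a tag name that can carry a prefix */
if ($node->nodeType === XML_ELEMENT_NODE) {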
/* Iterate recursively over the children.
* This is done before the renaming on purpose.
* If we rename this element, then the children, the element
* would need to be moved a lot more times due to how
* renameNode works. */
if($node->hasChildNodes()) {
self::removeTagPrefixes($node);
}
/* Check if the tag contains a ':' */
if (strpos($node->tagName, ':') !== false) {
/* Get the last part of the tag name */
$parts = explode(':', $node->tagName);
$newTagName = end($parts);
/* Change the name of the tag */
self::renameNode($node, $newTagName);
}
}
}
}
private static function renameNode($node, $newName) {
/* Create a new node with the new name */
$newNode = $node->ownerDocument->createElement($newName);
/* Copy over every attribute from the old node to the new one */
foreach ($node->attributes as $attribute) {
$newNode->setAttribute($attribute->nodeName, $attribute->nodeValue);
}
/* Copy over every child node to the new node */
while ($node->firstChild) {
$newNode->appendChild($node->firstChild);
}
/* Replace the old node with the new one */
$node->parentNode->replaceChild($newNode, $node);
}
}
To use the code, just call the TagPrefixFixer::Clean function.
$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print TagPrefixFixer::Clean($xml);
Output
<?xml version="1.0"?>
<p>Foo <b>Bar</b></p>
Solution 2: Pre-parse as HTML
I've noticed that if you use loadHTML instead of the loadXML that PHPWord uses, the parser removes the prefixes itself while loading the HTML into the document.
This code is significantly shorter.
function cleanHTML($html) {
$doc = new DOMDocument();
/* Load the HTML */
$doc->loadHTML($html,
LIBXML_HTML_NOIMPLIED | # Make sure no extra BODY
LIBXML_HTML_NODEFDTD | # or DOCTYPE is created
LIBXML_NOERROR | # Suppress any errors
LIBXML_NOWARNING # or warnings about prefixes.
);
/* Immediately save the HTML and return it. */
return $doc->saveHTML();
}
And to use this code, just call the cleanHTML function:
$html = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print cleanHTML($html);
Output
<p>Foo <b>Bar</b></p>
Solution 3: Keep the prefixes and add the namespaces
I've tried wrapping the code in a root element that declares the Microsoft Office namespaces before feeding the data into the parser, and that also fixes the issue. Ironically, I haven't found a way to add the namespaces with the DOMDocument parser without raising the original warning. So the execution of this solution is a bit hacky, and I wouldn't recommend using it as-is; build your own instead. But you get the idea:
function addNamespaces($xml) {
$root = '<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office">';
$root .= $xml;
$root .= '</w:wordDocument>';
return $root;
}
And to use this code, just call the addNamespaces function:
$xml = '<o:p>Foo <o:b>Bar</o:b></o:p>';
print addNamespaces($xml);
Output
<w:wordDocument
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office">
<o:p>Foo <o:b>Bar</o:b></o:p>
</w:wordDocument>
This code can then be fed to the PHPWord function addHtml without causing any warnings.
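One way to check that claim outside of PHPWord is to parse the wrapped markup with DOMDocument and look at the collected libxml errors. This is only a sketch: wrapWithOfficeNamespaces is a hypothetical helper that mirrors addNamespaces under a different name, and it tests the parser, not PHPWord itself.

```php
<?php
// Hypothetical helper mirroring addNamespaces() under a different name.
function wrapWithOfficeNamespaces(string $xml): string {
    return '<w:wordDocument'
        . ' xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"'
        . ' xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"'
        . ' xmlns:o="urn:schemas-microsoft-com:office:office">'
        . $xml
        . '</w:wordDocument>';
}

// Collect parser errors instead of emitting warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadXML(wrapWithOfficeNamespaces('<o:p>Foo <o:b>Bar</o:b></o:p>'));

$errors = libxml_get_errors();
libxml_clear_errors();

echo count($errors) === 0 ? 'no parser errors' : 'parser errors found', PHP_EOL;
```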
Optional solutions (deprecated)
In a previous version of this answer these were presented as (optional) solutions; for the sake of completeness I'm leaving them here. Bear in mind that none of them is recommended, and they should be used with caution.
Turn off warnings
Since it's "just" a warning and not a fatal error, you could turn the warnings off. You can do this by including this code at the top of the script. This will, however, still slow down your application, and the best approach is always to make sure there are no warnings or errors.
// Show the default reporting except from warnings
error_reporting(E_ALL & ~E_NOTICE & ~E_STRICT & ~E_DEPRECATED & ~E_WARNING);
The settings are derived from the default reporting level.
Using regex
It is (probably) possible to get rid of most of the namespace prefixes with a regex on your text, either before saving it in the database or after fetching it for use in this function. Since it's already stored in the database, it would be better to use the code below after fetching it. The regex can, however, miss some occurrences or, in the worst case, mess up the HTML.
The regex:
$text_after = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text_before);
Example:
$text = '<o:p>Foo <o:b>Bar</o:b></o:p>';
$text = preg_replace('/[a-zA-Z]+:([a-zA-Z]+[=>])/', '$1', $text);
echo $text; // Outputs '<p>Foo <b>Bar</b></p>'
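To make the caveat concrete, here is a small sketch of a case the pattern misses. The class attribute is an assumption added for illustration; it is not part of the original sample.

```php
<?php
$pattern = '/[a-zA-Z]+:([a-zA-Z]+[=>])/';

// The opening tag has an attribute, so its name is followed by a space
// rather than by "=" or ">", and the pattern does not match it.
$text = '<o:p class="x">Foo</o:p>';
$result = preg_replace($pattern, '$1', $text);

echo $result, PHP_EOL; // <o:p class="x">Foo</p>
```

Only the closing tag is rewritten, so the opening and closing tags no longer match. This is the kind of breakage that makes the DOM-based solutions above the safer choice.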