XML parser error: entity not defined

6

39

I have searched Stack Overflow for this problem and found a few topics, but none of them gave me a solid answer.

I have a form that users submit, and each field's value is stored in an XML file. The XML is encoded as UTF-8.

Every now and then a user will copy/paste text from somewhere, and that's when I get the "entity not defined" error.

I realize XML only supports a select few entities and anything beyond that is not recognized - hence the parser error.
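To make the failure concrete, here is a minimal sketch. The helper name xmlParseErrors is made up for illustration; it only collects the libxml messages raised while SimpleXML parses a string. It compares a named HTML entity, its numeric equivalent, and one of the five predefined XML entities.

```php
<?php
// Hypothetical helper: returns the libxml error messages raised while parsing $xml.
function xmlParseErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of emitting warnings
    libxml_clear_errors();
    simplexml_load_string($xml);
    $messages = array_map(
        fn(LibXMLError $e) => trim($e->message),
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$named   = xmlParseErrors('<field>a&nbsp;b</field>');  // HTML-only named entity
$numeric = xmlParseErrors('<field>a&#160;b</field>');  // numeric character reference
$builtin = xmlParseErrors('<field>a &amp; b</field>'); // predefined XML entity

print_r($named); // the parser complaints for the named entity
```

Only the first call should report errors; the other two parse cleanly because numeric references and the predefined entities need no declaration.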

From what I gather, there are a few options:

  1. I can find and replace all &nbsp; and swap them out with &#160; or an actual space.
  2. I can place the code in question within a CDATA section.
  3. I can include these entities within the XML file.

What I'm doing is this: the user enters content into a form, it gets stored in an XML file, and that content then gets displayed as XHTML on a Web page (parsed with SimpleXML).

Of the three options, or any other option(s) I'm not aware of, what's really the best way to deal with these entities?

Thanks, Ryan

UPDATE

I want to thank everyone for the great feedback. I actually determined what caused my entity errors. All the suggestions made me look into it more deeply!

Some textboxes were plain old textboxes, but my textareas were enhanced with TinyMCE. On closer inspection, the PHP warnings always referenced data from the TinyMCE-enhanced textareas. Later I noticed that on a PC all the characters were stripped out (because it couldn't read them), but on a Mac you could see little square boxes showing the Unicode number of each character. The reason they showed up as squares on the Mac in the first place is that I used utf8_encode to encode data that wasn't in UTF-8 to prevent other parsing errors (which is somehow also related to TinyMCE).

The solution to all this was quite simple:

I added this line entity_encoding : "utf-8" in my tinyMCE.init. Now, all the characters show up the way they are supposed to.

I guess the only thing I don't understand is why the characters still show up when placed in plain textboxes, since nothing converts them to UTF-8 there, while with TinyMCE it was a problem.

Columbian answered 27/9, 2010 at 14:57 Comment(4)
Some important parts of your question are invisible because they got parsed as markup. Please surround those bits with backquotes (``). - Splayfoot
@LarsH: Hm, I don't see anything in the question source that would need this. - Uxorious
@Tomalak: "1. I can find and replace all ?? and swap them out with ?? or an actual space." Sure looks to me like something is missing there. - Splayfoot
@LarsH: Oh, you're right. I've not noticed these. Only a few more rep to go for you and you can edit questions yourself. :) - Uxorious
25

I agree that it is purely an encoding issue. In PHP, this is how I solved this problem:

  1. Before passing the html-fragment to SimpleXMLElement constructor I decoded it by using html_entity_decode.

  2. Then further encoded it using utf8_encode().

$headerDoc = '<temp>' . utf8_encode(html_entity_decode($headerFragment)) . '</temp>'; 
$xmlHeader = new SimpleXMLElement($headerDoc);

Now the above code does not throw any undefined entity errors.
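On newer PHP versions a variant may be safer. html_entity_decode can emit UTF-8 directly, so utf8_encode is not needed, and decoding &amp; or &lt; without re-escaping would leave a bare & or < in the XML. The sketch below is one way to handle both; the helper name decodeNonXmlEntities and the sample fragment are invented for illustration.

```php
<?php
// Hypothetical helper: turn named HTML entities into UTF-8 characters while
// keeping everything that is special in XML escaped.
function decodeNonXmlEntities(string $fragment): string
{
    return preg_replace_callback(
        '/&[A-Za-z][A-Za-z0-9]*;/',
        function (array $m): string {
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            // Re-escape &, <, >, " and ' so &amp;, &lt; and unknown names stay valid XML.
            return htmlspecialchars($decoded, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        },
        $fragment
    );
}

$headerFragment = '<p>Caf&eacute;&nbsp;&amp; bar &lt;3</p>'; // assumed sample input
$xmlHeader = new SimpleXMLElement('<temp>' . decodeNonXmlEntities($headerFragment) . '</temp>');
echo $xmlHeader->p, "\n";
```

Tags in the fragment are left alone, so the fragment itself still has to be well-formed markup.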

Carrissa answered 30/11, 2010 at 7:2 Comment(1)
You might be able to get away without using utf8_encode if you give "UTF-8" to html_entity_decode as the third parameter, e.g. html_entity_decode($headerFragment, ENT_QUOTES, "UTF-8") - Interactive
21

You could HTML-parse the text and have it re-escaped with the respective numeric entities only (for example, &nbsp; becomes &#160;). In any case, simply using unsanitized user input is a bad idea.

All of the numeric entities are allowed in XML, only the named ones known from HTML do not work (with the exception of &amp;, &quot;, &lt;, &gt;, &apos;).

Most of the time, though, you can just write the actual character (ö rather than &ouml;) to the XML file, so there is no need to use an entity reference at all. If you are using a DOM API to manipulate your XML (and you should!), this is your safest bet.
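As a rough illustration of the DOM route, here is a sketch using PHP's DOMDocument. The element names entry and field and the sample text are assumptions, and the input is assumed to be UTF-8 already.

```php
<?php
// Build the document through the DOM API instead of concatenating strings.
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('entry'));

// Assumed input: user text that has already been decoded to UTF-8.
$userText = "Sch\u{F6}n & gut <b>";

// createTextNode() treats the value as plain text; the serializer escapes & and <.
$field = $root->appendChild($doc->createElement('field'));
$field->appendChild($doc->createTextNode($userText));

$xml = $doc->saveXML();
echo $xml;
```

Because the text never passes through string concatenation, there is no entity reference left for the parser to trip over when the file is read back.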

Finally (this is the lazy developer solution) you could build a broken XML file (i.e. not well-formed, with entity errors) and just pass it through tidy for the necessary fix-ups. This may work or may fail depending on just how broken the whole thing is. In my experience, tidy is pretty smart, though, and lets you get away with a lot.

Uxorious answered 27/9, 2010 at 15:4 Comment(6)
"You could HTML-parse the text and have it re-escaped with the respective numeric entities" - does that mean you can always store numeric entities over HTML text entities? -Ryan - Columbian
@Ryan: Yes, numeric entities are allowed in (and recognized by) both XML and HTML. - Uxorious
@Uxorious That means I would have to know all the entities by name and their numeric entity beforehand, right? Is that going to be extremely processing intensive if I add them all in there? -Ryan - Columbian
@Ryan: There are functions that know all the entity names, you don't have to do that manually. That's what I meant by "HTML-parse". Use an HTML parser for this kind of work. - Uxorious
@Uxorious In one of your paragraphs you suggested that you can store the actual character, so technically, before writing it to the XML file, could I just use html_entity_decode to get the character? -Ryan - Columbian
@Uxorious When you say to use a HTML parser, is that something that's available PHP natively, or do I need a separate "plugin"? If so, can you recommend one? -Ryan - Columbian
5

1. I can find and replace all [&nbsp;?] and swap them out with [&#160;?] or an actual space.

This is a robust method, but it requires you to have a table of all the HTML entities (I assume the pasted input is coming from HTML) and to parse the pasted text for entity references.

2. I can place the code in question within a CDATA section.

In other words disable parsing for the whole section? Then you would have to parse it some other way. Could work.
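A small sketch of what that looks like in practice, assuming PHP's DOMDocument and a made-up sample string. The XML parser hands the CDATA text back untouched, so the entities still have to be decoded afterwards.

```php
<?php
$pasted = 'Caf&eacute;&nbsp;<b>bar</b>'; // assumed sample of pasted HTML

// Store the pasted text verbatim inside a CDATA section.
$doc = new DOMDocument('1.0', 'UTF-8');
$field = $doc->appendChild($doc->createElement('field'));
$field->appendChild($doc->createCDATASection($pasted));
$xml = $doc->saveXML();

// Reading it back: the parser returns the raw string, entities and tags included.
$read = new DOMDocument();
$read->loadXML($xml);
$raw = $read->documentElement->textContent;

// The "some other way" step: decode the HTML entities yourself.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, "\n";
```

Keep in mind that a literal ]]> in the text would close a hand-written CDATA section early.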

3. I can include these entities within the XML file.

You mean include the entity definitions? I think this is an easy and robust way, if you don't mind making the XML file quite a bit bigger. You could have an "included" file (find one on the web) which is an external entity, which you reference from the top of your main XML file.

One downside is that the XML parser you use has to be one that processes external entities (which not all parsers are required to do). And it must correctly resolve the (possibly relative) URL of the external entity to something accessible. This is not too bad but it may increase constraints on your processing tools.
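For a feel of how entity definitions work, here is a sketch that declares two entities in an internal DTD subset rather than in the external file described above, which sidesteps the URL-resolution concern. The element names and the choice of entities are assumptions.

```php
<?php
// Declare only the entities you expect; a complete HTML entity set would
// normally live in a separate file.
$xml = <<<'XML'
<!DOCTYPE entry [
  <!ENTITY nbsp "&#160;">
  <!ENTITY eacute "&#233;">
]>
<entry><field>Caf&eacute;&nbsp;bar</field></entry>
XML;

// LIBXML_NOENT asks libxml to substitute entity references while parsing.
// Assumption: the DOCTYPE is written by your own code, never taken from user input.
$entry = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOENT);
echo $entry->field, "\n";
```

Any entity that is not declared in the subset would still trigger the original error, so the list has to cover everything users might paste.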

4. You could forbid non-XML in the pasted content. Among other things, this would disallow entity references that are not predefined in XML (the 5 that Tomalak mentioned) or defined in the content itself. However this may violate the requirements of the application, if users need to be able to paste HTML in there.

5. You could parse the pasted content as HTML into a DOM tree by setting someDiv.innerHTML = thePastedContent; In other words, create a div somewhere (probably with display: none, except while debugging). Say you then have a JavaScript variable myDiv that holds this div element, and another variable myField that holds the element that is your input text field. Then in JavaScript you do

myDiv.innerHTML = myField.value;

which takes the unparsed text from myField, parses it into an HTML DOM tree, and sticks it into myDiv as HTML content.

Then you would use some browser-based method for serializing (= "de-parsing") the DOM tree back into XML. See for example this question. Then you send the result to the server as XML.

Whether you want to do this fix in the browser or on the server (as @Hannes suggested) will depend on the size of the data, how quick the response has to be, how beefy your server is, and whether you care about hackers sending not-well-formed XML on purpose.
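If you go the server route, the same parse-then-serialize idea can be sketched in PHP with DOMDocument. The helper name htmlFragmentToXml is hypothetical, and the XML declaration prefix is a commonly used workaround for making loadHTML read UTF-8 rather than documented behavior.

```php
<?php
// Hypothetical helper: parse a pasted HTML fragment and re-serialize it as XML.
function htmlFragmentToXml(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // sloppy HTML should not raise warnings
    // The prefix is a common workaround so that loadHTML() treats the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize each child of the wrapper div with the XML serializer.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $xml = '';
    foreach ($wrapper->childNodes as $child) {
        $xml .= $doc->saveXML($child); // named entities are resolved, <br> becomes <br/>
    }
    return $xml;
}

$xml = htmlFragmentToXml('Caf&eacute;&nbsp;bar<br>line two &amp; more');
echo $xml, "\n";
```

The HTML parser knows all the named entities, so no lookup table is needed. Whether non-ASCII characters come out as literal UTF-8 or as numeric references can depend on the libxml version, but both forms are valid XML.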

Splayfoot answered 27/9, 2010 at 15:24 Comment(4)
@Uxorious - why would &ouml; become &amp;ouml;? When the text is put into innerhtml, won't it get parsed into the dom as a single character o-umlaut? - Splayfoot
1. Would probably be too much overhead, right? 2. On second thought, this seems counterproductive, so I'm going to eliminate that option. 3. Besides the file being bigger, are there other downsides? If not, I'd say that's the way to go. 4. Yes, that would violate the requirements. 5. I don't understand this solution - can you provide more details? -Ryan - Columbian
Thank you for doing that! 3. The question is: would it cost more processor time string replacing these values or embedding a DTD that checks for entities? 5. OK, I understand now. I would like to do this on the server. -Ryan - Columbian
@Ryan - replacing the values yourself is probably faster, since DTD processing is much more general. But you'd have to test it to know for sure. - Splayfoot
4

Use "htmlentities()" with flag "ENT_XML1": htmlentities($value, ENT_XML1);

If you use "SimpleXMLElement" class:

$SimpleXMLElement->addChild($name, htmlentities($value, ENT_XML1));
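A runnable sketch of that idea follows. It uses htmlspecialchars instead of htmlentities, since only the XML-special characters need escaping here, and the element names and sample value are assumptions.

```php
<?php
$value = 'Tom & Jerry <b>"quoted"</b>'; // assumed sample input

$xml = new SimpleXMLElement('<entry/>');
// addChild() does not escape ampersands itself, so escape the value first.
// With ENT_XML1, only entities that XML predefines are produced.
$xml->addChild('field', htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8'));

echo $xml->asXML();
```

Note that an entity reference already present in the input, such as &nbsp;, gets its ampersand escaped too, so it is stored and read back as literal text rather than as a non-breaking space.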

Ginn answered 14/10, 2020 at 20:40 Comment(0)
2

If you want to convert all characters, this may help you (I wrote it a while back) :

http://www.lautr.com/convert-all-applicable-characters-to-numeric-entities-for-use-in-xml

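// Assumes $content is UTF-8 and the mbstring extension is loaded (mb_ord needs PHP 7.2+).

// Turn one named entity (&nbsp;) into its numeric form (&#160;).
function _convertAlphaEntitysToNumericEntitys($entity) {
  $decoded = html_entity_decode($entity[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
  // Unknown names come back unchanged: escape the ampersand so the XML stays well-formed.
  if ($decoded === $entity[0])
    return '&#38;' . substr($entity[0], 1);
  // A few HTML5 names decode to two code points, so encode each one.
  $numeric = '';
  foreach (preg_split('//u', $decoded, -1, PREG_SPLIT_NO_EMPTY) as $char)
    $numeric .= '&#' . mb_ord($char, 'UTF-8') . ';';
  return $numeric;
}

$content = preg_replace_callback(
  '/&([\w\d]+);/i',
  '_convertAlphaEntitysToNumericEntitys',
  $content);

// Turn one non-ASCII character into a numeric entity.
// ord() only sees a single byte, so mb_ord() is used to get the full code point.
function _convertAsciOver127toNumericEntitys($entity) {
  return '&#' . mb_ord($entity[0], 'UTF-8') . ';';
}

// The /u modifier makes the pattern match whole UTF-8 characters, not bytes.
$content = preg_replace_callback(
  '/[^\x00-\x7F]/u',
  '_convertAsciOver127toNumericEntitys', $content);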
Delija answered 27/9, 2010 at 15:9 Comment(2)
Well, if you apply "$content = preg_replace_callback('/&([\w\d]+);/i','_convertAlphaEntitysToNumericEntitys',$content);" all HTML entities (&nbsp; and whatnot) would be changed to numeric entities. After that apply "$content = preg_replace_callback('/[^\w\d ]/i','_convertAsciOver127toNumericEntitys', $content);" and every character above 127 (which is not handled by htmlspecialchars) is converted into a numeric entity. If I understood you wrong, can you please give an example snippet of input? - Delija
Sorry, I misunderstood what your code did. Deleting my earlier comment. - Splayfoot
0

This question is a general problem for any language that parses XML or JSON (so, basically, every language).

The above answers are for PHP, but a Perl solution would be as easy as...

my $excluderegex =
    '^\n\x20-\x20' .   # Don't Encode Spaces
       '\x30-\x39' .   # Don't Encode Numbers
       '\x41-\x5a' .   # Don't Encode Capitalized Letters
       '\x61-\x7a' ;   # Don't Encode Lowercase Letters

    # in case anything is already encoded
$value = HTML::Entities::decode_entities($value);

    # encode properly to numeric
$value = HTML::Entities::encode_numeric($value, $excluderegex);
Faxun answered 15/12, 2017 at 18:51 Comment(0)
