XML Exception: Invalid Character(s)
B

8

14

I am working on a small project that receives XML data in string form from a long-running application. I am trying to load this string data into an XDocument (System.Xml.Linq.XDocument), and then from there do some XML magic and create an xlsx file for a report on the data.

On occasion, I receive the data that has invalid XML characters, and when trying to parse the string into an XDocument, I get this error.

[System.Xml.XmlException] Message: '?', hexadecimal value 0x1C, is an invalid character.

Since I have no control over the remote application, you could expect ANY kind of character.

I am well aware that XML has a way to write characters as references, such as &#x1C; or something like that.

If at all possible I would SERIOUSLY like to keep ALL the data. If not, then let it be.

I have thought about editing the response string programmatically, then going back and trying to re-parse should an exception be thrown, but I have tried a few methods and none of them seem successful.

Thank you for your thought.

Code is something along the line of this:

TextReader tr;
XDocument doc;

string response; // XML string received from the server
...
tr = new StringReader(response);

try
{
    doc = XDocument.Load(tr);
}
catch (XmlException e)
{
    //handle here?
}
Bk answered 12/5, 2009 at 19:6 Comment(0)
U
11

XML can handle just about any character, but XML 1.0 excludes some ranges: most control codes below 0x20 (everything except tab, line feed, and carriage return), the surrogate blocks, and U+FFFE and U+FFFF.

Your best bet, if you can't get them to fix their output, is to sanitize the raw data you're receiving. You need to replace illegal characters with the character reference format you noted.

(You can't even resort to CDATA, as there is no way to escape these characters there.)
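
A minimal sketch of that sanitizing step, assuming the whole response is available as a string. The class and method names (XmlSanitizer, EscapeInvalidChars) are made up for illustration. Note that a default XmlReader also rejects a reference such as &#x1C;, so the escaped string still has to be read with XmlReaderSettings.CheckCharacters set to false. The sketch also assumes the bad characters sit in element text or attribute values; inside CDATA, comments, or names a reference would not be interpreted.

```csharp
using System.Text;

// Hypothetical helper; the names are illustrative, not from the original posts.
public static class XmlSanitizer
{
    // True for UTF-16 code units that XML 1.0 accepts as literal text.
    // Surrogates are let through so characters above U+FFFF survive;
    // a lone (unpaired) surrogate would still be rejected by the parser.
    private static bool IsAllowed(char c) =>
        c == '\t' || c == '\n' || c == '\r' ||
        (c >= '\u0020' && c <= '\uD7FF') ||
        (c >= '\uE000' && c <= '\uFFFD') ||
        char.IsSurrogate(c);

    // Rewrites each disallowed character as a hexadecimal character
    // reference, for example 0x1C becomes "&#x1C;".
    public static string EscapeInvalidChars(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        foreach (char c in xml)
        {
            if (IsAllowed(c))
            {
                sb.Append(c);
            }
            else
            {
                sb.Append("&#x").Append(((int)c).ToString("X")).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

Calling XmlSanitizer.EscapeInvalidChars(response) before parsing keeps the offending values in the document rather than dropping them. Edge cases such as U+0000 are not covered by this sketch.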

Upland answered 12/5, 2009 at 19:26 Comment(0)
A
25

You can use the XmlReader and set the XmlReaderSettings.CheckCharacters property to false. This will let you read the XML file despite the invalid characters. From there you can pass it to an XmlDocument or XDocument object.

You can read a little more about it in my blog.

To load the data to a System.Xml.Linq.XDocument it will look a little something like this:

XDocument xDocument = null;
XmlReaderSettings xmlReaderSettings = new XmlReaderSettings { CheckCharacters = false };
using (XmlReader xmlReader = XmlReader.Create(filename, xmlReaderSettings))
{
    xmlReader.MoveToContent();
    xDocument = XDocument.Load(xmlReader);
}

More information can be found here.
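
Since the question has the XML in a string rather than a file, the same idea can be sketched with a StringReader. This is an illustrative adaptation, not the answerer's code; the names LenientLoader and LoadFromString are assumptions. According to the CheckCharacters documentation, setting it to false turns off checking of character entity references, while literal text content is still validated. A raw 0x1C in the input would therefore need to be rewritten as &#x1C; before this loader accepts it.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the names are illustrative, not from the original posts.
public static class LenientLoader
{
    // Parses an XML string while tolerating character references
    // (such as "&#x1C;") that point at characters XML 1.0 disallows.
    public static XDocument LoadFromString(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        using (var stringReader = new StringReader(xml))
        using (var xmlReader = XmlReader.Create(stringReader, settings))
        {
            return XDocument.Load(xmlReader);
        }
    }
}
```

For example, LenientLoader.LoadFromString("<a>x&#x1C;y</a>") should yield a root element whose value contains the 0x1C character. Writing such a document back out presumably needs an XmlWriter whose XmlWriterSettings.CheckCharacters is also false; that part is not shown here.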

Algetic answered 2/8, 2013 at 15:40 Comment(1)
Related links on MSDN: XmlReaderSettings.CheckCharacters Property and XmlReader.MoveToContent Method. - Lemonade
E
10

Would something as described in this blog post be helpful?

Basically, he creates a sanitizing XML stream.

Ephebe answered 12/5, 2009 at 19:13 Comment(4)
Actually, he's processing an XML document all at once, as a string. - Candelaria
@Matthew, yeah, that's the example where he calls .ReadToEnd(), but you could just use .Read(), etc. My guess is the OP will need to do what you said. - Ephebe
That link was extremely useful. - Bk
I just noticed the XmlSanitizingStream towards the bottom of the blog post. My mistake. - Candelaria
P
0

If your input is not XML, you should use something like Tidy or Tagsoup to clean the mess up.

They would take any input and try, hopefully, to make a useful DOM from it.

I don't know what the equivalent libraries are called on the .NET side.

Piccolo answered 12/5, 2009 at 19:10 Comment(0)
O
0

Garbage In, Garbage Out. If the remote application is sending you garbage, then that's all you'll get. If they think they're sending XML, then they need to be fixed. In this case, you're not doing them any favors by working around their bug.

You should also make sure of what they think they're sending. What did the %1C mean to them? What did they want it to be?

Orit answered 12/5, 2009 at 19:15 Comment(2)
I wish I was in a position to fix their bug, but I'm not... The bug comes from unfiltered user input... Some users decide to put some super weird characters in there... and it accepts it... - Bk
My recommendation would be to reject the garbage, then produce a report showing what got rejected. Then send that report to the owner of the buggy code, at least once per month. - Orit
A
0

IMHO the best solution would be to modify the code/program/whatever produced the invalid XML that is being fed to your program. Unfortunately this is not always possible. In this case you need to escape all characters below 0x20, other than tab (0x09), line feed (0x0A), and carriage return (0x0D), before trying to load the document.

Agnosia answered 12/5, 2009 at 19:15 Comment(0)
C
0

If you really can't fix the source XML data, consider taking an approach like I described in this answer. Basically, you create a TextReader subclass (e.g. StripTextReader) that wraps an existing TextReader (tr) and discards invalid characters.
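
A rough sketch of what such a wrapper might look like. Only the name StripTextReader comes from the answer; the implementation below is an assumption, not the linked code. It overrides the single-character Read and Peek and relies on the base TextReader routing its buffered reads through Read().

```csharp
using System;
using System.IO;

// Illustrative implementation; only the class name comes from the answer.
public sealed class StripTextReader : TextReader
{
    private readonly TextReader _inner;

    public StripTextReader(TextReader inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    // True for UTF-16 code units that XML 1.0 accepts as literal text.
    // Surrogates pass through so characters above U+FFFF are kept;
    // a lone surrogate would still be rejected by the parser.
    private static bool IsAllowed(char c) =>
        c == '\t' || c == '\n' || c == '\r' ||
        (c >= '\u0020' && c <= '\uD7FF') ||
        (c >= '\uE000' && c <= '\uFFFD') ||
        char.IsSurrogate(c);

    // Returns the next allowed character, silently dropping disallowed ones.
    public override int Read()
    {
        int next;
        do
        {
            next = _inner.Read();
        } while (next != -1 && !IsAllowed((char)next));
        return next;
    }

    // Consumes disallowed characters so the peeked value is always one
    // that Read() would return next.
    public override int Peek()
    {
        int next;
        while ((next = _inner.Peek()) != -1 && !IsAllowed((char)next))
        {
            _inner.Read();
        }
        return next;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _inner.Dispose();
        }
        base.Dispose(disposing);
    }
}
```

With the question's variables this would be used as doc = XDocument.Load(new StripTextReader(tr));. Unlike escaping, this drops the offending characters, so it does not keep all the data.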

Candelaria answered 12/5, 2009 at 19:20 Comment(1)
Your answer implies that the characters really are garbage. That all he needs to do is discard them. I suggested he should first find out what those characters are meant to be. - Orit
R
0

It's a late answer, but it may help someone. When you read or serialize XML, it may have one invisible character at the beginning (typically a byte order mark). XDocument doesn't like this invisible character.

So while reading the XML, just start reading from the first < character:

var myXml = XDocument.Parse(loadedString.Substring(loadedString.IndexOf("<")));

That's it and it loads just fine.

Romelda answered 5/7, 2022 at 4:30 Comment(0)
