Regex to remove xml declaration from a string

First of all, I know this is a bad solution and I shouldn't be doing this.

Background: Feel free to skip


However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.

The way this is done (iterating over every character and doing an indexOf for <?xml each time) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (build the XML using XML documents or something similar), but for today I need a quick fix to replace what's there.

Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.


Question

My thought is to use a regex to find the declarations. I was planning on: <\?xml.*?>, then using Regex.Replace(input, pattern, string.Empty) to remove them.

Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better.
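
For illustration, the paired-IndexOf loop mentioned above might look roughly like this. It is only a sketch: the class name XmlDeclarationStripper and the method name Strip are made up for the example, and it assumes every declaration ends at the first "?>" that follows it. Note that, like the regex, it would also strip anything else starting with "<?xml", such as an <?xml-stylesheet ...?> instruction.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original system.
public static class XmlDeclarationStripper
{
    // Removes every "<?xml ... ?>" span in one forward pass over the input.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";

        var result = new StringBuilder(input.Length);
        int position = 0;

        while (true)
        {
            int start = input.IndexOf(open, position, StringComparison.Ordinal);
            if (start < 0) break; // no more declarations

            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // unterminated declaration: leave the rest untouched

            // Keep the text that precedes the declaration, then skip past "?>".
            result.Append(input, position, start - position);
            position = end + close.Length;
        }

        // Copy whatever follows the last declaration.
        result.Append(input, position, input.Length - position);
        return result.ToString();
    }
}
```

Each search resumes where the previous one stopped, so the input is scanned once rather than once per character.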

EDIT I need to take care of newlines.

Would: <\?xml[^>]*?> do the trick?
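
On the newline point: in .NET, `.` does not match `\n` unless RegexOptions.Singleline is set, whereas a negated class such as `[^>]` matches newlines without any option. A small sketch of both variants follows; the class and member names are invented for the example, and it assumes no `>` appears inside a declaration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical names, for illustration only.
public static class DeclarationRegexes
{
    // '.' crosses line breaks only because Singleline is set.
    public static readonly Regex DotSingleline =
        new Regex(@"<\?xml.*?\?>", RegexOptions.Singleline | RegexOptions.Compiled);

    // The negated class matches newlines by itself, but stops at the first '>'.
    public static readonly Regex NegatedClass =
        new Regex(@"<\?xml[^>]*?>", RegexOptions.Compiled);

    public static string StripWithSingleline(string input) =>
        DotSingleline.Replace(input, string.Empty);

    public static string StripWithNegatedClass(string input) =>
        NegatedClass.Replace(input, string.Empty);
}
```

Without RegexOptions.Singleline, the `.*?` pattern would simply fail to match a declaration that spans more than one line.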

EDIT2

Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both a regex and IndexOf(). I found that, for our simplest use case, just the declaration stripping took:

  • Nearly a second as it was
  • .01 of a second with the regex
  • untimable using a loop and IndexOf()

So I went for IndexOf(), as it's a very simple loop.

Custer answered 8/11, 2010 at 15:26 Comment(3)
Check and see if it is actually valid XML first - if so then there are better and still efficient answers. Is there even an attempt at an XSD? Triumphal
@annakata: no XSD, it's all just been hand-written / cobbled together, or so it seems. I suspect the first thing that would happen if attempting to parse this would be that it'd fail because of multiple XML declarations. My plan, ultimately, is to make each class serialise itself to a string via an XmlDoc or equivalent to get valid XML out. Custer
It probably will fail, but the first step is to check and the second thing to do is to double check that something like beautifulsoup won't save you. Triumphal

You probably want either this: <\?xml.*\?> or this: <\?xml.*?\?>, because the way you have it now, the regex is not looking for '?>' but just for '>'. I don't think you want the first option, because it's greedy and it will remove everything between the first occurrence of '<?xml' and the last occurrence of '?>'. The second option will work as long as you don't have nested XML tags. If you do, it will remove everything between the first '<?xml' and the first '?>', which goes wrong if another '<?xml' tag opens in between.
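
The difference between the two patterns can be seen in a short sketch. The class name and the sample string are invented for the example; the patterns are the two given above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the sample input is made up.
public static class GreedyVersusLazy
{
    public const string Sample =
        "<?xml version=\"1.0\"?><a>1</a><?xml version=\"1.0\"?><b>2</b>";

    // Greedy: matches from the first "<?xml" to the LAST "?>" on the line.
    public static string StripGreedy(string input) =>
        Regex.Replace(input, @"<\?xml.*\?>", string.Empty);

    // Lazy: each match ends at the nearest "?>".
    public static string StripLazy(string input) =>
        Regex.Replace(input, @"<\?xml.*?\?>", string.Empty);
}
```

With this sample, StripGreedy returns "<b>2</b>" (the <a> element is lost along with both declarations), while StripLazy returns "<a>1</a><b>2</b>".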

Also, I don't know how regexes are implemented in .NET, but I seriously doubt if they're faster than using indexOf.

Crowd answered 8/11, 2010 at 15:39 Comment(2)
Thanks. I just added the /i modifier to make it case-insensitive: /<\?xml.*\?>/i Cheops
@ArthurShlain XML tags are case-sensitive. You might have a good reason to go beyond the spec, but we should still acknowledge it: /i matches more than an XML parser will allow. Writing
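// Removes everything up to and including the first "?>"; only correct when strXML starts with a declaration and "?>" is present.
strXML = strXML.Remove(0, strXML.IndexOf(@"?>", 0) + 2);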
Poleaxe answered 19/1, 2016 at 11:26 Comment(0)
