First of all, I know this is a bad solution and I shouldn't be doing this.
Background: Feel free to skip
However, I need a quick fix for a live system. We currently have a data structure which serialises itself to a string by creating "xml" fragments via a series of string builders. Whether this is valid XML I rather doubt. After creating this xml, and before sending it over a message queue, some clean-up code searches the string for occurrences of the xml declaration and removes them.
The way this is done (iterating over every character and doing an indexOf for <?xml) is so slow it's causing thread timeouts and killing our systems. Ultimately I'll be trying to fix this properly (building the XML using XML documents or something similar), but for today I need a quick fix to replace what's there.
Please bear in mind, I know this is a far from ideal solution, but I need a quick fix to get us back up and running.
Question
My thought is to use a regex to find the declarations. I was planning on <\?xml.*?> and then using Regex.Replace(input, string.Empty) to remove them.
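For reference, the planned approach could look roughly like the sketch below. It uses the pattern exactly as written above with the static Regex.Replace(input, pattern, replacement) overload; the class and method names are hypothetical, chosen only for illustration. Note that a pattern starting with <\?xml would also match processing instructions such as <?xml-stylesheet ...?>, should any appear in the input.

```csharp
using System.Text.RegularExpressions;

public static class DeclarationRegexSketch
{
    // Pattern from the question: "<?xml", then as few characters as
    // possible (lazy), then the first ">". Without RegexOptions.Singleline,
    // '.' does not match '\n'.
    private const string Pattern = @"<\?xml.*?>";

    // Hypothetical helper: removes every match of Pattern from the input.
    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, string.Empty);
    }
}
```

Calling DeclarationRegexSketch.Strip on a string with declarations embedded at several positions should return the string with each declaration removed and everything else left in place.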
Could you let me know if there are any glaring problems with this regex, or whether just writing it in code using string.IndexOf("<?xml") and string.IndexOf("?>") pairs in a (much saner) loop is better?
EDIT I need to take care of newlines.
Would <\?xml[^>]*?> do the trick?
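A small sketch of the two usual ways to let the match span line breaks in .NET, assuming the declaration itself may contain a newline. The negated class [^>] matches any character except '>', newlines included; alternatively, RegexOptions.Singleline makes '.' match '\n' as well. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NewlineHandlingSketch
{
    // Option 1: negated character class. [^>] matches '\n' too,
    // so no regex option is needed.
    public static string StripWithNegatedClass(string input)
    {
        return Regex.Replace(input, @"<\?xml[^>]*?>", string.Empty);
    }

    // Option 2: keep '.', but pass RegexOptions.Singleline so that
    // '.' also matches '\n'. This variant requires the closing "?>".
    public static string StripWithSingleline(string input)
    {
        return Regex.Replace(input, @"<\?xml.*?\?>", string.Empty,
                             RegexOptions.Singleline);
    }
}
```

Both variants should remove a declaration whose attributes are split across two lines; which one reads better is largely a matter of taste.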
EDIT2
Thanks for the help. Regex-wise, <\?xml.*?\?> worked fine. I ended up writing some timing code and testing both approaches, a regex and IndexOf(). I found that for our simplest use case, JUST the declaration stripping took:
- Nearly a second as it was
- 0.01 of a second with the regex
- too fast to time using a loop and IndexOf()
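The timing code itself is not shown here; a minimal harness along these lines, built on System.Diagnostics.Stopwatch, is one way such a comparison could be made. The class name, method name, and parameters are all hypothetical, and the figures above come from the original measurements, not from this sketch.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: runs `strip` on `input` `iterations` times
    // and returns the total elapsed time.
    public static TimeSpan Time(Func<string, string> strip, string input, int iterations)
    {
        strip(input); // warm-up call so JIT compilation is not counted

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(input);
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Running several iterations helps when a single call finishes faster than the timer can resolve.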
So I went for IndexOf(), as it's a very simple loop.
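For anyone landing here later, a loop of this kind might look like the sketch below. This is not the exact code that went into production; the class and method names are hypothetical, and it assumes that an unterminated <?xml should be left untouched. It copies the text between declarations into a StringBuilder in a single pass, so the input is never rescanned from the start.

```csharp
using System;
using System.Text;

public static class DeclarationStripper
{
    // Hypothetical helper: removes every "<?xml ... ?>" span from the input.
    public static string Strip(string input)
    {
        const string open = "<?xml";
        const string close = "?>";

        int start = input.IndexOf(open, StringComparison.Ordinal);
        if (start < 0) return input; // nothing to remove

        var result = new StringBuilder(input.Length);
        int position = 0; // first character not yet copied or skipped
        while (start >= 0)
        {
            int end = input.IndexOf(close, start + open.Length, StringComparison.Ordinal);
            if (end < 0) break; // assumption: keep an unterminated declaration as-is

            // Copy the text before the declaration, then skip past its "?>".
            result.Append(input, position, start - position);
            position = end + close.Length;
            start = input.IndexOf(open, position, StringComparison.Ordinal);
        }

        // Copy whatever follows the last removed declaration.
        result.Append(input, position, input.Length - position);
        return result.ToString();
    }
}
```

StringComparison.Ordinal is used deliberately: the culture-sensitive default of string.IndexOf(string) is slower and not needed for matching fixed markup.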