Conditionally escape special xml characters
F

3

6

I have looked around a lot but have not been able to find a built-in .NET method that escapes the special XML characters (<, >, &, ' and ") only when they are not part of a tag.

For example, take the following text:

Test& <b>bold</b> <i>italic</i> <<Tag index="0" />

I want it to be converted to:

Test&amp; <b>bold</b> <i>italic</i> &lt;<Tag index="0" />

Notice that the tags are not escaped. I need to assign this value to the InnerXml property of an XmlElement, so those tags must be preserved.
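To make the requirement concrete, here is a minimal sketch of why the raw text cannot be assigned as-is. It assumes the standard System.Xml behaviour that the InnerXml setter throws an XmlException when the fragment is not well-formed; the class name InnerXmlDemo, the method TrySetInnerXml and the element name Message are made up for illustration.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the original post.
public static class InnerXmlDemo
{
    // Returns true if the fragment could be assigned to InnerXml,
    // false if the XML parser rejected it as not well-formed.
    public static bool TrySetInnerXml(string fragment)
    {
        var doc = new XmlDocument();
        XmlElement element = doc.CreateElement("Message"); // assumed element name

        try
        {
            // InnerXml parses the string as markup, so stray & and < must
            // already be escaped, while real tags must be left alone.
            element.InnerXml = fragment;
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, the unescaped example text should be rejected, while the desired output from the question should be accepted.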

I have looked into implementing my own parser, using a StringBuilder to optimize it as much as I can, but it gets pretty nasty.

I also know which tags are acceptable, which may simplify things (only: br, b, i, u, blink, flash, Tag). In addition, these tags can be self-closing tags (e.g. <u />) or container tags (e.g. <u>...</u>).
Fanni answered 19/12, 2012 at 22:20 Comment(6)
HTML is not XML... consider <b>foo <i>bar</b> really <br></i>. You are in for plenty of fun if you want to do that yourself. As an option, consider HtmlAgilityPack to parse the HTML into a reasonable tree and carefully insert all nodes into XML...Checkbook
Nothing you could do simply would correctly handle Test Value is < 3 but > 1.Aloeswood
@Aloeswood < 3 isn't a valid start tag, so you could figure that out. But your point still stands, < and > are escaped to remove ambiguity in parsing. There are going to be cases where any reasonable parser would choose one path, while you may have wanted another.Cherianne
@Aloeswood I just edited the post. I already know the type of tags that are acceptable. In addition, < 3 but > 1 will have to be escaped because an element cannot start with whitespaceFanni
@Fanni It's not the best example, but the point was valid. I could have said <3 but >1. Allowing only a known list of tags makes it much easier, though.Aloeswood
What exactly do you want to do with the escaped characters? If you're trying to add them to XML, then just use LINQ to XML or one of the other XML APIs to write the text. They know how to escape it.Grueling
C
3

NOTE: This could probably be optimised; it was just something I knocked up quickly for you. Also note that I am not validating the tags themselves: the pattern just looks for content wrapped in angle brackets. It will also fail if an angle bracket appears within a tag (e.g. <sometag label="I put an > here"> ). Other than that, I think it should do what you're asking for.

namespace ConsoleApplication1
{
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // This is the test string.
            const string testString = "Test& <b>bold</b> <i>italic</i> <<Tag index=\"0\" />";

            // Do a regular expression search and replace. We're looking for a complete tag (which will be ignored) or
            // a character that needs escaping.
            string result = Regex.Replace(testString, @"(?'Tag'\<{1}[^\>\<]*[\>]{1})|(?'Ampy'\&[A-Za-z0-9]+;)|(?'Special'[\<\>\""\'\&])", (match) =>
                {
                    // If a special (escapable) character was found, replace it.
                    if (match.Groups["Special"].Success)
                    {
                        switch (match.Groups["Special"].Value)
                        {
                            case "<":
                                return "&lt;";
                            case ">":
                                return "&gt;";
                            case "\"":
                                return "&quot;";
                            case "\'":
                                return "&apos;";
                            case "&":
                                return "&amp;";
                            default:
                                return match.Groups["Special"].Value;
                        }
                    }

                    // Otherwise, just return what was found.
                    return match.Value;
                });

            // Show the result.
            Console.WriteLine("Test String: " + testString);
            Console.WriteLine("Result     : " + result);
            Console.ReadKey();
        }
    }
}
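As a follow-up to the point raised in the comments below, here is a hedged sketch of the same callback idea restricted to the tag whitelist from the question. It is only one possible variant, not the answerer's code: the names ConditionalXmlEscaper, AllowedTags and Escape are invented for illustration, tag names are matched case-sensitively, and it still assumes no < or > appears inside an attribute value.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class ConditionalXmlEscaper
{
    // Whitelist taken from the question; adjust as needed.
    private const string AllowedTags = "br|blink|b|i|u|flash|Tag";

    // Tag:     an opening, closing or self-closing whitelisted tag (left untouched).
    // Entity:  something that already looks like an entity, e.g. &amp; or &#38; (left untouched).
    // Special: any remaining character that needs escaping.
    private static readonly Regex Tokens = new Regex(
        @"(?'Tag'</?(?:" + AllowedTags + @")(?:\s[^<>]*)?/?>)" +
        @"|(?'Entity'&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)" +
        @"|(?'Special'[<>&""'])");

    public static string Escape(string input)
    {
        return Tokens.Replace(input, match =>
        {
            // Whitelisted tags and existing entities pass through unchanged.
            if (!match.Groups["Special"].Success)
            {
                return match.Value;
            }

            switch (match.Value)
            {
                case "<": return "&lt;";
                case ">": return "&gt;";
                case "&": return "&amp;";
                case "\"": return "&quot;";
                default: return "&apos;"; // the only remaining case is '
            }
        });
    }
}
```

Calling ConditionalXmlEscaper.Escape on the test string above should give the output requested in the question, and a tag outside the whitelist such as <Invalid> has its angle brackets escaped rather than preserved.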
Colza answered 19/12, 2012 at 22:54 Comment(4)
That breaks valid HTML. For example, it converts &amp; to &amp;amp;.Conflagrant
@NigelWhatling Very nice, well done! The only flaw is that unsupported tags don't get escaped (e.g. <Invalid> is not escaped)Fanni
@Fanni Thanks. I responded before you edited your original question and added the defined set of tags. It's not hard to change the regular expression to only capture that set of tags and escape everything else.Colza
This is a really nice solution. I ran into this exact scenario today and wanted to let you know that it's still working. Thanks!!Dressy
H
2

I personally don't think it is possible, because you are really trying to fix malformed HTML, and therefore there are no rules which you can use to determine what is to be encoded and what isn't.

Any which way you look at it, something like <<Tag index="0" /> is not valid HTML.

If you know the actual tags, you may be able to create a whitelist, which could simplify things, but you are going to have to attack your problem more specifically. I do not think you will be able to solve this for every scenario.

In fact, chances are you haven't actually got any random < or > lying around in your text, and that would (probably) greatly simplify the problem. But if you are really trying to come up with a generic solution... I wish you luck.

Homothermal answered 19/12, 2012 at 22:40 Comment(1)
It wouldn't be possible, except he's already allowing just a very small set of valid tags.Aloeswood
A
1

Here's a regular expression you can use that will match any invalid < or >.

(\<(?! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))|(?<! ?/?(?:b|i|br|u|blink|flash|Tag[^>]*))\>)

I suggest putting the valid tag-test expression into a variable and then constructing the rest around it.

var validTags = "b|i|br|u|blink|flash|Tag[^>]*";
var startTag = @"\<(?! ?/?(?:" + validTags + "))";
var endTag = @"(?<! ?/?(?:" + validTags + @"))\>";

Then just do Regex.Replace on these.

Aloeswood answered 19/12, 2012 at 23:19 Comment(0)