Remove empty XML tags
I am looking for a good approach that can remove empty tags from XML efficiently. What do you recommend? Regex? XDocument? XmlTextReader?

For example,

const string original = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
    <pet>
        <cat>Tom</cat>
        <pig />
        <dog>Puppy</dog>
        <snake></snake>
        <elephant>
            <africanElephant></africanElephant>
            <asianElephant>Biggy</asianElephant>
        </elephant>
        <tiger>
            <tigerWoods></tigerWoods>       
            <americanTiger></americanTiger>
        </tiger>
    </pet>";

Could become:

const string expected = 
    @"<?xml version=""1.0"" encoding=""utf-16""?>
    <pet>
        <cat>Tom</cat>
        <dog>Puppy</dog>
        <elephant>
            <asianElephant>Biggy</asianElephant>
        </elephant>
    </pet>";
Diaster answered 6/9, 2011 at 10:28 Comment(2)
I ran a simple performance test yesterday: XDocument is far faster than regex. I still haven't worked out how to implement this using XmlTextReader, but in terms of complexity XDocument is good enough for my requirement, so I'm going with XDocument. Thanks for all your help! - Diaster
This might help: #14509688 - Greaves
F
36

Loading your original into an XDocument and using the following code gives your desired output:

var document = XDocument.Parse(original);
document.Descendants()
        .Where(e => e.IsEmpty || String.IsNullOrWhiteSpace(e.Value))
        .Remove();
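
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same query. The class and method names (EmptyElementRemover, RemoveEmptyElements) and the shortened sample input are assumptions made for illustration; only the Descendants/Where/Remove chain comes from the answer above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EmptyElementRemover
{
    // Hypothetical helper name; it wraps the query from the answer above.
    public static XDocument RemoveEmptyElements(string xml)
    {
        // Parse discards insignificant whitespace by default, so an element
        // that only contains empty children ends up with an empty Value.
        XDocument document = XDocument.Parse(xml);

        // Value concatenates all descendant text, so a parent whose children
        // are all empty is matched as well. The Remove extension snapshots
        // the query before it starts detaching nodes.
        document.Descendants()
                .Where(e => e.IsEmpty || string.IsNullOrWhiteSpace(e.Value))
                .Remove();
        return document;
    }

    public static void Main()
    {
        const string original = @"<pet>
    <cat>Tom</cat>
    <pig />
    <snake></snake>
    <tiger>
        <tigerWoods></tigerWoods>
    </tiger>
</pet>";
        // Only <pet> and <cat>Tom</cat> should survive.
        Console.WriteLine(RemoveEmptyElements(original));
    }
}
```

Note that this treats an element as empty purely by its text content, so elements that carry attributes but no text are removed too.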
Forwardness answered 6/9, 2011 at 11:2 Comment(5)
This is a great answer, but it will remove elements that have attributes but no content. For example, <asdf attr="val" /> would be removed, which may not be desirable. I've provided another answer based on this one to supplement it. - Holleran
@DanField It's an old question, but it helps to add updated and/or better answers. You could also have updated my answer if you had liked. Anyway, I upvoted your answer. - Forwardness
What will the line document.Descendants() do? - Silverpoint
@Silverpoint They tend to explain it well in the documentation. - Forwardness
I use the same approach. It only looks at the information in the inner text. - Swampy
H
19

This is meant to be an improvement on the accepted answer to handle attributes:

XDocument xd = XDocument.Parse(original);
xd.Descendants()
    .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(a.Value))
            && string.IsNullOrWhiteSpace(e.Value)
            && e.Descendants().SelectMany(c => c.Attributes()).All(ca => ca.IsNamespaceDeclaration || string.IsNullOrWhiteSpace(ca.Value))))
    .Remove();

The idea here is to check that all attributes on an element are also empty before removing it. There is also the case where empty descendants have non-empty attributes, so I added a third condition that checks that all attributes among the element's descendants are empty as well. Consider the following document, with node8 added:

<root>
  <node />
  <node2 blah='' adf='2'></node2>
  <node3>
    <child />
  </node3>
  <node4></node4>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns='urn://blah' d='a'/>
  <node7 xmlns='urn://blah2' />
  <node8>
     <child2 d='a' />
  </node8>
</root>

This would become:

<root>
  <node2 blah="" adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
  <node8>
    <child2 d="a" />
  </node8>
</root>

The original and improved answer to this question would lose the node2, node6, and node8 nodes. Checking for e.IsEmpty would work if you only want to strip out nodes like <node />, but it's redundant if you're going for both <node /> and <node></node>. If you also need to remove empty attributes, you could do this:

xd.Descendants().Attributes().Where(a => string.IsNullOrWhiteSpace(a.Value)).Remove();
xd.Descendants()
  .Where(e => (e.Attributes().All(a => a.IsNamespaceDeclaration))
            && string.IsNullOrWhiteSpace(e.Value))
  .Remove();

which would give you:

<root>
  <node2 adf="2"></node2>
  <node5><![CDATA[asdfasdf]]></node5>
  <node6 xmlns="urn://blah" d="a" />
</root>
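
Note that the output above no longer contains node8: its own attribute list is empty, so it is removed together with child2, even though child2 still has d="a". If that is not wanted, a possible sketch is to combine both variants: strip the empty attributes first, then remove an element only when neither it nor any of its descendants still carries a non-namespace attribute. The class and method names here (AttributeAwareCleaner, Clean) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AttributeAwareCleaner
{
    // Hypothetical helper: mutates the document in place.
    public static void Clean(XDocument xd)
    {
        // Step 1: drop attributes whose value is empty or whitespace.
        xd.Descendants().Attributes()
          .Where(a => string.IsNullOrWhiteSpace(a.Value))
          .Remove();

        // Step 2: remove elements with no text content, provided that the
        // element and all of its descendants have only namespace declarations
        // (or no attributes at all) left.
        xd.Descendants()
          .Where(e => string.IsNullOrWhiteSpace(e.Value)
                   && e.DescendantsAndSelf().Attributes().All(a => a.IsNamespaceDeclaration))
          .Remove();
    }

    public static void Main()
    {
        XDocument xd = XDocument.Parse(
            "<root><node /><node2 blah='' adf='2'></node2>" +
            "<node8><child2 d='a' /></node8></root>");
        Clean(xd);
        // node is removed; node2 (without blah) and node8/child2 are kept.
        Console.WriteLine(xd);
    }
}
```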
Holleran answered 16/6, 2015 at 22:31 Comment(0)
I
2

As always, it depends on your requirements.

Do you know how the empty tags will appear (e.g. <pig />, <pig></pig>, etc.)? I usually do not recommend using regular expressions (they are really useful, but at the same time they are evil). A string.Replace approach also seems problematic unless your XML has a fixed, known structure.

Finally, I would recommend using an XML parser approach (make sure your input is valid XML).

var doc = XDocument.Parse(original);
var emptyElements = from descendant in doc.Descendants()
                    where descendant.IsEmpty || string.IsNullOrWhiteSpace(descendant.Value)
                    select descendant;
emptyElements.Remove();
Incunabulum answered 6/9, 2011 at 10:58 Comment(2)
You don't need the extra ForEach and Remove; the Remove method acts on every element in the IEnumerable. - Forwardness
+1 for actually providing the solution earlier than the accepted answer, which is just a slightly more elegant version of this one. - Ludovika
P
0

XmlTextReader is preferable if we are talking about performance (it provides fast, forward-only access to XML). You can determine whether a tag is empty using the XmlReader.IsEmptyElement property.
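
One caveat is worth seeing in code: IsEmptyElement only reports the self-closing form. The small sketch below (class and method names are hypothetical) reads the first element of a string and returns the property, so the two spellings of an empty tag can be compared.

```csharp
using System;
using System.IO;
using System.Xml;

public static class IsEmptyElementDemo
{
    // Returns IsEmptyElement for the first element in the given XML.
    public static bool FirstElementIsEmpty(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // Skips the declaration, comments and whitespace, and stops
            // on the first element.
            reader.MoveToContent();
            return reader.IsEmptyElement;
        }
    }

    public static void Main()
    {
        Console.WriteLine(FirstElementIsEmpty("<pig />"));     // True
        Console.WriteLine(FirstElementIsEmpty("<pig></pig>")); // False
    }
}
```

So a reader-based solution has to track start and end tags itself to catch the <pig></pig> form.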

An XDocument approach that produces the desired output:

public static bool IsEmpty(XElement n)
{
    return n.IsEmpty 
        || (string.IsNullOrEmpty(n.Value) 
            && (!n.HasElements || n.Elements().All(IsEmpty)));
}

var doc = XDocument.Parse(original);
var emptyNodes = doc.Descendants().Where(IsEmpty);
foreach (var emptyNode in emptyNodes.ToArray())
{
    emptyNode.Remove();
}
Pinta answered 6/9, 2011 at 10:32 Comment(2)
IsEmptyElement doesn't work if the element is <pig></pig>. It would work if the element is <pig />. - Diaster
@Ming, you can implement the same logic as I provided for XDocument. - Pinta
C
0

Anything you use will have to pass through the file at least once. If it's just a single named tag that you know in advance, then regex is your friend; otherwise use a stack approach. Start with the parent tag, and if it has a sub tag, push it onto the stack. If you find an empty tag, remove it. Once you have gone through the child tags and reached the end tag of whatever is on top of the stack, pop it and check it as well; if it's empty, remove it too. This way you can remove all empty tags, including tags with only empty children.
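
One possible way to sketch the stack idea in a single forward pass with XmlReader and XmlWriter is to hold back each start tag until some non-whitespace text turns up beneath it. This is a variant of the description above, not a literal transcription of it, and it rests on several assumptions: all names are hypothetical, an element counts as empty when it has no text content (attributes are ignored for that decision), comments, processing instructions and the XML declaration are dropped, and namespace-qualified input has not been exercised.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class StreamingEmptyElementFilter
{
    // A start tag that has been read but not yet written to the output.
    private sealed class PendingElement
    {
        public string Prefix;
        public string LocalName;
        public string NamespaceUri;
        // Each entry holds { prefix, localName, namespaceUri, value }.
        public readonly List<string[]> Attributes = new List<string[]>();
    }

    public static string RemoveEmptyElements(string xml)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var pending = new List<PendingElement>(); // acts as the stack

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // <pig /> has no content and no end tag: drop it now.
                        if (reader.IsEmptyElement) break;
                        var element = new PendingElement
                        {
                            Prefix = reader.Prefix,
                            LocalName = reader.LocalName,
                            NamespaceUri = reader.NamespaceURI
                        };
                        while (reader.MoveToNextAttribute())
                        {
                            element.Attributes.Add(new[]
                                { reader.Prefix, reader.LocalName, reader.NamespaceURI, reader.Value });
                        }
                        pending.Add(element);
                        break;
                    }
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                    {
                        if (string.IsNullOrWhiteSpace(reader.Value)) break;
                        // Real content: every held-back ancestor is now needed.
                        foreach (PendingElement p in pending)
                        {
                            writer.WriteStartElement(p.Prefix, p.LocalName, p.NamespaceUri);
                            foreach (string[] a in p.Attributes)
                                writer.WriteAttributeString(a[0], a[1], a[2], a[3]);
                        }
                        pending.Clear();
                        if (reader.NodeType == XmlNodeType.CDATA) writer.WriteCData(reader.Value);
                        else writer.WriteString(reader.Value);
                        break;
                    }
                    case XmlNodeType.EndElement:
                        // Still pending means nothing was found inside: discard it.
                        if (pending.Count > 0) pending.RemoveAt(pending.Count - 1);
                        else writer.WriteEndElement();
                        break;
                }
            }
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveEmptyElements(
            "<pet><cat>Tom</cat><pig /><snake></snake><tiger><tigerWoods></tigerWoods></tiger></pet>"));
    }
}
```

Because nothing is written until content is seen, memory use is bounded by the nesting depth rather than by the document size, which is the main reason to prefer this over XDocument for very large inputs.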

If you are after a regular expression, use this:

Cassaba answered 6/9, 2011 at 10:42 Comment(0)
M
0

XDocument is probably simplest to implement, and will give adequate performance if you know your documents are reasonably small.

XmlTextReader will be faster and use less memory than XDocument when processing very large documents.

Regex is best for handling text rather than XML. It might not handle all edge cases as you would like (e.g. a tag within a CDATA section, or a tag with an xmlns attribute), so it is probably not a good idea for a general implementation, but it may be adequate depending on how much control you have over the input XML.
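
To make the CDATA edge case concrete, here is a small sketch. The pattern is an illustrative assumption, not one taken from this thread: it matches attribute-free <name /> and <name></name> tags and is applied repeatedly so that parents emptied by an earlier pass are removed too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // Illustrative pattern only: <name /> or <name></name>, optional
    // whitespace, no attributes.
    private static readonly Regex EmptyTag =
        new Regex(@"<\w+\s*/>|<(?<name>\w+)\s*>\s*</\k<name>\s*>");

    public static string StripEmptyTags(string xml)
    {
        string previous;
        // Repeat until nothing changes, so newly emptied parents go as well.
        do
        {
            previous = xml;
            xml = EmptyTag.Replace(xml, string.Empty);
        } while (xml != previous);
        return xml;
    }

    public static void Main()
    {
        // Works as intended on simple markup.
        Console.WriteLine(StripEmptyTags("<a><b></b><c>x</c></a>"));
        // prints <a><c>x</c></a>

        // But it also rewrites literal text inside a CDATA section.
        Console.WriteLine(StripEmptyTags("<doc><![CDATA[<b></b>]]></doc>"));
        // prints <doc><![CDATA[]]></doc>
    }
}
```

In the second case the CDATA content was ordinary text that merely looked like a tag, and an XML parser would have left it alone.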

Meridel answered 6/9, 2011 at 10:58 Comment(2)
Thanks dude, I like XmlTextReader. I did play around with it a bit but can't figure out a way to achieve my requirement. Do you have an example for it, please? - Diaster
@Ming, take a look at the following MSDN article, which describes how to chain an XmlReader to an XmlWriter, a technique that enables you to filter the XML in the way you want: msdn.microsoft.com/en-us/library/aa302289.aspx - Meridel
