Clean namespace handling with dom4j

5

9

We are using dom4j 1.6.1 to parse XML coming from an external source. Sometimes the tags mention the namespace (e.g. ) and sometimes they do not ( ). This makes calls to Element.selectSingleNode(String s) fail.

So far we have three solutions, and we are not happy with any of them.

1 - Remove all namespace occurrences before doing anything with the XML document

xml = xml.replaceAll("xmlns=\"[^\"]*\"", "");
xml = xml.replaceAll("ds:", "");
xml = xml.replaceAll("etm:", "");
[...] // and so on for each kind of namespace

2 - Remove the namespace just before getting a node, by calling

Element.remove(Namespace ns)

But it works only for a node and its first level of children

3 - Clutter the code with

node = rootElement.selectSingleNode(NameWithoutNameSpace);
if (node == null)
    node = rootElement.selectSingleNode(NameWithNameSpace);

So... what do you think? Which one is the least bad? Do you have another solution to propose?

Cupriferous answered 14/9, 2009 at 15:48 Comment(0)
T
6

I wanted to remove all namespace information (declarations and tag prefixes) to ease XPath evaluation. I ended up with this solution:

String xml = ...
SAXReader reader = new SAXReader();
Document document = reader.read(new ByteArrayInputStream(xml.getBytes()));
document.accept(new NameSpaceCleaner());
return document.asXML();

where the NameSpaceCleaner is a dom4j visitor :

private static final class NameSpaceCleaner extends VisitorSupport {
    public void visit(Document document) {
        ((DefaultElement) document.getRootElement())
                .setNamespace(Namespace.NO_NAMESPACE);
        document.getRootElement().additionalNamespaces().clear();
    }
    public void visit(Namespace namespace) {
        namespace.detach();
    }
    public void visit(Attribute node) {
        if (node.toString().contains("xmlns")
                || node.toString().contains("xsi:")) {
            node.detach();
        }
    }

    public void visit(Element node) {
        if (node instanceof DefaultElement) {
            ((DefaultElement) node).setNamespace(Namespace.NO_NAMESPACE);
        }
    }
}
Transmission answered 18/8, 2011 at 12:3 Comment(2)
Namespace.detach() doesn't seem to do anything: at least in my Document, the Namespace instances had null parents and null document properties, which prevented detach from working. I had to go through the parent Element to get rid of the strange, redundant Namespace child nodes (redundant because every Element has a QName property, which is what is actually used). This was with dom4j-1.6.1.Scarlettscarp
Note: if you look at the source code of reader.read(), you will find that it parses the XML content with the namespace-aware setting set to true (hardcoded in dom4j 1.6).Centiliter
L
5

The following is some code that I found and now use. It might be useful if you are looking for a generic way to remove all namespaces from a dom4j document.

    public static void removeAllNamespaces(Document doc) {
        Element root = doc.getRootElement();
        if (root.getNamespace() !=
                Namespace.NO_NAMESPACE) {            
                removeNamespaces(root.content());
        }
    }

    public static void unfixNamespaces(Document doc, Namespace original) {
        Element root = doc.getRootElement();
        if (original != null) {
            setNamespaces(root.content(), original);
        }
    }

    public static void setNamespace(Element elem, Namespace ns) {

        elem.setQName(QName.get(elem.getName(), ns,
                elem.getQualifiedName()));
    }

    /**
     * Recursively removes the namespace of the element and all its
     * children: sets them to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(Element elem) {
        setNamespaces(elem, Namespace.NO_NAMESPACE);
    }

    /**
     * Recursively removes the namespace of the list and all its
     * children: sets them to Namespace.NO_NAMESPACE.
     */
    public static void removeNamespaces(List l) {
        setNamespaces(l, Namespace.NO_NAMESPACE);
    }

    /**
     *Recursively sets the namespace of the element and all its children.
     */
    public static void setNamespaces(Element elem, Namespace ns) {
        setNamespace(elem, ns);
        setNamespaces(elem.content(), ns);
    }

    /**
     * Recursively sets the namespace of every attribute and element node
     * in the list, and of all descendants of those elements.
     */
    public static void setNamespaces(List l, Namespace ns) {
        Node n = null;
        for (int i = 0; i < l.size(); i++) {
            n = (Node) l.get(i);

            if (n.getNodeType() == Node.ATTRIBUTE_NODE) {
                ((Attribute) n).setNamespace(ns);
            }
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                setNamespaces((Element) n, ns);
            }            
        }
    }

Hope this is useful for someone who needs it!

Landry answered 26/8, 2010 at 8:5 Comment(3)
Couldn't make this code work. I used the XML-with-namespaces sample from w3schools, but it's as if dom4j doesn't recognize the namespaces. The first if (root.getNamespace() != Namespace.NO_NAMESPACE) evaluates to true, and even if I remove the if, it still does nothing.Hiphuggers
Hi Dan, this does remove the namespaces from the document. Probably you are interested in removing the prefixes as well.Landry
Sorry, I saved by mistake before finishing what I wanted to write! Dan, this function does remove the namespaces from the document. I tried it with the 5th example from w3schools. You can verify this by creating an XPath like "//table": run it on the document before and after calling the removeNamespaces function, and you'll see that only the latter finds the nodes. What exactly are you trying to do? I suspect you are more interested in just removing the prefixes, e.g. (h:table -> table)? Let me know if I can be of any help!Landry
R
1

Option 1 is dangerous because you can't guarantee the prefixes for a given namespace without pre-parsing the document, and because you can end up with namespace collision. If you're consuming a document and not outputting anything, it might be ok, depending on the source of the doc, but otherwise it just loses too much information.
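
To make that risk concrete, here is a minimal, JDK-only sketch of the string-replacement approach from the question, cut down to the ds prefix. The class name, the sample documents and the urn:example:a URI are made up for illustration.

```java
class RegexStripDemo {
    // Option 1 from the question, reduced to a single prefix for brevity.
    static String strip(String xml) {
        xml = xml.replaceAll("xmlns=\"[^\"]*\"", "");
        xml = xml.replaceAll("ds:", "");
        return xml;
    }

    public static void main(String[] args) {
        // Hypothetical input: the text content happens to contain "ds:".
        String prefixed = "<ds:note xmlns:ds=\"urn:example:a\">odds: 3 to 1</ds:note>";
        // Same namespace URI, but the sender chose a different prefix.
        String otherPrefix = "<sig:note xmlns:sig=\"urn:example:a\">ok</sig:note>";

        // The tags are cleaned, but the text is damaged too ("odds:" loses "ds:").
        System.out.println(strip(prefixed));
        // No pattern matches, so the prefixed tags come back unchanged.
        System.out.println(strip(otherPrefix));
    }
}
```

The first call also leaves the xmlns:ds declaration in place, because the first pattern only matches the default-namespace form xmlns="...".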

Option 2 could be applied recursively, but it has many of the same problems as option 1.

Option 3 sounds like the best approach, but rather than clutter your code, make a static method that does both checks rather than putting the same if statement throughout your codebase.
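
Such a helper could look roughly like this. It is only a sketch: NodeLookup and firstMatch are hypothetical names, and the lookup is passed in as a function so that the snippet compiles without dom4j on the classpath (the demo uses a plain Map in its place).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class NodeLookup {
    /**
     * Applies the lookup to each candidate name in order and returns the
     * first non-null result, or null if none of the names matches.
     */
    static <T> T firstMatch(Function<String, T> lookup, String... candidateNames) {
        for (String name : candidateNames) {
            T result = lookup.apply(name);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for a document that only knows the prefixed name.
        Map<String, String> fakeDocument = new HashMap<>();
        fakeDocument.put("ds:name", "found via prefixed name");

        System.out.println(firstMatch(fakeDocument::get, "name", "ds:name"));
    }
}
```

With dom4j the call would presumably be NodeLookup.firstMatch(rootElement::selectSingleNode, "name", "ds:name"), where both names are placeholders for the real element names.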

The best approach is to get whoever is sending you the bad XML to fix it. Of course, this raises the question of whether it is actually broken. Specifically, are you getting XML where the default namespace is defined as X and then a namespace also representing X is given a prefix of 'es'? If so, the XML is well formed and you just need code that is agnostic about the prefix but still uses a qualified name to fetch the element. I'm not familiar enough with dom4j to know whether creating a Namespace with a null prefix will cause it to match all elements with a matching URI or only those with no prefix, but it's worth experimenting with.
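
One way to be agnostic about the prefix is to match on the local name inside the XPath expression itself. The sketch below uses the JDK's built-in DOM and XPath classes instead of dom4j, purely so that it runs without extra jars; local-name() is a standard XPath 1.0 function, so the same expression string should also be usable with dom4j's selectSingleNode, though that is worth verifying. Note that this variant ignores the namespace URI as well as the prefix. The class name, method name and sample documents are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class LocalNameXPathDemo {
    // Returns the text of the first element with the given local name,
    // whatever its namespace or prefix; empty string if there is none.
    static String firstText(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep prefixes and local names apart
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String expression = "//*[local-name()='" + localName + "']";
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // The same element, sent without a namespace, with a default
        // namespace, and with a prefixed namespace.
        String plain = "<root><item>a</item></root>";
        String defaultNs = "<root xmlns=\"urn:example:a\"><item>b</item></root>";
        String prefixed =
                "<es:root xmlns:es=\"urn:example:a\"><es:item>c</es:item></es:root>";

        System.out.println(firstText(plain, "item"));
        System.out.println(firstText(defaultNs, "item"));
        System.out.println(firstText(prefixed, "item"));
    }
}
```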

Riella answered 14/9, 2009 at 16:25 Comment(1)
I will try digging through the docs about namespaces with a null prefix. Thanks anyway. About the source of the XML files: there is no way they will change anything. But the files, with or without namespaces, are valid. We build objects from the files and use them in our system, but we never "write" anything (we don't have the right to modify the XML files).Cupriferous
S
0

Like Abhishek, I needed to strip the namespaces from XML to simplify XPath queries in system testing scripts (the XML is first validated against an XSD).

Here are the problems I faced:

  1. I needed to process deeply nested XML that tended to blow up the call stack.
  2. On most complex XML, for a reason I didn't investigate fully, stripping all the namespaces only worked reliably when traversing the DOM tree depth first. That ruled out the visitor, as well as getting the list of nodes with document.selectNodes("//*").

I ended up with the following (not the most elegant, but it may help solve somebody's problem):

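public static String normaliseXml(final String message) throws DocumentException {
    org.dom4j.Document document = DocumentHelper.parseText(message);

    // Explicit LIFO stack (java.util.Deque) instead of recursion, so deeply
    // nested documents cannot overflow the call stack. It holds elements whose
    // children are still being processed, and the iterators over those children.
    Deque<Object> stack = new ArrayDeque<Object>();

    Object current = document.getRootElement();

    while (current != null) {
        if (current instanceof Element) {
            Element element = (Element) current;

            Iterator<?> iterator = element.elementIterator();

            if (iterator.hasNext()) {
                // Visit the children first; the element itself is stripped
                // once its iterator is exhausted.
                stack.push(element);
                current = iterator;
            } else {
                // Leaf element: strip it straight away.
                stripNamespace(element);

                current = stack.poll();
            }
        } else {
            Iterator<?> iterator = (Iterator<?>) current;

            if (iterator.hasNext()) {
                stack.push(iterator);
                current = iterator.next();
            } else {
                // All children done: the parent element is next on the stack.
                current = stack.poll();

                if (current instanceof Element) {
                    stripNamespace((Element) current);

                    current = stack.poll();
                }
            }
        }
    }

    return document.asXML();
}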

private static void stripNamespace(Element element) {
    QName name = new QName(element.getName(), Namespace.NO_NAMESPACE, element.getName());
    element.setQName(name);

    // Iterate over a copy (java.util.ArrayList): the loop removes and re-adds
    // attributes, which changes the element's attribute list while it is walked.
    for (Object o : new ArrayList<Object>(element.attributes())) {
        Attribute attribute = (Attribute) o;

        QName attributeName = new QName(attribute.getName(), Namespace.NO_NAMESPACE, attribute.getName());
        String attributeValue = attribute.getValue();

        element.remove(attribute);

        element.addAttribute(attributeName, attributeValue);
    }

    // Drop the namespace declarations made on this element.
    for (Object o : element.declaredNamespaces()) {
        Namespace namespace = (Namespace) o;
        element.remove(namespace);
    }
}
Sophistry answered 23/3, 2013 at 1:54 Comment(0)
A
0

This code actually works:

public void visit(Document document) {
    ((DefaultElement) document.getRootElement())
            .setNamespace(Namespace.NO_NAMESPACE);
    document.getRootElement().additionalNamespaces().clear();
}

public void visit(Namespace namespace) {
    if (namespace.getParent() != null) {
        namespace.getParent().remove(namespace);
    }
}

public void visit(Attribute node) {
    if (node.toString().contains("xmlns")
            || node.toString().contains("xsi:")) {
        node.getParent().remove(node);
    }
}

public void visit(Element node) {
    if (node instanceof DefaultElement) {
        ((DefaultElement) node).setNamespace(Namespace.NO_NAMESPACE);
        node.additionalNamespaces().clear();
    }
}
Aboveground answered 15/11, 2014 at 8:5 Comment(0)
