Removing duplicate newlines, tabs and whitespace in XML element text

<node> test
    test
    test
</node>

I want my XML parser to read the character data in <node> and:

  1. Replace newlines and tabs with spaces and collapse multiple spaces into one. As a result, the text should look similar to "test test test".
  2. If the node contains XML-encoded characters, i.e. tabs (&#x9;), newlines (&#xA;) or spaces (&#x20;), they should be left as they are.

I'm trying the code below, but it preserves the duplicated whitespace.

  DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
  dbf.setIgnoringComments( true );
  dbf.setNamespaceAware( namespaceAware );
  DocumentBuilder db = dbf.newDocumentBuilder();
  Document doc = db.parse( inputStream );

Is there any way to do what I want?

Thanks!

First answered 18/4, 2014 at 15:17 Comment(2)
Try adding this line: dbf.setIgnoringElementContentWhitespace(true); - Chromatics
Unfortunately, this doesn't work. This property controls how whitespace is handled in non-text elements. - First

The first part, collapsing runs of whitespace, is relatively easy, though I don't think the parser will do it for you:

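import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;

// Parse the input into a DOM by evaluating the root path
InputSource stream = new InputSource(inputStream);
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = (Document) xpath.evaluate("/", stream, XPathConstants.NODE);

// Select every text node in the document
NodeList nodes = (NodeList) xpath.evaluate("//text()", doc,
    XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); i++) {
  Text text = (Text) nodes.item(i);
  // \\s+ also turns a lone newline or tab into a space; \\s{2,} would leave it untouched
  text.setTextContent(text.getTextContent().replaceAll("\\s+", " "));
}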

// check results
TransformerFactory.newInstance()
    .newTransformer()
    .transform(new DOMSource(doc), new StreamResult(System.out));
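
To try this against the sample from the question, here is a minimal self-contained sketch. The class name CollapseWhitespace and the helper collapsedRootText are made up for illustration, and it reads the XML from a string rather than from your inputStream. It collapses every whitespace run with \\s+, so a single newline or tab also becomes a space.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CollapseWhitespace {

    // Hypothetical helper: parses the XML, collapses each whitespace run in
    // every text node to a single space, and returns the root element's text.
    public static String collapsedRootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node text = nodes.item(i);
                text.setNodeValue(text.getNodeValue().replaceAll("\\s+", " "));
            }
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<node> test\n    test\n    test\n</node>";
        // The collapsed text keeps one leading and one trailing space,
        // so trim it if you want exactly the words.
        System.out.println("[" + collapsedRootText(xml).trim() + "]");
        // prints "[test test test]"
    }
}
```

Whether you trim the result is a separate decision: the collapse alone leaves a single space where the leading and trailing whitespace used to be.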

This is the hard part:

If the node contains XML-encoded characters, i.e. tabs (&#x9;), newlines (&#xA;) or spaces (&#x20;), they should be left as they are.

The parser will always turn "&#x9;" into "\t", so by the time your code sees the text it cannot tell an encoded tab from a literal one. You may need to write your own XML parser.
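
A small sketch of that behaviour, assuming the default JAXP DocumentBuilder (the class name CharRefDemo and the helper rootText are invented for this example): once parsed, an encoded tab and a literal tab yield the same text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CharRefDemo {

    // Hypothetical helper: returns the text content of the root element.
    public static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String encoded = rootText("<node>a&#x9;b</node>"); // character reference
        String literal = rootText("<node>a\tb</node>");    // real tab character
        // Both strings are "a", a tab, then "b"; the reference is gone.
        System.out.println(encoded.equals(literal)); // prints "true"
    }
}
```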

According to the author of Saxon:

I don't think any XML parser will report numeric character references to the application - they will always be expanded. Really, your application shouldn't care about this any more than it cares about how much whitespace there is between attributes.
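
If you really do need the distinction, one possible workaround short of writing a parser is to hide the references from the parser before it runs. This is only a sketch of that idea, not something the parser API offers: the class ProtectCharRefs, the helper normalizedRootText and the private-use sentinel characters are all assumptions of this example. It handles only the three spellings from the question (&#x9;, &#xA;, &#x20;), and because it is a plain text substitution it also touches references inside CDATA sections, comments and attribute values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ProtectCharRefs {

    // Assumed sentinels: private-use code points that must not occur in the input.
    private static final String TAB = "\uE009";
    private static final String NEWLINE = "\uE00A";
    private static final String SPACE = "\uE020";

    public static String normalizedRootText(String xml) {
        // 1. Swap the encoded whitespace for sentinels so the parser
        //    never expands it into real whitespace.
        String hidden = xml.replace("&#x9;", TAB)
                           .replace("&#xA;", NEWLINE)
                           .replace("&#x20;", SPACE);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(hidden)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//text()", doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node text = nodes.item(i);
                // 2. Collapse literal whitespace; \s does not match the sentinels.
                String collapsed = text.getNodeValue().replaceAll("\\s+", " ");
                // 3. Put the protected characters back.
                text.setNodeValue(collapsed.replace(TAB, "\t")
                                           .replace(NEWLINE, "\n")
                                           .replace(SPACE, " "));
            }
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<node>a&#x9;&#x9;b \n\t c</node>";
        // The two encoded tabs survive; the literal run collapses to one space.
        System.out.println(normalizedRootText(xml).equals("a\t\tb c")); // prints "true"
    }
}
```

A more robust version would need to match decimal and zero-padded forms such as &#9; or &#x09; and skip CDATA sections, which is where this starts to approach writing your own tokenizer.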

Jaclyn answered 18/4, 2014 at 16:22 Comment(0)
