Parsing XML with no closing tags in Java

2

-1

I am having trouble parsing an XML-like document in which most elements have no closing tag. See the snippet below.

I have tried both the SAX and StAX parsers, but they need well-formed XML with closing tags, e.g. <Name>XXYY</Name>. As you can see below, this format is a little different. Is there an API that can parse this, or is there a way to get SAX/StAX to do what I want?

<Employees>
 <Employee>
  <Detail>
    <Date>2018014
    <Name>XXYY
    <Age>0
    <LANGUAGE>ENG
    <Manager>
    <MName>YYXX
    <MID>5959
    </Manager>
    <EmployeeID>1234
  </Detail>
 </Employee>
</Employees>
Crescint answered 6/1, 2018 at 16:48 Comment(4)
Of course you are having trouble, because that is not valid XML. It is not XML at all, it just happens to look like XML. Either fix the input to actually be XML, or write your own parser for reading that non-XML input. Caudell
(These "elements with non-closed-tags" show no structure. The world might be a better place if they were attributes. Why does this remind me of DocumentTypes?) Mckellar
Don't try to parse broken formats. Instead, get the person who generated this to put it into its proper form. Gilliam
@Mckellar yeah, that's exactly what it is: an SGML markup language. And for those of you who keep saying it's "not valid": anything is possible to parse as long as you understand the structure. Crescint
2

You could "fix" the XML by adding all the missing end-tags.

Any start-tag that is followed by text on the same line can be fixed by adding the matching end-tag at the end of that line.

The "followed by text" rule ensures that, e.g., the <Manager> tag is left alone, since it is already closed 3 lines down.

Example working code:

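import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Load file into memory
String xml = new String(Files.readAllBytes(Paths.get("test.xml")), StandardCharsets.UTF_8);

// For each line of the form <Tag>text, append the matching </Tag> at the end of the line
xml = xml.replaceAll("(?m)^(\\s*)<(\\w+)>([^<]+)$", "$1<$2>$3</$2>");

// Parse then print the XML, to ensure there are no errors
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                          .parse(new InputSource(new StringReader(xml)));
TransformerFactory.newInstance().newTransformer()
                  .transform(new DOMSource(document), new StreamResult(System.out));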
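To see the rule in isolation, here is a minimal, self-contained sketch that applies it to the snippet from the question instead of a file. The class name EndTagFixer and the helper names addMissingEndTags and parse are made up for this illustration. The pattern is also a slight variation on the one above, not the same one: as an assumption on my part, the text group excludes line breaks, so a start-tag followed by an empty line is not given an end-tag.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EndTagFixer {

    // Hypothetical helper: for every line of the form "<Tag>text", append "</Tag>".
    // Lines holding only a start-tag (e.g. <Manager>) or an end-tag do not match.
    // Assumption: the text group excludes \r and \n so it cannot span lines.
    static String addMissingEndTags(String input) {
        return input.replaceAll("(?m)^([ \\t]*)<(\\w+)>([^<\\r\\n]+)$", "$1<$2>$3</$2>");
    }

    // Hypothetical helper: parse a string with the standard JAXP DOM parser.
    // This throws if the repaired text is still not well-formed XML.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // The sample from the question, inlined so the sketch needs no input file
        String input = String.join("\n",
                "<Employees>",
                " <Employee>",
                "  <Detail>",
                "    <Date>2018014",
                "    <Name>XXYY",
                "    <Age>0",
                "    <LANGUAGE>ENG",
                "    <Manager>",
                "    <MName>YYXX",
                "    <MID>5959",
                "    </Manager>",
                "    <EmployeeID>1234",
                "  </Detail>",
                " </Employee>",
                "</Employees>");

        String fixed = addMissingEndTags(input);
        System.out.println(fixed);

        // Once repaired, ordinary DOM lookups work
        Document doc = parse(fixed);
        System.out.println(doc.getElementsByTagName("MName").item(0).getTextContent());
    }
}
```

This only works because every unclosed element sits on its own line with its text. If the real input has text spanning several lines, or characters such as a bare & in the text, the regex approach is not enough and a real SGML tool is the safer route.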
Caudell answered 6/1, 2018 at 17:7 Comment(0)
1

That appears to be SGML, not XML. I've answered a newer question (for JavaScript/Node.js, but relevant to Java as well) detailing how to use the OpenSP SGML software to create XML from SGML.

Cirillo answered 22/5, 2018 at 6:24 Comment(0)
