PHP - Read and repair big invalid XML files

Asked 28/3, 2013 at 10:13 Answered 27/7, 2013 at 22:36

I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <item>
    <title>Some article</title>
    <g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
  </item>
</rss>

Obviously, there is a missing </ul> closing tag inside the g:material tag. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add the missing CDATA section.

I've tried to read this file with a SAX parser, but it fails at the </g:material> tag because the </ul> tag is missing. I've tried XMLReader and hit basically the same issue. I could probably do something with DOMDocument::loadHTML, but the size of this file is not really compatible with a DOM approach. Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DOMDocument to work? Thanks.

Lewd answered 28/3, 2013 at 10:13 Comment(4)

Yes, they should have done it. You could always try a regexp find/replace on all your files if you know where the problems are. But it should not have been your concern in the first place. – T 28/3, 2013 at 10:17

Hey Rémi, couldn't you read the file as a string and add the CDATA sections before you pass it to your XML loader? – Bluff 28/3, 2013 at 10:18

Yes, that's what I was thinking about and what I am doing right now, but I still hope there are better things to do than reading the XML character by character or doing find/replace with regexps :) – Lewd 28/3, 2013 at 10:38

See similar (oldest) question: https://mcmap.net/q/1915747/-php-sax-parser-for-html/287948 – Retortion 27/7, 2013 at 23:18

If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.

$ tidy -output my.clean.xml my.xml

After that, the XML files are well-formed, so you can parse them using XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's code ends up inside the <body> element.
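Once the file is well-formed, XMLReader can stream it without loading the whole document. The following is only a minimal sketch: streamElementText() is a hypothetical helper name, and it assumes the elements of interest contain plain text.

```php
<?php
// Sketch: stream a well-formed XML file and yield the text of every
// element with the given (qualified) name. streamElementText() is a
// hypothetical helper, not part of PHP.
function streamElementText(string $path, string $elementName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        // read() advances one node at a time, so memory use stays small.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->name === $elementName) {
                yield $reader->readString();
            }
        }
    } finally {
        $reader->close();
    }
}
```

For example, foreach (streamElementText('my.clean.xml', 'title') as $title) { ... } would visit each title in turn.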

Gilgai answered 28/3, 2013 at 14:47 Comment(1)

Oops, you can use the Tidy extension for big files (see my answer below). You can also run PHP from the command line to transform HTML files into XHTML. – Retortion 27/7, 2013 at 22:39

(copy from https://mcmap.net/q/1915747/-php-sax-parser-for-html)

Summarizing as two steps:

1. Use Tidy to transform "free HTML" into "good XHTML".
2. Use XML Parser to parse the XHTML as XML through the SAX API.

First use Tidy (!) to transform "free HTML" into XHTML (or whenever you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). Raise the maximum execution time to a few minutes if the file is too big.
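A minimal sketch of that first step, assuming the tidy extension is installed. repairToXhtml(), the option values and the 300-second limit are illustrative choices, not something the answer prescribes. Note that the extension parses the whole file in memory.

```php
<?php
// Sketch: repair one file with the Tidy extension and write XHTML to $dest.
// repairToXhtml() and its defaults are hypothetical, chosen for illustration.
function repairToXhtml(string $source, string $dest, int $maxSeconds = 300): void
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not loaded');
    }
    set_time_limit($maxSeconds); // allow a few minutes for big inputs

    $tidy = new tidy();
    // 'output-xhtml' asks Tidy for XHTML; 'wrap' => 0 disables line wrapping.
    $config = ['output-xhtml' => true, 'wrap' => 0];
    if (!$tidy->parseFile($source, $config, 'utf8')) {
        throw new RuntimeException("Tidy could not parse $source");
    }
    $tidy->cleanRepair();

    if (file_put_contents($dest, tidy_get_output($tidy)) === false) {
        throw new RuntimeException("Cannot write $dest");
    }
}
```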

Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.
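The caching idea could look roughly like this. It is a sketch under assumptions: cachedRepair(), the md5-based cache file name and the modification-time check are all made up for illustration, and the default repairer relies on tidy_repair_file() from the tidy extension.

```php
<?php
// Sketch: reuse a previously repaired copy while the source is unchanged.
// cachedRepair() and the cache-file naming scheme are hypothetical.
function cachedRepair(string $source, string $cacheDir, ?callable $repair = null): string
{
    // Default repairer: Tidy's repairFile, asking for XHTML output.
    $repair = $repair ?? function (string $path) {
        return tidy_repair_file($path, ['output-xhtml' => true], 'utf8');
    };

    $cached = rtrim($cacheDir, '/') . '/' . md5($source) . '.xhtml';
    if (is_file($cached) && filemtime($cached) >= filemtime($source)) {
        return $cached; // cache hit: the source has not changed since
    }

    $xhtml = $repair($source);
    if ($xhtml === false || file_put_contents($cached, $xhtml) === false) {
        throw new RuntimeException("Could not repair or cache $source");
    }
    return $cached;
}
```

The repairer is a parameter so that a different cleanup routine can be swapped in without touching the cache logic.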

With a "trusted XHTML", use SAX... How to use SAX with PHP?

Parse the XML with a standard SAX API, which in PHP is implemented by LibXML (see LibXML2 at xmlsoft.org); its interface is PHP's XML Parser extension, which is close to the standard SAX API.
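As a rough illustration of the XML Parser interface, the sketch below feeds a file to the parser in 8 KB chunks and collects the text of one element type. collectElementText() is a hypothetical helper, and the input is assumed to be well-formed already.

```php
<?php
// Sketch: SAX-style parsing with PHP's XML Parser extension, fed in chunks
// so the whole file never sits in memory. collectElementText() is a
// hypothetical helper name.
function collectElementText(string $path, string $elementName): array
{
    $texts = [];
    $inside = false;
    $buffer = '';

    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$inside, &$buffer, $elementName) {
            if ($name === $elementName) { $inside = true; $buffer = ''; }
        },
        function ($p, $name) use (&$inside, &$buffer, &$texts, $elementName) {
            if ($name === $elementName) { $texts[] = $buffer; $inside = false; }
        }
    );
    // Text may arrive in several pieces, so accumulate it in a buffer.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$inside, &$buffer) {
        if ($inside) { $buffer .= $data; }
    });

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (!feof($handle)) {
        $chunk = fread($handle, 8192);
        // The third argument tells the parser when the last chunk arrives.
        if (!xml_parse($parser, $chunk, feof($handle))) {
            $message = xml_error_string(xml_get_error_code($parser));
            $line = xml_get_current_line_number($parser);
            fclose($handle);
            throw new RuntimeException("XML error: $message at line $line");
        }
    }
    fclose($handle);
    return $texts;
}
```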

Another way to use the "SAX of LibXML2" through a different interface (a PHP iterator instead of the traditional SAX interface) is XMLReader. See this explanation about "XMLReader use SAX".

Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!). See this old but good introduction.

Retortion answered 27/7, 2013 at 22:36 Comment(0)
