I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, the closing </ul> tag is missing inside the g:material element. Moreover, the people who developed this feed should have wrapped the content of g:material in a CDATA section, which they did not. That is basically what I want to do: add the missing CDATA section.
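In other words, for the sample above, my understanding is that I should end up with something like this (once the content is inside CDATA, the missing </ul> no longer breaks XML parsing):

```xml
<g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li>]]></g:material>
```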
I've tried reading this file with a SAX parser, but it fails on the </g:material> tag since the </ul> tag is missing. I've tried XMLReader as well and ran into basically the same issue.
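For reference, my XMLReader attempt boils down to this (with the feed trimmed to the broken element and inlined instead of the real 1 GB file):

```php
<?php
// Minimal reproduction of the failure with an inlined sample feed.
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">'
     . '<item><g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material></item>'
     . '</rss>';

// Collect libxml errors instead of letting them surface as PHP warnings.
libxml_use_internal_errors(true);

$reader = new XMLReader();
$reader->XML($xml);
while ($reader->read()) {
    // read() returns false with a parse error once </g:material> is reached,
    // because libxml still expects the matching </ul>
}
foreach (libxml_get_errors() as $error) {
    // e.g. "Opening and ending tag mismatch: ul ... and g:material"
    echo trim($error->message), "\n";
}
$reader->close();
```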
I could probably do something with DOMDocument::loadHTML, but a file of this size is not really compatible with a DOM approach, since the whole document would have to be loaded into memory.
Do you have any idea how I could repair this feed without having to buy a lot of RAM just so DOMDocument can work?
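For what it's worth, the only workaround I've come up with so far is a naive line-by-line rewrite with a regular expression, which assumes every g:material element fits on a single line (true in my sample, but I'm not sure it holds across all feeds):

```php
<?php
// Naive streaming repair sketch: wrap each single-line <g:material> element's
// content in a CDATA section so the broken HTML inside it no longer has to be
// well-formed XML. Assumption (mine): the element never spans multiple lines.
function wrapMaterialInCdata(string $line): string
{
    return preg_replace(
        '~<g:material>(.*?)</g:material>~s',
        '<g:material><![CDATA[$1]]></g:material>',
        $line
    );
}

// Streaming usage over the real 1 GB file would look like this
// (file paths are placeholders):
//   $in  = fopen('feed.xml', 'r');
//   $out = fopen('feed-fixed.xml', 'w');
//   while (($line = fgets($in)) !== false) {
//       fwrite($out, wrapMaterialInCdata($line));
//   }
//   fclose($in);
//   fclose($out);

echo wrapMaterialInCdata('<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>'), "\n";
// → <g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li>]]></g:material>
```

It keeps memory flat since only one line is held at a time, but it feels fragile, so I'd prefer a more robust approach if one exists.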
Thanks.