I'm trying to parse a massive XML file into my MySQL database. The file is 4.7 GB. I know, it's insane.
The data comes from here: http://www.discogs.com/data/ (the newest album XML is 700 MB zipped and 4.7 GB unzipped)
I can use either Java or PHP to parse the file and update the database. I assume Java is the smarter choice.
I need to find a way to parse the XML without filling my 4 GB of RAM, and load it into the DB.
What is the smartest way of doing this? I've heard of SAX parsers; am I thinking in the right direction?
For now, I don't care about downloading the images from those URLs; I just want the data in my database. I have not designed the tables yet, but I'm more interested in the XML side right now.
I used PHP's fread() to read the first 1000 bytes of the file so I could at least see what it looks like. Here's a sample of the structure, showing the first album in the file:
<releases>
<release id="1" status="Accepted">
<images>
<image height="600" type="primary" uri="http://s.dsimg.com/image/R-1-1193812031.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812031.jpeg" width="600" />
<image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812053.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812053.jpeg" width="600" />
<image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812072.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812072.jpeg" width="600" />
<image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812091.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812091.jpeg" width="600" />
</images>
<artists>
<artist>
<name>Persuader, The</name>
</artist>
</artists>
<title>Stockholm</title>
<labels>
<label catno="SK032" name="Svek" />
</labels>
<formats>
<format name="Vinyl" qty="2">
<descriptions>
<description>12"</description>
</descriptions>
</format>
</formats>
<genres>
<genre>Electronic</genre>
</genres>
<styles>
<style>Deep House</style>
</styles>
<country>Sweden</country>
<released>1999-03-00</released>
<notes>Recorded at the Globe studio in Stockholm. The titles are the names of Stockholm's districts.</notes>
<master_id>5427</master_id>
<tracklist>
<track>
<position>A</position>
<title>Östermalm</title>
<duration>4:45</duration>
</track>
<track>
<position>B1</position>
<title>Vasastaden</title>
<duration>6:11</duration>
</track>
<track>
<position>B2</position>
<title>Kungsholmen</title>
<duration>2:49</duration>
</track>
<track>
<position>C1</position>
<title>Södermalm</title>
<duration>5:38</duration>
</track>
<track>
<position>C2</position>
<title>Norrmalm</title>
<duration>4:52</duration>
</track>
<track>
<position>D</position>
<title>Gamla Stan</title>
<duration>5:16</duration>
</track>
</tracklist>
</release>
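To make the SAX idea concrete, here is a rough sketch of what I imagine the Java side might look like, based only on the sample above. I have not run it against the real dump. The class name `ReleaseSaxSketch`, the `Release` holder, and the `sink` callback are all names I made up for illustration; the callback is where I assume a batched database insert would go. It only picks out the release id, title, and country, and it tracks element depth so that the `<title>` inside each `<track>` is not mistaken for the release title.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ReleaseSaxSketch {

    /** Hypothetical holder for the few fields I would insert first. */
    public static final class Release {
        public String id;
        public String title;
        public String country;
    }

    /** Keeps only the release currently being read in memory. */
    static final class ReleaseHandler extends DefaultHandler {
        private final Consumer<Release> sink;   // stand-in for the DB insert
        private final StringBuilder text = new StringBuilder();
        private Release current;
        private int depth;                      // <releases> = 1, <release> = 2

        ReleaseHandler(Consumer<Release> sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
            text.setLength(0);                  // start collecting text for this element
            if (depth == 2 && "release".equals(qName)) {
                current = new Release();
                current.id = atts.getValue("id");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append rather than assign.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null) {
                if (depth == 3 && "title".equals(qName)) {
                    current.title = text.toString().trim();
                } else if (depth == 3 && "country".equals(qName)) {
                    current.country = text.toString().trim();
                } else if (depth == 2 && "release".equals(qName)) {
                    sink.accept(current);       // hand off, then forget the release
                    current = null;
                }
            }
            depth--;
        }
    }

    /** Streams the XML from the input and calls sink once per release. */
    public static void parse(InputStream in, Consumer<Release> sink) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new ReleaseHandler(sink));
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample; for the real file I would pass a FileInputStream instead.
        String xml = "<releases><release id=\"1\" status=\"Accepted\">"
                + "<title>Stockholm</title><country>Sweden</country>"
                + "<tracklist><track><title>Gamla Stan</title></track></tracklist>"
                + "</release></releases>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        parse(in, r -> System.out.println(r.id + " | " + r.title + " | " + r.country));
    }
}
```

If this is roughly the right direction, my assumption is that memory use stays flat regardless of file size, since nothing is kept after each `</release>` is handed to the callback.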
Thanks.