How to parse an extremely large xml file in either java or php and insert into a mysql DB [closed]

I'm trying to parse a massive XML file into my MySQL database. The file is 4.7 GB. I know, it's insane.

The data comes from here: http://www.discogs.com/data/ (the newest album XML is 700 MB zipped and 4.7 GB unzipped)

I can use either Java or PHP to parse it and update the database. I assume that Java is the smarter choice.

I need to find a way to parse the XML without filling my 4 GB of RAM, and load it into the DB.

What is the smartest way of doing this? I've heard of SAX parsers; am I thinking in the right direction?

For now, I don't care about downloading the images from those URLs; I just want the data in my database. I haven't designed the tables yet, but I'm more interested in the XML side right now.

I used PHP's fread() to read the file's first 1000 bytes, so at least I can see what it looks like. Here's a sample of the structure of the first album in the file:

<releases>
<release id="1" status="Accepted">
    <images>
        <image height="600" type="primary" uri="http://s.dsimg.com/image/R-1-1193812031.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812031.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812053.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812053.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812072.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812072.jpeg" width="600" />
        <image height="600" type="secondary" uri="http://s.dsimg.com/image/R-1-1193812091.jpeg" uri150="http://s.dsimg.com/image/R-150-1-1193812091.jpeg" width="600" />
    </images>
    <artists>
        <artist>
            <name>Persuader, The</name>
        </artist>
    </artists>
    <title>Stockholm</title>
    <labels>
        <label catno="SK032" name="Svek" />
    </labels>
    <formats>
        <format name="Vinyl" qty="2">
            <descriptions>
                <description>12"</description>
            </descriptions>
        </format>
    </formats>
    <genres>
        <genre>Electronic</genre>
    </genres>
    <styles>
        <style>Deep House</style>
    </styles>
    <country>Sweden</country>
    <released>1999-03-00</released>
    <notes>Recorded at the Globe studio in Stockholm. The titles are the names of Stockholm's districts.</notes>
    <master_id>5427</master_id>
    <tracklist>
        <track>
            <position>A</position>
            <title>Östermalm</title>
            <duration>4:45</duration>
        </track>
        <track>
            <position>B1</position>
            <title>Vasastaden</title>
            <duration>6:11</duration>
        </track>
        <track>
            <position>B2</position>
            <title>Kungsholmen</title>
            <duration>2:49</duration>
        </track>
        <track>
            <position>C1</position>
            <title>Södermalm</title>
            <duration>5:38</duration>
        </track>
        <track>
            <position>C2</position>
            <title>Norrmalm</title>
            <duration>4:52</duration>
        </track>
        <track>
            <position>D</position>
            <title>Gamla Stan</title>
            <duration>5:16</duration>
        </track>
    </tracklist>
</release>

Thanks.

Howardhowarth asked 14/2, 2011 at 18:52 Comment(1)
Use extended vtd-xml. We have had overwhelmingly positive responses from users dealing with far bigger XML files. – Arenicolous

I ran into a similar problem some time ago. Here is part of a script that imported a file of around 28 MB without reading the whole document into memory. It reads the XML node by node, so only a small part of the document is held in memory at any time. The script will need small modifications to fit your needs.

// Note: $columns (map of DB column => XML element name), $DateColumns,
// $SplitDateInsert, $tableName and $db come from elsewhere in the script.
$reader = new XMLReader();
$reader->open('<path_to_large_xml_file>'); // path placeholder

while ($reader->read()) {
    switch ($reader->nodeType) {
        case XMLReader::ELEMENT:
            if ($reader->localName == "Table") {

                // expand only the current element into a small DOM tree
                $node = $reader->expand();
                $dom = new DOMDocument();
                $n = $dom->importNode($node, true);
                $dom->appendChild($n);
                $sxe = simplexml_import_dom($n);

                $Data = array();
                $DataColumns = array();

                foreach ($columns as $key => $column) {

                    if (in_array($key, $DateColumns)) {
                        // convert d/m/Y dates to MySQL DATETIME format
                        $DateArray = explode('/', substr(trim($sxe->$column), 0, 10));
                        $ValueColumn = date('Y-m-d H:i:s', mktime(0, 0, 0, $DateArray[1], $DateArray[0], $DateArray[2]));
                        $Data[] = '\''.$ValueColumn.'\'';
                        $DataColumns[] = $key;

                        if ($SplitDateInsert == 'enabled') {
                            $Data[] = '\''.$DateArray[2].'\'';
                            $Data[] = '\''.$DateArray[1].'\'';
                            $Data[] = '\''.$DateArray[0].'\'';

                            $DataColumns[] = $key.'_year';
                            $DataColumns[] = $key.'_month';
                            $DataColumns[] = $key.'_day';
                        }

                    } else {
                        $ValueColumn = addslashes(trim($sxe->$column));
                        $Data[] = '\''.$ValueColumn.'\'';
                        $DataColumns[] = $key;
                    }

                }

                $SQL = "INSERT INTO {$tableName} (".implode(',', $DataColumns).") VALUES (".implode(',', $Data).")";
                $db->query($SQL);

            } // end if Table
            break;
    }
}
Antarctica answered 14/2, 2011 at 19:4 Comment(2)
Where did the variable $columns in the foreach come from? – Howardhowarth
I ditched the foreach loop and manually used the values I needed. – Howardhowarth

You obviously need a streaming API rather than a DOM, which would have to hold the whole document in memory. Java supports SAX and StAX. I've never used StAX myself, but I've heard it is easier to use than SAX while still efficient.

Make sure to split the work into many transactions as well: the database won't cope with that many insert statements in a single transaction.
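
To make that concrete, here is a minimal StAX sketch along those lines. The file name, connection details, and the releases(id, title) table are assumptions for illustration; a real importer would map many more of the fields shown in the question:

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ReleaseImporter {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("discogs_releases.xml"));

        // connection URL, credentials and table layout are made up for the example
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/discogs", "user", "pass")) {
            conn.setAutoCommit(false);
            PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO releases (id, title) VALUES (?, ?)");

            String releaseId = null;
            int pending = 0;

            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if (name.equals("release")) {
                    releaseId = reader.getAttributeValue(null, "id");
                } else if (name.equals("title") && releaseId != null) {
                    // the first <title> after <release> is the album title;
                    // clearing releaseId skips the per-track <title> elements
                    stmt.setInt(1, Integer.parseInt(releaseId));
                    stmt.setString(2, reader.getElementText());
                    stmt.addBatch();
                    releaseId = null;
                    if (++pending % 1000 == 0) {
                        stmt.executeBatch();
                        conn.commit(); // keep each transaction small
                    }
                }
            }
            stmt.executeBatch();
            conn.commit();
        }
        reader.close();
    }
}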

Recrudescence answered 14/2, 2011 at 19:7 Comment(0)

If I were to parse it using PHP, I would do it in two steps:

  1. Split the file every few <release> nodes and make each mini-file a valid XML document (see the sketch below).
  2. Parse each resulting file separately.

If speed is not of the essence, then PHP would actually be better, as it is easy to parse text/XML in PHP.
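
Since the question allows Java as well, here is the split step sketched in Java rather than PHP. It is a minimal sketch that assumes the dump keeps each <release ...> open tag at the start of its own line, as in the sample above; the chunk_N.xml output names are made up:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class ReleaseSplitter {
    private static final int PER_FILE = 10000; // releases per mini-file; tune to taste

    public static void main(String[] args) throws IOException {
        int count = 0;
        int fileNo = 0;
        FileWriter out = null;

        try (BufferedReader in = new BufferedReader(new FileReader("discogs_releases.xml"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String t = line.trim();
                // drop the prolog and the original root tags; each chunk gets its own
                if (t.startsWith("<?xml") || t.equals("<releases>") || t.equals("</releases>")) {
                    continue;
                }
                if (t.startsWith("<release ") && count++ % PER_FILE == 0) {
                    if (out != null) {
                        out.write("</releases>\n");
                        out.close();
                    }
                    out = new FileWriter("chunk_" + fileNo++ + ".xml");
                    out.write("<releases>\n"); // re-wrap so each chunk is valid XML
                }
                if (out != null) {
                    out.write(line);
                    out.write('\n');
                }
            }
        }
        if (out != null) {
            out.write("</releases>\n");
            out.close();
        }
    }
}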

Steradian answered 14/2, 2011 at 18:57 Comment(3)
How would I actually split the file without loading the whole document in order to parse and split it programmatically? – Howardhowarth
If you use PHP's low-level file manipulation functions, you do not load the whole file, only a set number of lines/bytes per call. It is like working with a cursor on a query with a huge result set. – Steradian
LOAD DATA LOCAL INFILE ... ROWS IDENTIFIED BY '</release>'. Then you can select each row, append the '</release>' close tag to compose a valid XML structure, and parse them one by one, if we take this particular approach. – Riancho

Assuming that MySQL has capabilities similar to Oracle's in this regard, why not let the DB handle the parsing? In Oracle you could just register the XML Schema, create a structured XMLType table (probably more useful than a CLOB), and then insert the file.

I've never used it for anything quite that large, but I don't see why it shouldn't work, and it's done in a few lines of code. You'd only need someone with MySQL experience to explain how the details work.

Imprint answered 14/2, 2011 at 19:49 Comment(0)

I would second the suggestion to use StAX on Java. Or, even simpler, StaxMate, which adds a bit more convenience on top.

But what exactly do you need to do with the XML? If you just want to stream it into the DB, there are ways to handle BLOBs as streams (although MySQL JDBC drivers are notorious for quirks), so that should be doable.
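
As a sketch of that BLOB-streaming idea (the table and connection details are hypothetical, and note that MySQL's max_allowed_packet setting may still get in the way, which is one of the quirks alluded to above):

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class XmlBlobLoader {
    public static void main(String[] args) throws Exception {
        // hypothetical table: CREATE TABLE xml_dumps (id INT PRIMARY KEY, body LONGBLOB)
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/discogs", "user", "pass");
             InputStream in = new FileInputStream("discogs_releases.xml");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO xml_dumps (id, body) VALUES (?, ?)")) {
            ps.setInt(1, 1);
            ps.setBinaryStream(2, in); // JDBC 4.0: lets the driver stream instead of buffering
            ps.executeUpdate();
        }
    }
}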

Metaprotein answered 14/2, 2011 at 19:59 Comment(0)

4.7 GB is not insane, merely mild. If you have 64-bit Java, extended VTD-XML should be the most efficient and easiest-to-use option available.
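
For context, the navigation API looks roughly like this in standard VTD-XML (a sketch with a made-up file name; the standard edition maps the whole document into memory, which is exactly why the extended edition is the recommendation for a 4.7 GB file, and it follows the same pattern):

import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

public class VtdSketch {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        if (!vg.parseFile("discogs_releases.xml", false)) { // false = namespace-unaware
            throw new RuntimeException("parse failed");
        }
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("/releases/release");
        while (ap.evalXPath() != -1) {
            // the cursor now sits on a <release>; read its id attribute
            int idx = vn.getAttrVal("id");
            if (idx != -1) {
                String id = vn.toString(idx);
                // navigate further / build the INSERT from here
            }
        }
    }
}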

Arenicolous answered 14/2, 2011 at 20:24 Comment(0)

You haven't said what processing you want to do with the XML. You could consider using Saxon's streaming-mode XSLT (it needs the Saxon-EE product, which costs money). If the processing is essentially a "burst-mode streaming" exercise of handling each release element in turn, it shouldn't be too difficult. Of course you can also use low-level SAX or StAX parsing, but that will almost certainly require developing and debugging rather more code.
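
To illustrate the low-level alternative, a bare-bones SAX skeleton might start like this (a sketch with a made-up file name; accumulating child text such as titles and durations needs extra characters() buffering, which is where the "rather more code" comes in):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ReleaseHandler extends DefaultHandler {
    private int releases = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("release")) {
            releases++;
            // attrs.getValue("id") and friends would feed the INSERTs here
        }
    }

    public static void main(String[] args) throws Exception {
        ReleaseHandler handler = new ReleaseHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("discogs_releases.xml"), handler);
        System.out.println(handler.releases + " releases seen");
    }
}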

Duodecimo answered 14/2, 2011 at 23:0 Comment(0)
