Tags: python, mysql, xml, xml-import

Performance bulk-loading data from an XML file to MySQL


Should an import of 80GB of XML data into MySQL take more than 5 days to complete?

I'm currently importing an XML file that is roughly 80GB in size; the code I'm using is in this gist. While everything is working properly, it's been running for almost 5 straight days and it's not even close to being done ...

The average table size is roughly:

Data size: 4.5GB
Index size: 3.2GB
Avg. row length: 245
Number of rows: 20,000,000

Let me know if more info is needed!

Server Specs:

Note this is a linode VPS

Intel Xeon Processor L5520 (quad core, 2.27GHz), 4GB total RAM

XML Sample

https://gist.github.com/2510267

Thanks!


After researching this further, it seems this rate is about average; I found this answer, which describes several ways to improve the import rate.


Solution

  • One thing which will help a great deal is to commit less frequently, rather than once-per-row. I would suggest starting with one commit per several hundred rows, and tuning from there.
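    A minimal sketch of per-batch commits, using a generic DB-API connection (the table and column names below are illustrative, not taken from your gist):

    ```python
    from itertools import islice

    def iter_batches(rows, batch_size=500):
        """Yield lists of up to batch_size rows from any iterable.

        Works on generators too, so the parsed XML rows never need
        to be held in memory all at once.
        """
        it = iter(rows)
        while True:
            batch = list(islice(it, batch_size))
            if not batch:
                return
            yield batch

    # Against a real connection it would look roughly like this
    # (cursor/conn and the INSERT are placeholders for your own schema):
    #
    # for batch in iter_batches(parsed_rows, batch_size=500):
    #     cursor.executemany(
    #         "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch
    #     )
    #     conn.commit()   # one commit per batch instead of per row
    ```

    Tune `batch_size` empirically: too small and you pay commit overhead per batch, too large and a failed batch loses more work on rollback.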

    Also, the thing you're doing right now where you do an existence check -- dump that; it's greatly increasing the number of queries you need to run. Instead, use ON DUPLICATE KEY UPDATE (a MySQL extension, not standards-compliant) to make a duplicate INSERT automatically do the right thing.
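    One way to put that into practice is to generate the upsert statement once and reuse it for every row, so the existence check disappears entirely. A sketch (the helper name and schema are made up for illustration):

    ```python
    def upsert_sql(table, columns, key_columns):
        """Build an INSERT ... ON DUPLICATE KEY UPDATE statement.

        key_columns are the PRIMARY KEY / UNIQUE columns; every other
        column is overwritten from the incoming row on conflict via
        VALUES(col), so duplicates update in place instead of erroring.
        """
        placeholders = ", ".join(["%s"] * len(columns))
        updates = ", ".join(
            f"{c} = VALUES({c})" for c in columns if c not in key_columns
        )
        return (
            f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({placeholders}) "
            f"ON DUPLICATE KEY UPDATE {updates}"
        )
    ```

    The resulting string can be passed straight to `cursor.executemany()` with your row tuples, replacing the SELECT-then-INSERT/UPDATE pair with a single statement per row.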

    Finally, consider building your tool to convert from XML into a textual form suitable for use with the mysqlimport tool, and using that bulk loader instead. This will cleanly separate the time needed for XML parsing from the time needed for database ingestion, and also speed the database import itself by using tools designed for the purpose (rather than INSERT or UPDATE commands, mysqlimport uses a specialized LOAD DATA INFILE extension).
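    A sketch of that conversion step using the standard library's streaming parser, `xml.etree.ElementTree.iterparse`, so the 80GB file is never loaded whole; the record tag and field names are placeholders for whatever your XML actually uses:

    ```python
    import csv
    import xml.etree.ElementTree as ET

    def xml_to_tsv(xml_path, tsv_path, record_tag, fields):
        """Stream records from a large XML file into a tab-delimited file.

        Each <record_tag> element becomes one line; missing child
        elements are written as \\N, which LOAD DATA INFILE (and hence
        mysqlimport) treats as NULL.
        """
        with open(tsv_path, "w", newline="") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            for _event, elem in ET.iterparse(xml_path):
                if elem.tag == record_tag:
                    writer.writerow(
                        elem.findtext(f, default="\\N") for f in fields
                    )
                    elem.clear()  # release the element so memory stays flat
    ```

    The output file can then be fed to `mysqlimport --local mydb mytable.tsv` (the file name must match the table name), or loaded directly with `LOAD DATA LOCAL INFILE`, either of which should be dramatically faster than row-by-row INSERTs.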