I'd like to read a large XML file that contains over a million small bibliographic records (like <article>...</article>
) using libxml in Ruby. I have tried the Reader class in combination with the expand
method to read record by record but I am not sure this is the right approach since my code eats up memory. Hence, I'm looking for a recipe how to conveniently process record by record with constant memory usage. Below is my main loop:
File.open('dblp.xml') do |io|
dblp = XML::Reader.io(io, :options => XML::Reader::SUBST_ENTITIES)
pubFactory = PubFactory.new
i = 0
while dblp.read do
case dblp.name
when 'article', 'inproceedings', 'book':
pub = pubFactory.create(dblp.expand)
i += 1
puts pub
pub = nil
$stderr.puts i if i % 10000 == 0
dblp.next
when 'proceedings','incollection', 'phdthesis', 'mastersthesis':
# ignore for now
dblp.next
else
# nothing
end
end
end
The key here is that dblp.expand
reads an entire subtree (like an <article>
record) and passes it as an argument to a factory for further processing. Is this the right approach?
Within the factory method I then use high-level XPath-like expression to extract the content of elements, like below. Again, is this viable?
def first(root, node)
x = root.find(node).first
x ? x.content : nil
end
pub.pages = first(node,'pages') # node contains expanded node from dblp.expand
When processing big XML files, you should use a stream parser to avoid loading everything in memory. There are two common approaches:
I think that push parsers are nice to use if you want to retrieve just some fields, but they are generally messy to use for complex data extraction and are often implemented whith use case... when...
constructs
Pull parser are in my opinion a good alternative between a tree-based model and a push parser. You can find a nice article in Dr. Dobb's journal about pull parsers with REXML .