I need to convert a large (~100GB) file from XML to JSONL format. I'm doing this in Python. I know I can read the XML file line by line by iterating over it:
with open('file.xml', 'r') as file:
    for line in file:
        process_line(line)
Now that I can do this, how do I write the created jsonl data line by line without overwhelming my RAM? The structure of the XML file is such that there is one tag per line which is converted into one JSON object. Will doing something like this work?
with open('input.xml', 'r') as ip, open('output.jsonl', 'w') as op:
    for line in ip:
        op.write(process_line(line))
I'm guessing this will perform a disk write for every line? If so, that would take forever. Is there any way I can write to disk in batches?
Python's file handling already buffers I/O for you, so op.write() does not hit the disk once per line; writes accumulate in memory and are flushed in chunks. If you want to control the chunk size yourself, set the buffering parameter on your open() calls.
See https://docs.python.org/3/library/functions.html#open
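For example, a minimal sketch of what that could look like. The process_line body below is a made-up stand-in for your converter (assuming, as you said, one complete tag per line), and the 1 MiB buffer size is an arbitrary value you would tune for your workload:

import json
import xml.etree.ElementTree as ET

def process_line(xml_line: str) -> str:
    """Placeholder converter: parse one XML tag into a dict and
    return it as a single JSON line (assumes each line is a complete tag)."""
    elem = ET.fromstring(xml_line)
    record = {"tag": elem.tag, **elem.attrib, "text": elem.text}
    return json.dumps(record) + "\n"

# Writes go through Python's internal buffer and are flushed to disk in
# chunks, not once per op.write() call. 1 MiB here is an arbitrary choice.
BUFFER_SIZE = 1024 * 1024

with open('input.xml', 'r', buffering=BUFFER_SIZE) as ip, \
        open('output.jsonl', 'w', buffering=BUFFER_SIZE) as op:
    for line in ip:
        op.write(process_line(line))

Even without passing buffering explicitly, text files use a default buffer (io.DEFAULT_BUFFER_SIZE, usually 8 KiB), so your original loop already batches writes; a larger buffer just means fewer, bigger flushes.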
There are of course a lot of other ways to speed up processing huge files efficiently, depending on your actual processing needs.