I am parsing an extremely large JSON file using ijson and then writing the contents to a temp file. Afterwards, I overwrite the original file with the contents of the temp file.
import json
import os
import tempfile

import ijson

FILE_NAME = 'file-name'
DIR_PATH = 'path'

# Generator function that yields dictionary objects.
def constructDictionary():
    data = open(os.path.join(DIR_PATH, FILE_NAME + ".json"), "rb")
    row = ijson.items(data, 'item')
    for record in row:
        yield record
    data.close()

def writeToTemp(row, temp):
    # Needs to add a comma
    json.dump(row, temp)

def writeTempToFile(temp):
    temp.seek(0)
    data = open(os.path.join(DIR_PATH, FILE_NAME + ".json"), "wb")
    data.write(b'[')
    for line in temp:
        data.write(line.encode('utf-8'))
    data.write(b']')
    data.close()

if __name__ == "__main__":
    temp = tempfile.NamedTemporaryFile(mode='r+')
    for row in constructDictionary():
        writeToTemp(row, temp)
    writeTempToFile(temp)
    temp.close()
My issue is that the JSON objects end up being written without commas between them. I can't parse the file a second time to add the missing commas, as that would take far too long. Ideally, while writing, I would be able to add a comma at the end of each json.dump(). But how would I handle the final entry?
Is there some way to determine when the generator function has reached the end of the file? Then I could use a flag or pass a variable so that the final comma isn't written.
Or I could use file.seek() to go back to the trailing comma and remove it, but that feels like a hack.
I would appreciate any suggestions. Thank you.
Ideally, while writing, I would be able to add a comma at the end of each json.dump(). But how would I handle the final entry?
I suggest taking a different view: rather than writing a comma after every element except the last, write a comma before every element except the first. That way it is enough to call next once before consuming the generator in the normal way. Consider the following simple example: I want to print A ten times, separated by *. I can do:
import itertools
a10 = itertools.repeat("A", 10)
print(next(a10), end='')
for i in a10:
    print('*', end='')
    print(i, end='')
output:
A*A*A*A*A*A*A*A*A*A
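Applied to the question's JSON case, the same idea might look like the sketch below. Note that write_json_array is a hypothetical helper name, and that an in-memory io.StringIO and a small generator stand in for the temp file and the ijson stream, so treat this as an illustration under those assumptions rather than a drop-in replacement.

```python
import io
import json


def write_json_array(records, out):
    """Write records to out as a JSON array.

    A comma is written before every item except the first, so the
    code never needs to know which item is the last one.
    """
    it = iter(records)
    out.write('[')
    try:
        first = next(it)
    except StopIteration:
        # Empty input: still emit a valid, empty array.
        out.write(']')
        return
    json.dump(first, out)
    for record in it:
        out.write(',')
        json.dump(record, out)
    out.write(']')


# Stand-in for the ijson generator: any iterable of records works.
buf = io.StringIO()
write_json_array(({"id": n} for n in range(3)), buf)
result = buf.getvalue()
print(result)
```

Because the helper only needs an iterable and a writable text file, the generator from the question and the temp file could presumably be passed in the same way, which also removes the need to add the brackets in a separate pass.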