python · numpy

Does Python read all lines of a file when numpy.genfromtxt() is executed?


I have a really large ASCII file (63 million lines or more) that I would like to read using numpy.genfromtxt(), but it takes up a huge amount of memory. I want to know what Python actually does when numpy.genfromtxt() is executed. Does it read all the lines at once?

For example, look at the code below.

import numpy as np
data = np.genfromtxt("large.file.txt")

When I execute the code above, does Python read all the contents of large.file.txt and load them into memory? If so, is there another way to read a large file line by line, so that Python does not use so much memory?


Solution

  • It reads all the lines. It has to. That data array has to hold all of the file's data, and NumPy can't build an array with all of the file's data without reading all of the file.

    That said, the implementation uses a lot more memory than the output needs. The implementation parses the requested columns of the file's data into a list of tuples before applying further processing, and a list of tuples takes a lot more memory than a NumPy array.

    If you want to use less intermediate memory, I think numpy.loadtxt is more efficient on that front: if you dig down into its implementation, you eventually hit a function that stores the parsed data directly into an array instead of building a list of tuples. numpy.loadtxt isn't as flexible as numpy.genfromtxt, but you don't seem to need the extra flexibility.

    This won't make data itself take any less memory, though. Also, numpy.loadtxt does still need extra intermediate memory. It should just be less intermediate memory than numpy.genfromtxt.