I have a directory with a large number of files. I don't want to keep all the file names in memory, but I want to randomly get a subset of these files using a generator.
I can do this using information found in the post "Best way to choose a random file from a directory", but I would like to make sure that my generator never returns the same file twice. So eventually after running the generator (which would return batches) I would have cycled through the entire list of files in the directory.
The methods I can think of still create a file list to compare against (keep a list of already used file names and only return a file if it is not in that list), and they would take longer to execute the more times the generator yielded results.
Is there a way, if I create an array of numbers equal to the number of files in the directory, that when I randomly pop a value from the array, I could get the file at that position? (I think this would take up significantly less memory than an array of strings)
From current comments I have the following code:
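import random
from os import listdir
from os.path import isfile, join

def GetRandomFileListGenerator(self, path):
    fileList = [f for f in listdir(path) if isfile(join(path, f))]
    random.shuffle(fileList)
    # Loop until the list is empty so the final (possibly short) batch is yielded too
    while fileList:
        yield fileList[:self.batchSize]
        fileList = fileList[self.batchSize:]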
I mentioned this approach in the comments, but I don't know if I explained it well, so I'll elaborate here.
You can use random.sample to get multiple values from a collection without duplicates.
import random

def iterate_over_files_randomly():
    the_filenames = ["a", "b", "c", "d", "e", "f"]
    for filename in random.sample(the_filenames, len(the_filenames)):
        yield filename

for filename in iterate_over_files_randomly():
    print(filename)
You can also shuffle the list and iterate over that.
import random

def iterate_over_files_randomly():
    the_filenames = ["a", "b", "c", "d", "e", "f"]
    random.shuffle(the_filenames)
    for filename in the_filenames:
        yield filename

for filename in iterate_over_files_randomly():
    print(filename)
In either case, the generator goes through the entire list of files exactly once, in random order, and never yields the same file twice before the list is exhausted. Sample output (the order will differ from run to run):
b
c
f
e
d
a
Both approaches take O(N) time in total to go through N files. In other words, each additional value yielded takes roughly the same amount of time as the values yielded before it. This is due in part to the fact that the generator function does not slice or otherwise manipulate the list within its for loop.
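Since the question asks for batches from a real directory, the same idea can be sketched as below. This is only a sketch under some assumptions: the function name iter_random_batches and its batch_size parameter are made up for illustration, and it assumes that holding every file name in memory once is acceptable. It shuffles once and then steps through the list by index, so the list itself is never rebuilt between batches.

```python
import os
import random


def iter_random_batches(path, batch_size):
    """Yield lists of file names from `path` in random order, without repeats."""
    # Assumption: keeping one list of all file names in memory is acceptable.
    names = [entry.name for entry in os.scandir(path) if entry.is_file()]
    random.shuffle(names)
    # Walk the shuffled list by index; each slice copies only one batch,
    # and the final batch may be shorter than batch_size.
    for start in range(0, len(names), batch_size):
        yield names[start:start + batch_size]
```

Each call to iter_random_batches(some_dir, 32) starts a fresh random pass over the directory. Files added or removed while a pass is in progress are not picked up until the next call.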