python, file, svmlight

Extracting a random line from a file without loading the file into RAM in Python


I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.

I want to extract random lines from my files to feed into my models, while loading as little as possible into RAM.

I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything into memory.

Could someone give me some hints? Thank you.

EDIT: I forgot to say that I know the number of lines in my files beforehand.


Solution

  • You can use heapq to select n records keyed by a random number, e.g.:

    import heapq
    import random
    
    SIZE = 10  # number of lines to sample
    with open('yourfile') as fin:
        # Assign each line a random key and keep the SIZE lines with the
        # largest keys; this gives a uniform sample while streaming the file.
        sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
    

    This is quite efficient: the heap stays at a fixed size of SIZE, no pre-scan of the data is required, and lines are displaced from the heap as lines with larger random keys come along, so at most SIZE lines are held in memory at once.
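
    Since the question's EDIT says the number of lines is known beforehand, another way to keep memory bounded (a minimal sketch, not part of the original answer; num_lines and 'yourfile' are placeholders) is to pre-pick SIZE random line numbers and then keep only those lines while streaming the file once:

    import random
    
    SIZE = 10
    num_lines = 1_000_000  # placeholder: the known line count of 'yourfile'
    
    # Choose which line numbers to keep before touching the file.
    wanted = set(random.sample(range(num_lines), SIZE))
    
    sample = []
    with open('yourfile') as fin:
        for i, line in enumerate(fin):
            if i in wanted:
                sample.append(line)
                if len(sample) == SIZE:  # stop once every picked line is found
                    break

    Like the heapq approach, this never holds more than SIZE lines in memory, and it can stop reading as soon as the last picked line has been seen.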