python, file, svmlight

Extracting a random line from a file without loading the file into RAM in Python


I have big svmlight files that I'm using for machine learning purposes. I'm trying to see if subsampling those files would lead to good enough results.

I want to extract random lines from my files to feed into my models, while loading as little as possible into RAM.

I saw here (Read a number of random lines from a file in Python) that I could use linecache, but all the solutions end up loading everything into memory.

Could someone give me some hints? Thank you.

EDIT: I forgot to say that I know the number of lines in my files beforehand.


Solution

  • You can use heapq to select n records keyed by a random number, e.g.:

    import heapq
    import random
    
    SIZE = 10  # number of lines to sample
    with open('yourfile') as fin:
        # Assign each line a random key and keep the SIZE lines with the
        # largest keys; this gives a uniform sample while streaming the file.
        sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())
    

    This is quite efficient: the heap stays at a fixed size of SIZE, no pre-scan of the data is required, and lines are displaced from the heap as lines with larger random keys come along, so at most SIZE lines are held in memory at once.
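
    Since the question's EDIT says the number of lines is known beforehand, another way to keep memory bounded (a minimal sketch, not part of the original answer; num_lines and 'yourfile' are placeholders) is to pre-pick SIZE random line numbers and then keep only those lines while streaming the file once:

    import random
    
    SIZE = 10
    num_lines = 1_000_000  # placeholder: the known line count of 'yourfile'
    
    # Choose which line numbers to keep before touching the file.
    wanted = set(random.sample(range(num_lines), SIZE))
    
    sample = []
    with open('yourfile') as fin:
        for i, line in enumerate(fin):
            if i in wanted:
                sample.append(line)
                if len(sample) == SIZE:  # stop once every picked line is found
                    break

    Like the heapq approach, this never holds more than SIZE lines in memory, and it can stop reading as soon as the last picked line has been seen.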