Randomly Pick Lines From a File Without Slurping It With Unix


I have a file with 10^7 lines, and I want to pick 1/100 of those lines at random. This is the AWK code I have, but it slurps the entire file into memory beforehand, and my PC does not have enough memory for that. Is there another way to do it?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0}
END {  
  for ( i=1;i<=c ;i++ )  { 
    num=int(rand() * c)
    if ( a[num] ) {
        print a[num]
        delete a[num]
        d++
    }
    if ( d == c/100 ) break
  }
 }' file

Solution

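  • With that many lines, are you sure you need exactly 1%, or would a statistical estimate be enough?

    In the second case, just keep each line with a probability of 1%. Only the current line is held in memory, and for 10^7 non-empty lines the output will be about 100,000 lines, give or take a few hundred: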

    awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' file


    If you'd like the header line plus a random sample of lines after, use:

    awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}' file
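
If an exact count really is required, one option that is not part of the answer above is reservoir sampling, which keeps only k lines in memory at any time. The following is a minimal sketch under that assumption; the function name `sample_exact` and the variable `k` are made up for illustration, and k has to be chosen up front (for example 100000 for 1% of 10^7 lines).

```shell
# Hypothetical helper (not from the original answer):
# print exactly k non-empty lines, each chosen with equal probability.
# Usage: sample_exact K FILE...
sample_exact() {
  k=$1; shift
  awk -v k="$k" '
    BEGIN { srand() }
    /^$/ { next }                  # skip empty lines, as in the commands above
    {
      n++                          # number of non-empty lines seen so far
      if (n <= k) {
        r[n] = $0                  # keep the first k lines as the initial reservoir
      } else {
        j = int(rand() * n) + 1    # uniform integer in 1..n
        if (j <= k) r[j] = $0      # replace a kept line with probability k/n
      }
    }
    END {
      m = (n < k) ? n : k          # fewer than k lines available: print them all
      for (i = 1; i <= m; i++) print r[i]
    }
  ' "$@"
}
```

For the file in the question, `sample_exact 100000 file` would hold 100,000 lines in memory instead of all 10^7. The lines are printed in reservoir order, not in their original file order.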