bash awk sed grep text-processing

Extract N lines with no duplicate strings in either of the first two columns


I have a huge CSV file of unique pair combinations. I need to extract N random lines from this file, with the constraint that every string in columns 1 and 2 of the selection is unique, so that the combined list of strings from columns 1 and 2 contains exactly 2*N distinct entries.

For example, given this input file:

A,B,0.1747
B,C,0.373
C,D,0.585
E,J,0.8585
E,A,0.5657
F,A,0.5656

A possible desired output for 3 random lines would be:

A,B,0.1747
C,D,0.585
E,J,0.8585

or

B,C,0.373
E,J,0.8585
F,A,0.5656

So far, I have tried to reduce the file to unique strings in columns 1 and 2 separately, using:

shuf file.csv | awk -F',' '!a[$1]++' | awk -F',' '!a[$2]++'

to get

B,C,0.373
E,A,0.5657
A,B,0.1747
C,D,0.585

I don't think this is the correct approach: each awk deduplicates only its own column, so a string can still appear in column 1 of one line and column 2 of another (as A and C do above). Is there a way of looping through the rows and skipping a row if either of its first two strings has already been seen in either column?


Solution

  • When you use awk twice, there is usually a better solution with a single awk script. Taking N=200 as an example (change the n==200 test to your own N), you can try:

    shuf file.csv |
      awk -F, '!(a[$1]+a[$2]) {a[$1]++; a[$2]++; n++; print} n==200 {exit}'
    
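    Here the array a records every string already seen in either of the
    first two columns: a line is printed only when neither of its fields
    is in a, and the script exits once N lines (200 above) have been
    printed. As a quick check on the six-line example input, lowering the
    limit to 3:

    shuf file.csv |
      awk -F, '!(a[$1]+a[$2]) {a[$1]++; a[$2]++; n++; print} n==3 {exit}'

    might print, for example (the output varies between runs because shuf
    shuffles randomly):

    A,B,0.1747
    C,D,0.585
    E,J,0.8585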

    Variant using in and next:

    shuf file.csv |
      awk -F, '$1 in a || $2 in a {next} {a[$1]; a[$2]; n++; print} n==200 {exit}'
    
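    A note on the idiom: $1 in a tests whether a key exists without
    creating it, while the bare reference a[$1] creates the key with an
    empty value, which is all the later in tests need. Both versions
    therefore mark seen strings the same way.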

    Note: your example has no input lines whose first and second fields are identical, but if this can happen and you want to skip such lines, you can use:

    shuf file.csv |
      awk -F, '($1!=$2) && !(a[$1]+a[$2]) {a[$1]++; a[$2]++; n++; print}
               n==200 {exit}'
    

    Or:

    shuf file.csv |
      awk -F, '$1==$2 || $1 in a || $2 in a {next} {a[$1]; a[$2]; n++; print}
               n==200 {exit}'
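
    If N changes often, one way to avoid editing the script (a sketch,
    assuming N is kept in a shell variable) is to pass it to awk with the
    -v option:

    N=200    # desired number of lines to extract
    shuf file.csv |
      awk -F, -v n="$N" '$1==$2 || $1 in a || $2 in a {next}
                         {a[$1]; a[$2]; c++; print} c==n {exit}'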