I have a huge CSV file of unique pair combinations. I need to extract N random lines from this file, but the strings in columns 1 and 2 must all be distinct, so that the list of unique strings from columns 1 and 2 combined contains exactly 2*N entries.
For example, given this input file:
A,B,0.1747
B,C,0.373
C,D,0.585
E,J,0.8585
E,A,0.5657
F,A,0.5656
A possible desired output for 3 random lines would be:
A,B,0.1747
C,D,0.585
E,J,0.8585
or
B,C,0.373
E,J,0.8585
F,A,0.5656
So far I have tried to reduce the file to unique strings in column 1 and column 2 separately using:
shuf file.csv | awk -F',' '!a[$1]++' | awk -F',' '!a[$2]++'
to get
B,C,0.373
E,A,0.5657
A,B,0.1747
C,D,0.585
I don't think this is the correct approach: the two passes track each column independently, so in the output above A, B, and C each still appear twice across the two columns. Is there a way of looping through the rows and skipping a row if one of its strings is already present in either of the two columns?
When you use awk twice there is usually a better solution with a single awk script. You can try (200 here stands for your N):
shuf file.csv |
awk -F, '!(a[$1]+a[$2]) {a[$1]++; a[$2]++; n++; print} n==200 {exit}'
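If the condition looks cryptic: a[$1]+a[$2] is zero only while neither value has been seen yet. The same script, spread out with comments:
shuf file.csv |
awk -F, '
    !(a[$1] + a[$2]) {    # true only if neither field was seen before
        a[$1]++; a[$2]++  # mark both values as used
        n++; print        # count and output the line
    }
    n == 200 { exit }     # stop once 200 lines have been selected
'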
A variant using the in operator and next:
shuf file.csv |
awk -F, '$1 in a || $2 in a {next} {a[$1]; a[$2]; n++; print} n==200 {exit}'
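The bare references a[$1]; a[$2] work because merely referencing an array element in awk creates its key, whereas the in operator only tests membership without creating anything. A minimal demonstration of this standard awk behavior:
awk 'BEGIN {
    if ("x" in a) print "never printed"   # in tests membership, creates no key
    a["y"]                                # a bare reference creates the key...
    if ("y" in a) print "y exists now"    # ...so this line prints
}'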
Note: your example has no input lines with identical first and second fields. But if this can happen and you want to skip such lines, you can use:
shuf file.csv |
awk -F, '($1!=$2) && !(a[$1]+a[$2]) {a[$1]++; a[$2]++; n++; print}
n==200 {exit}'
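To see the extra guard at work, you can feed it a hand-made self-pair (the line G,G,0.5 is illustrative, not from your sample input; shuf is dropped so the result is deterministic):
printf 'G,G,0.5\nA,B,0.1747\n' |
awk -F, '($1!=$2) && !(a[$1]+a[$2]) {a[$1]++; a[$2]++; print}'
This prints only A,B,0.1747; the self-pair is rejected before its values are marked as used.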
Or:
shuf file.csv |
awk -F, '$1==$2 || $1 in a || $2 in a {next} {a[$1]; a[$2]; n++; print}
n==200 {exit}'
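All of the one-liners above hard-code the count; to take it from a shell variable instead, pass it to awk with -v (a sketch: want is just an illustrative name, and N=200 an example value):
N=200
shuf file.csv |
awk -F, -v want="$N" '
    $1 == $2 || $1 in a || $2 in a { next }  # skip self-pairs and repeated values
    { a[$1]; a[$2]; n++; print }
    n == want { exit }
'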