bashawkbioinformaticsvcf-variant-call-format

Optimal Solution for Integer Encoding VCF file


I have a working solution to my problem but it is slow. I am curious about the recommended approach to speeding it up and to see how fast it can be made. Here is an example input file

CHROM    POS REF   ALT   Geno      value                                                                                       
Chr16 616504 T     C     X93.968   0|1:7,28:35:99:0|1:616504_T_C:787,0,177:616504   
Chr16 616504 T     C     BESC.1    0/0:48,0:48:99:.:.:0,114,1710:.                  
Chr16 616504 T     C     BESC.10   1|1:0,23:23:72:1|1:616504_T_C:1059,72,0:616504   
Chr16 616504 T     C     BESC.100  0/0:34,0:34:96:.:.:0,96,1440:.                   
Chr16 616504 T     C     BESC.1001 0/0:47,0:47:99:.:.:0,120,1800:.                  
Chr16 616504 T     C     BESC.1002 0/0:39,0:39:99:.:.:0,108,948:.    

           

The goal is to take the 1st and 3rd character from the value column and sum them, then output a similar file with the value column replaced with this sum. Example output for first two lines:

CHROM    POS REF   ALT   Geno      value                                                                                       
Chr16 616504 T     C     X93.968   1   
Chr16 616504 T     C     BESC.1    0   

Here is my current solution where STDIN 1 is the input file name and STDIN 2 is the output file name:

#!/bin/bash
i=0
len=$(cat $1 | wc -l)

touch $2
while read -r line; do
    let "i++"
    geno=$(echo "$line" | awk '{ print $6 }'| cut -c1,3 | fold -w1 | paste -sd+ - | bc)

    echo "$line" | awk -v g="$geno" '{ $6=""; print $0 " " g}' >> $2

    echo "Processed " "$i/$len"

done < $1

The actual input file has 1,707,993 entries. With my solution this takes about 4.5 hrs to compute. Ideally I could run this in under an hour but I'm not sure how realistic that is. Thanks!


Solution

  • This loop is going to be slow because it's creating and destroying 6 external processes per line of your input file.

    In Bash, it's generally much faster to invoke a command to handle an entire file than it is to invoke the command once per line in that file.

    For example, if you wanted to do the file processing in Awk, you could do it like this:

    #!/bin/bash
    set -eu
    
    awk -v out_file="$2" 'NR == 1 {
        print > out_file;
    }
    NR != 1 {
        $6 = substr($6, 1, 1) +  substr($6, 3, 1);
        print > out_file;
        print "Processed " NR;
    }' "$1"
    

    Explanation:

    1. It takes the 6th field, and takes the first and third characters and sums them. Then it re-assigns them back to field 6. In awk this means that a print will emit the original output with the one field changed.
    2. The print is redirected to out_file.
    3. Emit a progress indicator.