I have a working solution to my problem but it is slow. I am curious about the recommended approach to speeding it up and to see how fast it can be made. Here is an example input file
CHROM POS REF ALT Geno value
Chr16 616504 T C X93.968 0|1:7,28:35:99:0|1:616504_T_C:787,0,177:616504
Chr16 616504 T C BESC.1 0/0:48,0:48:99:.:.:0,114,1710:.
Chr16 616504 T C BESC.10 1|1:0,23:23:72:1|1:616504_T_C:1059,72,0:616504
Chr16 616504 T C BESC.100 0/0:34,0:34:96:.:.:0,96,1440:.
Chr16 616504 T C BESC.1001 0/0:47,0:47:99:.:.:0,120,1800:.
Chr16 616504 T C BESC.1002 0/0:39,0:39:99:.:.:0,108,948:.
The goal is to take the 1st and 3rd character from the value
column and sum them, then output a similar file with the value column replaced with this sum. Example output for first two lines:
CHROM POS REF ALT Geno value
Chr16 616504 T C X93.968 1
Chr16 616504 T C BESC.1 0
Here is my current solution where STDIN 1 is the input file name and STDIN 2 is the output file name:
#!/bin/bash
i=0
len=$(cat $1 | wc -l)
touch $2
while read -r line; do
let "i++"
geno=$(echo "$line" | awk '{ print $6 }'| cut -c1,3 | fold -w1 | paste -sd+ - | bc)
echo "$line" | awk -v g="$geno" '{ $6=""; print $0 " " g}' >> $2
echo "Processed " "$i/$len"
done < $1
The actual input file has 1,707,993 entries. With my solution this takes about 4.5 hrs to compute. Ideally I could run this in under an hour but I'm not sure how realistic that is. Thanks!
This loop is going to be slow because it's creating and destroying 6 external processes per line of your input file.
In Bash, it's generally much faster to invoke a command to handle an entire file than it is to invoke the command once per line in that file.
For example, if you wanted to do the file processing in Awk, you could do it like this:
#!/bin/bash
set -eu
awk -v out_file="$2" 'NR == 1 {
print > out_file;
}
NR != 1 {
$6 = substr($6, 1, 1) + substr($6, 3, 1);
print > out_file;
print "Processed " NR;
}' "$1"
Explanation:
print
will emit the original output with the one field changed.out_file
.