This bash script is meant to be part of a pipeline that processes zipped .vcf files that contain genomes from multiple patients (which means the files are huge even when zipped, around 3-5 GB).
My problem is that I keep running out of memory when running this script on a GCP high-mem VM.
I am hoping there is a way to optimize the memory usage so that this doesn't fail. I looked into it but found nothing.
#!/bin/bash
for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue
    name=${filename##*/}
    base=${name%.vcf.gz}
    bcftools query -l "$filename" >> ${base}_list.txt
    for line in `cat ${base}_list.txt`; do
        bcftools view -s "$line" "$filename" -o ${line}.vcf.gz
        gzip ${line}.vcf
    done
done
If you run out of memory when using bcftools query/view or gzip, look for options in the manual that might reduce the memory footprint. In the case of gzip you might also switch to an alternative implementation, or even consider switching the compression algorithm altogether (zstd is pretty good).
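As a concrete sketch (not a recommendation, since the right choice depends on what the downstream tools expect), the gzip call inside the loop could be swapped out without touching anything else:

# Option 1: pigz is a parallel drop-in replacement for gzip and produces the same .gz files.
pigz "$line.vcf"

# Option 2: zstd writes "$line.vcf.zst"; --rm removes the source file after a
# successful compression, mirroring gzip's default behaviour.
zstd -q --rm "$line.vcf"

Keep in mind that .zst files are not the bgzipped .vcf.gz that bcftools and tabix expect, so only switch formats if nothing downstream needs to read the compressed output directly.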
However, I have a feeling that the problem could be the for line in `cat ${base}_list.txt` loop. The whole ..._list.txt file is loaded into memory before the loop even starts. Also, reading lines that way has all kinds of problems, like splitting lines at whitespace, expanding globs like *, and so on. Use this instead:
while read -r line; do
    bcftools view -s "$line" "$filename" -o "$line.vcf"
    gzip "$line.vcf"
done < "${base}_list.txt"
By the way: Are you sure you want bcftools query -l "$filename" >> ${base}_list.txt to append? The file ${base}_list.txt will keep growing each time the script is executed. Consider overwriting the file using > instead of >>.
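That is, the same command with only the redirection operator changed:

bcftools query -l "$filename" > "${base}_list.txt"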
However, in that case you might not need the file at all, since you could use this instead:
bcftools query -l "$filename" |
while read -r line; do
    bcftools view -s "$line" "$filename" -o "$line.vcf"
    gzip "$line.vcf"
done
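Putting those pieces together, a rough sketch of the whole script could look like this. It assumes you ultimately want one compressed VCF per sample: bcftools view -O z writes bgzip-compressed output directly, which avoids the extra compression pass (and the mismatch in the original script between the .vcf.gz file written by -o and the .vcf file passed to gzip), and the sample list is streamed instead of stored in a file:

#!/bin/bash
# Sketch only: one compressed VCF per sample, no intermediate list file.
for filename in ./*.vcf.gz; do
    [ -e "$filename" ] || continue
    # Stream the sample names straight into the loop instead of writing them to a file.
    bcftools query -l "$filename" |
    while read -r line; do
        # -O z makes bcftools write bgzip-compressed VCF, so no separate gzip step is needed.
        bcftools view -s "$line" -O z -o "${line}.vcf.gz" "$filename"
    done
done

If memory is still tight after that, the remaining suspects are the bcftools invocations themselves, which is where the options in the manual come in.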