I am using bcftools to merge two vcf files. File 1 is 590MB with 4,732,099 unique variants. File 2 is 704MB with 1,774,673 unique variants. The files have some overlapping variants and are from entirely different samples. Ideally I want to merge them as they are without taking an intersection.
When I simply merge these files with
bcftools merge -o output.vcf file1.vcf.gz file2.vcf.gz
it works.
When I left-align and split these variants prior to the merge, though, I get a rather cryptic Segmentation fault: 11.
The left alignment, normalisation and splitting is achieved with:
bcftools norm -m -both -f reference.fa.gz file1.vcf.gz -o file1.norm.vcf.gz
The output for each file looks like:
Lines total/split/realigned/skipped: 4732099/310967/86666/0
Lines total/split/realigned/skipped: 1774673/105052/119007/0
I see that a bug was reported previously in bcftools which resulted in a segmentation fault associated with working with 1000s of files. To investigate whether the file size was important I tried to merge file1.vcf with a large 1000 genomes vcf file. This fails with the same segmentation fault so I wonder if this is the issue. I'm not sure how I would go about working out which part of the system setup is insufficient.
If I run the same command in a Jupyter notebook I get a little more detail:
bash: line 1: 14238 Segmentation fault: 11 bcftools merge broad.norm.vcf.gz decode.norm.vcf.gz > case.vcf.gz
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
Cell In[28], line 1
----> 1 get_ipython().run_cell_magic('bash', '', 'bcftools merge broad.norm.vcf.gz decode.norm.vcf.gz > case.vcf.gz\n')
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/interactiveshell.py:2478, in InteractiveShell.run_cell_magic(self, magic_name, line, cell)
2476 with self.builtin_trap:
2477 args = (magic_arg_s, cell)
-> 2478 result = fn(*args, **kwargs)
2480 # The code below prevents the output from being displayed
2481 # when using magics with decodator @output_can_be_silenced
2482 # when the last Python token in the expression is a ';'.
2483 if getattr(fn, magic.MAGIC_OUTPUT_CAN_BE_SILENCED, False):
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/magics/script.py:154, in ScriptMagics._make_script_magic.<locals>.named_script_magic(line, cell)
152 else:
153 line = script
--> 154 return self.shebang(line, cell)
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/IPython/core/magics/script.py:314, in ScriptMagics.shebang(self, line, cell)
309 if args.raise_error and p.returncode != 0:
310 # If we get here and p.returncode is still None, we must have
311 # killed it but not yet seen its return code. We don't wait for it,
312 # in case it's stuck in uninterruptible sleep. -9 = SIGKILL
313 rc = p.returncode or -9
--> 314 raise CalledProcessError(rc, cell)
CalledProcessError: Command 'b'bcftools merge broad.norm.vcf.gz decode.norm.vcf.gz > case.vcf.gz\n'' returned non-zero exit status 139.
Potentially in keeping with this being a size-based issue, if I take only the first 1000 normalised variants they merge fine (though I guess there could be odd variants later in the file that are causing the issue).
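One way to tell a size problem from a problem with particular records would be to merge one contig at a time and see which ones fail. The following is only a sketch: the function name merge_by_contig and the merge.CONTIG.vcf.gz output names are made up for illustration, and it assumes both inputs are bgzip-compressed and indexed with bcftools index. Contigs that appear only in the second file are not visited.

```shell
# Sketch: merge one contig at a time to see where the crash happens.
# Assumes both inputs are bgzip-compressed and indexed (bcftools index).
merge_by_contig() {
    a=$1
    b=$2
    # `bcftools index -s` prints one line per contig: name, length, record count.
    bcftools index -s "$a" | cut -f1 | while read -r contig; do
        if bcftools merge -r "$contig" "$a" "$b" -O z -o "merge.$contig.vcf.gz"; then
            echo "$contig ok"
        else
            # In the else branch, $? still holds the exit status of the merge.
            echo "$contig FAILED (exit status $?)"
        fi
    done
}
```

It would be called as merge_by_contig broad.norm.vcf.gz decode.norm.vcf.gz. If only some contigs fail, the same idea can be repeated with narrower -r regions to close in on the offending records.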
I'm working on an M2 mac with 32GB unified memory.
Output of ulimit -a:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8176
cpu time (seconds, -t) unlimited
max user processes (-u) 5333
virtual memory (kbytes, -v) unlimited
Can anyone provide advice on how I could troubleshoot further and hopefully merge these vcf files?
Thanks, Angus
Build the bcftools programs from source, as close as possible to the version that you are using. Then try to reproduce the problem, namely that this crashes:
bcftools merge broad.norm.vcf.gz decode.norm.vcf.gz
If it reproduces, try it with the latest code as well, in case it's something that has already been fixed.
If it still crashes, drill into it, for instance with the GNU debugger:
gdb --args ./bcftools merge broad.norm.vcf.gz decode.norm.vcf.gz
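On an Apple Silicon Mac such as the M2 in the question, gdb may not be available, and lldb (installed with the Xcode command line tools) plays the same role. A minimal sketch, assuming bcftools was built in the current directory with debug symbols; the wrapper name debug_merge is made up for illustration:

```shell
# Sketch: run the failing merge under lldb and print a backtrace if it crashes.
debug_merge() {
    # --batch runs the given commands non-interactively;
    # -o runs a command once the program is loaded,
    # -k runs a command only if the program crashes.
    lldb --batch -o run -k bt -k quit -- ./bcftools merge "$@" -o /dev/null
}
```

Calling debug_merge broad.norm.vcf.gz decode.norm.vcf.gz should then show the stack at the point of the segmentation fault, which is the most useful thing to attach to a bug report.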
The tool works with both compressed and uncompressed files. If the problem reproduces only with the .gz files and not with plain files, the cause lies in the decompression.
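Before comparing compressed and plain inputs, it may be worth confirming that each .gz file really is an intact compressed stream. The sketch below relies on BGZF files being valid gzip, so gzip -t can test them; the function name check_inputs is made up for illustration. It also catches a file that is named .vcf.gz but actually contains plain text, which can happen if the output type was not requested explicitly when the file was written (a possibility, not something the question confirms).

```shell
# Sketch: check that each input is a readable gzip/BGZF stream.
check_inputs() {
    for f in "$@"; do
        # gzip -t decompresses the whole file and discards the output;
        # it fails on corrupt streams and on files that are not gzip at all.
        if gzip -t "$f" 2>/dev/null; then
            echo "$f: compressed stream ok"
        else
            echo "$f: not gzip/BGZF or corrupt"
        fi
    done
}
```

For example, check_inputs broad.norm.vcf.gz decode.norm.vcf.gz. A file that fails here can be rewritten with bcftools view -O z and re-indexed before retrying the merge.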
According to the issues database on GitHub, a crash was fixed in the merge feature just two days ago: