
Combine multiple VCF files into one large VCF file


I have a set of VCF files grouped by ethnicity, such as American Indian, Chinese, European, etc.

Each ethnicity has 100+ files.

So far, I have computed the variant QC metrics (call_rate, n_het, etc.) for a single file, as shown in the Hail tutorial.

Now I would like to combine the files into one file per ethnicity and then compute the variant QC metrics on that combined file.

I have already looked at this post and this post, but I don't think they address my question.

How can I do this across all files under a specific ethnicity?

Can anyone help me with this?

Is there a way to do this with Hail, Python, R, or other tools?


Solution

  • You could use Variant Transforms to achieve this. Variant Transforms is a tool for parsing VCF files and importing them into BigQuery. It can also perform the reverse transform: exporting variants stored in BigQuery tables to a VCF file. So the workflow is: multiple VCF files -> BigQuery -> single VCF file

    Variant Transforms can easily handle multiple input files. It can also apply more complex logic to merge identical variants that appear in multiple files into a single record. Once all your variants are loaded into BigQuery, you can export them to a VCF file.

    Note that Variant Transforms creates a separate table for each chromosome to optimize query costs. You can easily create a VCF file for each chromosome and then merge them together to create a single one.

    You can reach out to the Variant Transforms team if you need help with this task.
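
    For the last step (joining the per-chromosome VCF files into a single one), here is a minimal Python sketch. It is not part of Variant Transforms; the function name `concat_vcfs` and the file names in the usage comment are made up for illustration. It assumes every input file has the same `#CHROM` column line (same samples in the same order) and that you pass the files in the chromosome order you want. In practice an established tool such as `bcftools concat` is probably the better choice; this only shows the idea.

```python
import gzip
from pathlib import Path


def _open_text(path):
    """Open a plain-text or gzip-compressed VCF for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def concat_vcfs(inputs, output):
    """Concatenate VCF files that share the same sample columns.

    The header (## meta lines and the #CHROM line) is taken from the
    first file only; headers of later files are skipped. Returns the
    number of variant records written to `output`.
    """
    reference_columns = None  # column line of the first file
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            with _open_text(path) as handle:
                for line in handle:
                    if line.startswith("##"):
                        # Keep meta lines from the first file only.
                        if reference_columns is None:
                            out.write(line)
                    elif line.startswith("#CHROM"):
                        columns = line.rstrip("\n").split("\t")
                        if reference_columns is None:
                            reference_columns = columns
                            out.write(line)
                        elif columns != reference_columns:
                            raise ValueError(
                                f"{path}: sample columns differ from the first file"
                            )
                    elif line.strip():
                        # Data line: normalise the trailing newline.
                        out.write(line.rstrip("\n") + "\n")
                        n_records += 1
    return n_records


# Hypothetical usage (file names are placeholders):
# concat_vcfs(["chr1.vcf", "chr2.vcf", "chrX.vcf"], "european_all.vcf")
```

    The output is written uncompressed and is not re-sorted or de-duplicated, so overlapping inputs would need extra handling.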