mergesnakemakegatkpicard

Error when Running Picard or GATK MergeVcfs Using Snakemake


I am new to Snakemake. I tried to build a small pipeline to merge two VCF files but I keep getting an error. Did anyone encounter this problem before ?

This is my original code for merge two VCF files:

ID = [line.strip() for line in open("0.rawdata_fastq/sample-pID.txt", "r")]


from datetime import datetime
date = datetime.now().strftime("%Y%m%d_%H%M%S")

import tempfile
from snakemake.shell import shell
from snakemake_wrapper_utils.java import get_java_opts



rule all_variant_filtering:
    input:
        expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz", pID=ID),
        expand("8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz", pID=ID),
        expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz", pID=ID),
        expand("8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz", pID=ID),
        expand("8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz", pID=ID)



# Step 8-1 : SNPs VariantFiltration
rule snp_filter:
    input:
        vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_snp="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz"
    log:
        vcf_snp_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.SNP." + date +".log"
    message:
        "\nStep 8-1: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (SNP)"
    shell:
        """
        {input.gatk} VariantFiltration \
        --reference {input.ref} \
        --variant {input.vcf} \
        --filter-expression "QD < 2.0" --filter-name "QD" \
        --filter-expression "QUAL < 30.0" --filter-name "QUAL30" \
        --filter-expression "SOR > 3.0" --filter-name "SOR" \
        --filter-expression "FS > 60.0" --filter-name "FS" \
        --filter-expression "MQ < 40.0" --filter-name "MQ" \
        --filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum" \
        --filter-expression "ReadPosRankSum < -8.0" --filter-name "RPRS" \
        --output {output.vcf_snp} 2>&1 | tee -a {log.vcf_snp_log}
        """



# Step 8-2 : InDel VariantFiltration
rule indel_filter:
    input:
        vcf="7.variant_calling/{pID}_variant_calling/{pID}.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_indel="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz"
    log:
        vcf_indel_log="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}-VariantFiltration.InDel." + date +".log"
    message:
        "\nStep 8-2: VariantFiltration: Filter variant calls based on INFO and/or FORMAT annotations (InDel)"
    shell:
        """
        {input.gatk} VariantFiltration \
        --reference {input.ref} \
        --variant {input.vcf} \
        --filter-expression "QD < 2.0" --filter-name "QD" \
        --filter-expression "SOR > 10.0" --filter-name "SOR" \
        --filter-expression "FS > 200.0" --filter-name "FS" \
        --filter-expression "ReadPosRankSum < -20.0" --filter-name "RPRS" \
        --output {output.vcf_indel} 2>&1 | tee -a {log.vcf_indel_log}
        """



# Step 8-3 : SNPs SelectVariants
rule snp_select:
    input:
        snp_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_SNP.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_snp_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz"
    log:
        vcf_snp_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.SNP." + date +".log"
    message:
        "\nStep 8-3 : SelectVariants: Select a subset of variants from a VCF file (SNP)"
    shell:
        """
        {input.gatk} SelectVariants \
        --reference {input.ref} \
        --variant {input.snp_vcf} \
        --output {output.vcf_snp_filter} \
        --select-type-to-include SNP --select-type-to-include MNP 2>&1 | tee -a {log.vcf_snp_filter_log}
        """



# Step 8-4 : InDels SelectVariants
rule indel_select:
    input:
        indel_vcf="8.variant_filter/1.VariantFiltration/{pID}_VariantFiltration/{pID}.filtered_InDel.vcf.gz",
        ref="/mnt/mca/reference_data/ncbi_reference/reference_genome/UCSC_id/GRCh38_latest_genomic.fa",
        gatk="/home/mca-gene3/Tools/gatk-4.5.0.0/gatk"
    output:
        vcf_indel_filter="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
    log:
        vcf_indel_filter_log="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}-SelectVariants.InDel." + date +".log"
    message:
        "\nStep 8-4 : SelectVariants: Select a subset of variants from a VCF file (InDel)"
    shell:      
        """
        {input.gatk} SelectVariants \
        --reference {input.ref} \
        --variant {input.indel_vcf} \
        --output {output.vcf_indel_filter} \
        --select-type-to-include INDEL --select-type-to-include MIXED 2>&1 | tee -a {log.vcf_indel_filter_log}
        """

# Step 8-5 : Merge selected.SNP and selected.InDel to a VCF.gz file
rule mergr_to_vcf:  
    input:     
        VCF_SNP="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_SNP.vcf.gz", 
        VCF_INDEL="8.variant_filter/2.SelectVariants/{pID}_SelectVariants/{pID}.selected_InDel.vcf.gz"
    output:  
        VCF_MERGE="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}.vronaly.vcf.gz"
    log:
        merge_log="8.variant_filter/3.MergeVcfs/{pID}_MergeVcfs/{pID}-MergeVcfs." + date +".log"
    message:
        "\nStep 8-5 : Merge to a VCF file."
    shell:
        "gatk MergeVcfs -I {input.VCF_SNP} -I {input.VCF_INDEL} -O {output.VCF_MERGE} 2>&1 | tee -a {log.merge_log}"

Everything ran successfully until step 8-5, and the error message is

IndentationError in file <string>, line 11:
unindent does not match any outer indentation level:
        VCF_INDEL="/home/fanny/Documents/SRR1695644.selected_InDel.vcf"
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 541, in _generate_tokens_from_c_tokenizer
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/tokenize.py", line 537, in _generate_tokens_from_c_tokenizer

I don't understand why there are errors in tokenize.py because, as far as I know, it is a default python script.

So I try another two ways to solve the problem. One was changing gatk to picard in shell, but got same error. Another was using wrapper:"v3.9.0/bio/picard/mergevcfs", and I got new error message:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job stats:
job           count
----------  -------
merge_vcfs        1
total             1

Select jobs to execute...
Execute 1 jobs...

[Tue May 14 09:50:23 2024]
Job 0: 
Step 8-5 : Merge to a VCF file.
Reason: Missing output files: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz

Traceback (most recent call last):
  File "/home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py", line 24, in <module>
    shell(
  File "/home/fanny/miniforge3/envs/snakemake/lib/python3.12/site-packages/snakemake/shell.py", line 297, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  picard MergeVcfs  -Xmx164M -Djava.io.tmpdir=/tmp  --INPUT /home/fanny/Documents/SRR1695644.selected_SNP.vcf --INPUT /home/fanny/Documents/SRR1695644.selected_InDel.vcf --TMP_DIR /tmp/tmpkkyq14g1 --OUTPUT /home/fanny/Documents/SRR1695644.vronaly.vcf.gz  2> /home/fanny/Documents/SRR1695644-MergeVcfs.log' returned non-zero exit status 127.
RuleException:
CalledProcessError in file /home/fanny/Documents/test.smk, line 22:
Command 'set -euo pipefail;  /home/fanny/miniforge3/envs/snakemake/bin/python3.12 /home/fanny/Documents/.snakemake/scripts/tmpydivs5px.wrapper.py' returned non-zero exit status 1.
[Tue May 14 09:50:25 2024]
Error in rule merge_vcfs:
    jobid: 0
    input: /home/fanny/Documents/SRR1695644.selected_SNP.vcf, /home/fanny/Documents/SRR1695644.selected_InDel.vcf
    output: /home/fanny/Documents/SRR1695644.vronaly.vcf.gz
    log: /home/fanny/Documents/SRR1695644-MergeVcfs.log (check log file(s) for error details)
    conda-env: /home/fanny/Documents/.snakemake/conda/bf54fead3e98a647607f11e9b2421b3c_

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-14T095023.353231.snakemake.log
WorkflowError:
At least one job did not complete successfully.

Can anyone help me with this issue ?

Thank you so much !


Solution

  • The error: non-zero exit status 127 typically means a command was not found. Check that picard is a valid command in your $PATH and not, say, a shell alias.

    Easiest way to check - in your shell run type picard.

    Regarding the IndentationError, it's normal for these type of syntax error messages to reference tokenize.py even though the problem is in your Snakefile and tokenize.py is just the that has detected the problem. This is something that should be fixed in Snakemake, probably.