google-cloud-platform, nextflow

Nextflow process running in GCP stays idle forever after crashing - how to increase the gcsfuse max ops limit?


I am running a Nextflow pipeline on GCP (Batch). The pipeline has multiple processes, and the first half of them complete successfully.

However, when it reaches one specific process, the command never even starts; the process just stays idle indefinitely.

From within the VM, I tried to ls inside the two storage buckets that are mounted under /mnt/disks and are required by this process. Note that these two buckets are also required by the other processes, and they work fine there.

However, for this particular process one of the two buckets is inaccessible: ls on it hangs forever.

I am not sure whether this ls issue is related to the process being stuck in general, but I have no idea where to start debugging because no error comes out of it.

The pipeline works well locally.

Does anyone have any idea where I could start looking for answers?

EDIT #1: I figured out why the ls is stuck

According to this post and this post, there may be a limit to the number of concurrent operations accepted by a bucket mounted with gcsfuse.
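
For reference, to test this theory I plan to remount the problem bucket by hand from inside the VM with relaxed limits. This is only a rough sketch: MY_BUCKET and /mnt/disks/debug are placeholders, and the exact flags available depend on the gcsfuse version installed on the Batch VM (check gcsfuse --help first):

    # Unmount and remount the bucket with relaxed gcsfuse limits, to check whether
    # throttling on the fuse mount is what makes ls hang.
    # MY_BUCKET and /mnt/disks/debug are placeholders.
    fusermount -u /mnt/disks/debug 2>/dev/null || true
    mkdir -p /mnt/disks/debug
    gcsfuse \
      --implicit-dirs \
      --limit-ops-per-sec=-1 \
      --max-conns-per-host=100 \
      MY_BUCKET /mnt/disks/debug

Here --limit-ops-per-sec=-1 disables the per-second operation throttle and --max-conns-per-host raises the connection cap per host, but this only affects a manual mount like this one, not the mount that Nextflow sets up automatically.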

EDIT #2: What could be wrong with the command

I haven't posted the command until now because it is huge and mostly works, but I have finally narrowed down what the problem could be.

The command is structured like this:

        ${VEP} \
        --offline --cache --force_overwrite --show_ref_allele \
        --numbers \
        --fork ${task.cpus} \
        --refseq \
        --cache_version ${params.database.vep_cache_version} \
        --dir_cache database/${params.database.cache_VEP} \
        --fasta database/${params.database.fasta_VEP_gz} \
        --dir_plugins ${VEP_PLUGINS} \
        --assembly \${ASSEMBLY} \
        --custom database/${params.database.clinvar},ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN,CLNDISDB,CLNALLELEID \
        --custom database/${params.database.ExAC},exac03,vcf,exact,0,AF,AC,AC_Het,AC_Hom,AC_AFR,AC_AMR,AC_EAS,AC_FIN,AC_NFE,AC_OTH,AC_SAS,AN_AFR,AN_AMR,AN_EAS,AN_FIN,AN_NFE,AN_OTH,AN_SAS,Het_AFR,Het_AMR,Het_EAS,Het_FIN,Het_NFE,Het_OTH,Het_SAS,Hom_AFR,Hom_AMR,Hom_EAS,Hom_FIN,Hom_NFE,Hom_OTH,Hom_SAS \
        --custom database/${params.database.gnomad_exomes},gnomad_exomes,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,AF_XY,AF_XX,popmax \
        --custom database/${params.database.gnomad_genomes},gnomad_genomes,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_fin,AF_nfe,AF_oth,AF_sas,AF_XY,AF_XX,popmax \
        --custom database/${params.database.ESP6500SI},esp6500siv2,vcf,exact,0,MAF \
        --af_1kg \
        --hgvs --hgvsg --symbol --nearest symbol --distance 1000 \
        --canonical --exclude_predicted \
        --regulatory \
        -i ${norm_vcf} \
        -o ${sample_id}_VEP_output.raw.vcf \
        -vcf --flag_pick_allele_gene \
        --pick_order mane,canonical,appris,tsl,biotype,ccds,rank,length

Here you can see that there are 5 --custom databases. This version of the command, with 5 --custom annotations, worked.

However, in my original command I pass 9 custom databases and 9 custom plugins, which VEP processes concurrently.

Locally, this 9+9 configuration works, but on GCP it doesn't. I am beginning to think it may be due to gcsfuse itself, but I have no idea how to debug this or control it from Nextflow.
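
If each forked VEP worker keeps its own handle open on each of the 18 custom/plugin data files (an assumption on my part, I have not checked how VEP schedules these reads), a 16-CPU run would mean on the order of 16 × 18 = 288 concurrent readers hitting the fused bucket, versus 16 × 5 = 80 for the reduced command above, which would fit the concurrency-limit theory from EDIT #1.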

Has anyone had experience with a situation like this before?


Solution

  • In the end, reducing the number of simultaneous connections to the mounted bucket is what did the trick. Although the node may have any number of CPUs, I found that passing more than 10 CPUs as threads (--fork, in VEP terms) creates too many accesses to the fused bucket and leads to crashes.

    I am not sure how to reproduce this error scientifically, but it happens nearly every time with > 10 CPUs and never with < 10. There may be a hard limit somewhere in the gcsfuse command that Nextflow executes automatically (which I have no control over). A sketch of the cap I use is below.
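
    As a rough sketch, this is how I now cap the fork count inside the process script instead of relying on the node's CPU count. It follows the same escaping convention as the command above (\$ for bash variables, $ for Nextflow variables), and the value 10 is just my empirical threshold, not a documented gcsfuse limit:

        # Never hand VEP more than 10 forks, however many CPUs the node exposes,
        # so the gcsfuse-mounted bucket is not hit by too many concurrent reads.
        FORK=\$(( ${task.cpus} > 10 ? 10 : ${task.cpus} ))

    The command above then uses --fork \${FORK} in place of --fork ${task.cpus}.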