I can't seem to get GATK to recognise the number of available threads. I am running GATK (4.2.4.1) in a conda environment as part of a Nextflow (v20.10.0) pipeline I'm writing. Whatever I do, GATK only ever sees a single thread. I've tried different node types, increasing and decreasing the number of CPUs available, providing Java arguments such as -XX:ActiveProcessorCount=16, and using taskset, but it always detects just 1.
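For reference, those attempts looked roughly like this (the core count of 16 is illustrative; the rest matches the command further down):
# Attempt 1: tell the JVM explicitly how many processors it may use
gatk --java-options "-XX:ActiveProcessorCount=16" HaplotypeCaller \
--tmp-dir tmp/ \
-ERC GVCF \
-R VectorBase-54_AgambiaePEST_Genome.fasta \
-I AE12A_S24_BP.bam \
-O AE12A_S24_BP.vcf
# Attempt 2: pin the process to 16 cores with taskset
taskset -c 0-15 gatk HaplotypeCaller \
--tmp-dir tmp/ \
-ERC GVCF \
-R VectorBase-54_AgambiaePEST_Genome.fasta \
-I AE12A_S24_BP.bam \
-O AE12A_S24_BP.vcf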
Here is the command from the .command.sh:
gatk HaplotypeCaller \
--tmp-dir tmp/ \
-ERC GVCF \
-R VectorBase-54_AgambiaePEST_Genome.fasta \
-I AE12A_S24_BP.bam \
-O AE12A_S24_BP.vcf
And here is the top of the .command.log file:
12:10:00.695 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.695 INFO HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.2.4.1
12:10:00.695 INFO HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
12:10:00.696 INFO HaplotypeCaller - Executing on Linux v4.18.0-193.6.3.el8_2.x86_64 amd64
12:10:00.696 INFO HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.13+7-b1751.21
12:10:00.696 INFO HaplotypeCaller - Start Date/Time: 9 February 2022 at 12:10:00 GMT
12:10:00.696 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.696 INFO HaplotypeCaller - ------------------------------------------------------------
12:10:00.697 INFO HaplotypeCaller - HTSJDK Version: 2.24.1
12:10:00.697 INFO HaplotypeCaller - Picard Version: 2.25.4
12:10:00.697 INFO HaplotypeCaller - Built for Spark Version: 2.4.5
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:10:00.697 INFO HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:10:00.697 INFO HaplotypeCaller - Deflater: IntelDeflater
12:10:00.697 INFO HaplotypeCaller - Inflater: IntelInflater
12:10:00.697 INFO HaplotypeCaller - GCS max retries/reopens: 20
12:10:00.698 INFO HaplotypeCaller - Requester pays: disabled
12:10:00.698 INFO HaplotypeCaller - Initializing engine
12:10:01.126 INFO HaplotypeCaller - Done initializing engine
12:10:01.129 INFO HaplotypeCallerEngine - Tool is in reference confidence mode and the annotation, the following changes will be made to any specified annotations: 'StrandBiasBySample' will be enabled. 'ChromosomeCounts', 'FisherStrand', 'StrandOddsRatio' and 'QualByDepth' annotations have been disabled
12:10:01.143 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
12:10:01.143 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
12:10:01.162 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/anaconda3/envs/NF_GATK/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
12:10:01.169 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/anaconda3/envs/NF_GATK/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
12:10:01.209 INFO IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
12:10:01.210 INFO IntelPairHmm - Available threads: 1
12:10:01.210 INFO IntelPairHmm - Requested threads: 4
12:10:01.210 WARN IntelPairHmm - Using 1 available threads, but 4 were requested
12:10:01.210 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
12:10:01.271 INFO ProgressMeter - Starting traversal
I found a thread on the Broad Institute website suggesting it might be the OMP library, but libgkl_pairhmm_omp.so is seemingly loaded (see the log above), and I'm already using the version they suggested updating to...
Needless to say, this is a little slow. I can always parallelise by using the -L option, but that doesn't solve the problem that every step in the pipeline will be very slow.
Thanks in advance.
In case anyone else has the same problem: it turned out I had to configure the PBS submission as an MPI-style job, explicitly requesting mpiprocs and ompthreads in the resource selection.
So on the HPC I use, here is the Nextflow process:
process DNA_HCG {

    errorStrategy { sleep(Math.pow(2, task.attempt) * 600 as long); return 'retry' }
    maxRetries 3
    maxForks params.HCG_Forks

    tag { SampleID+"-"+chrom }

    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.HCG_threads}:mem=${params.HCG_memory}gb:mpiprocs=1:ompthreads=${params.HCG_threads} -lwalltime=${params.HCG_walltime}:00:00"

    publishDir(
        path: "${params.HCDir}",
        mode: 'copy',
    )

    input:
    each chrom from chromosomes_ch
    set SampleID, path(bam), path(bai) from processed_bams
    path ref_genome
    path ref_dict
    path ref_index

    output:
    tuple chrom, path("${SampleID}_${chrom}.vcf") into HCG_ch
    path("${SampleID}_${chrom}.vcf.idx") into idx_ch

    beforeScript 'module load anaconda3/personal; source activate NF_GATK'

    script:
    """
    mkdir tmp
    n_slots=`expr ${params.GVCF_threads} / 2 - 3`
    if [ \$n_slots -le 0 ]; then n_slots=1; fi
    taskset -c 0-\${n_slots} gatk --java-options \"-Xmx${params.HCG_memory}G -XX:+UseParallelGC -XX:ParallelGCThreads=\${n_slots}\" HaplotypeCaller \\
        --tmp-dir tmp/ \\
        --pair-hmm-implementation AVX_LOGLESS_CACHING_OMP \\
        --native-pair-hmm-threads \${n_slots} \\
        -ERC GVCF \\
        -L ${chrom} \\
        -R ${ref_genome} \\
        -I ${bam} \\
        -O ${SampleID}_${chrom}.vcf ${params.GVCF_args}
    """
}
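For completeness, the params referenced above come from my nextflow.config. A minimal sketch of the block this process assumes looks like the following (the values are placeholders; tune them to your cluster and data):
params {
    HCG_Forks    = 20                         // max concurrent HaplotypeCaller jobs
    HCG_threads  = 16                         // ncpus / ompthreads requested per job
    HCG_memory   = 60                         // memory per job, in GB
    HCG_walltime = 24                         // walltime per job, in hours
    HCDir        = "results/haplotype_calls"  // publishDir target
    GVCF_threads = 16                         // used to derive n_slots in the script block
    GVCF_args    = ""                         // any extra HaplotypeCaller arguments
}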