I am new to nextflow and am trying to build a pipeline. I am having trouble with passing outputs into new processes, properly capturing outputs, and with BWA-MEM2 index designation.
Here is my script:
#!/usr/bin/env nextflow
// Declare syntax version
nextflow.enable.dsl = 2
// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"
workflow {
read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true)
trimAndQC(read_pairs_ch)
readMapping(params.refgenome, trimAndQC.out.trimmedreads)
}
process trimAndQC {
debug true
tag "$sample_id"
publishDir 'trimmed/', mode: 'copy', overwrite: false
input:
tuple val(sample_id), path(fastq)
output:
tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads
shell:
"""
trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
"""
}
// BWA read mapping
process readMapping {
debug true
tag "$sample_id"
publishDir 'bwa_aligned/', mode: 'copy', overwrite: false
input:
path(refgenome)
tuple val(sample_id), val(trimmedreads)
output:
path('${sample_id}_sorted.bam'), emit: sorted_bam
script:
def indexbase = refgenome.baseName
"""
bwa-mem2 mem -t 2 '${indexbase}' '${trimmedreads}' | samtools sort -@2 -o '${sample_id}_sorted.bam' -
"""
}
First, I am getting this warning:
WARN: Input tuple does not match input set cardinality declared by process `readMapping` -- offending value: [sample_id, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_2_val_2.fq.gz]
Which is confusing as my output is a tuple and thus should match the cardinality of the output of trimAndQC
. As an aside, I have noticed most scripts passing the input tuple (in this case path(fastq)
) as a single argument, but to get the command to work I have to use the .get()
accession method to 'unpack' the tuple, so I am unsure what I am doing wrong there.
Second, I am getting this error in regards to the output of readMapping
process:
ERROR ~ Error executing process > 'readMapping (sample_id)'
Caused by:
Missing output file(s) `${sample_id}_sorted.bam` expected by process `readMapping (sample_id)`
Command executed:
bwa-mem2 mem -t 2 'hg38' '/Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz' | samtools sort -@2 -o sample_id_sorted.bam -
Command exit status:
0
Command output:
(empty)
I assume this may be due to the cardinality warning above and sample_id
is empty.
Finally, I cannot seem to designate the index properly to the BWA-MEM2 command. The error:
Command error:
Looking to launch executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42", simd = .sse42
Launching executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42"
-----------------------------
Executing in SSE4.2 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: hg38
* Entering FMI_search
ERROR! Unable to open the file: hg38.bwt.2bit.64
Work dir:
/Users/me/Desktop/fastq/work/23/f8b3b6eace6f11f8619ce435b16725
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
In my RefGenome directory I have hg38.fa indexed on an AWS EC2 and downloaded to locally to prototype the pipeline. The files look have these extensions:
hg38.fa.{0123,amb,ann,pac,bwt.2bit.64}
and appear to work fine if I just run the bwa-mem2 mem
command in my terminal.
I have tried to adapt multiple SO answers (e.g., how to chain Channel.fromFilePairs from one process to another process, nextflow: how to pass directory path with files created in process before), I have gone through the EMBL training (Nextflow), looked through pipelines on nf-core and github, and have spent time with nextflow documentation prior to asking this question. I seem to be misunderstanding some inherent principle of nextflow so any advice on what to focus would be appreciated.
Finally, for clarity and reproducibility I am using the following fastqs from NCBI SRA (SRX317818) for prototyping. These are the versions of nextflow, trim-galore, and bwa-mem2 I am using all via an Anconda environment:
bwa-mem2 2.2.1 hdf58011_5 bioconda
nextflow 23.10.1 hdfd78af_0 bioconda
trim-galore 0.6.10 hdfd78af_0 bioconda
I have crossposted this question on Biostars (Nextflow - I/O and index issues). Based on this post, it appears crossposting is tolerated to a degree (Is cross-posting a question...). I can remove the post if necessary.
Thanks to @dthorbur at Biostars for his help with this problem.
I have finally completely resolved all my problems and want to document here in case anyone else runs into such issues in the future.
First problem was issues with cardinality. As @dthorbur pointed out, my issue was not keeping the same tuple structure between processes. The output from the upstream process:
output:
tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz')
Whereas the input into the downstream process:
input:
tuple val(sample_id), val(trimmedreads)
Resulting in an easy fix:
input:
tuple val(sample_id), path(read1), path(read2)
Next issue I was having was ingesting the index for BWA into my working environment. I found the most straightforward and simple method was to designate the index directory and fasta file separately like this:
// Set paths for index directory and the reference fasta
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
------------------------
...
// Designate the index within the processes like this
script:
"""
bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 -o "${sample_id}_sorted.bam" -
"""
------------------------
// Define the channels and pass them to the process
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)
...
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
My final problem was getting the bwa_align process to run properly. My original error was related to no output file, which I realized was an issue with single quotes around the output file and within the script block:
output:
path('${sample_id}_sorted.bam')
which Nextflow wasn't interpreting properly. Once I switched to double quotes all was fine. So I am tempted to strictly use double quotes when writing Nextflow scripts.
Here is my final script that finally worked:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
// Define parameters
params.fastq = "$projectDir/fastq/*_{1,2}*"
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
workflow {
fastq_ch = Channel.fromFilePairs(params.fastq)
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)
trim_QC(fastq_ch)
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
}
process trim_QC {
debug true
tag "$sample_id"
publishDir 'trimmed/', mode: 'copy', overwrite: false
input:
tuple val(sample_id), path(fastq)
output:
tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads
script:
def (read1, read2) = fastq
"""
trim_galore --paired --fastqc ${read1} ${read2}
"""
}
process bwa_align {
debug true
tag "$sample_id"
publishDir 'bwa_aligned/', mode: 'copy', overwrite: false
input:
path index_dir
val ref
tuple val(sample_id), path(read1), path(read2)
output:
path("${sample_id}_sorted.bam"), emit: sorted_bam
script:
"""
bwa mem -t 2 ${index_dir}/${ref} ${read1} ${read2} | samtools sort -@2 o "${sample_id}_sorted.bam" -
"""
}