regexawksnakemakegtfsgff

Bash code with if/else/fi and awk/perl regexps does work outside snakemake but not inside a snakemake rule


I am constructing a snakemake workflow that is working well for the most part. One of the steps requires a ready.gtf file containing certain fields of information in its last column, specifically gene_name and transcript_name.

I have thought of some bash code that can help the pipeline detect if this information is already present, and decide what to do based on this. In case the field 9 contains that information (i.e. some string in the field 9 matches the regexp (gene|transcript)_name ), then nothing needs to be done to the gtf (except making sure there is no = signs in the file) and it just replaces the = signs for spaces ' ' and outputs this to a ready.gtf file. If this information is not present (i.e. there is no match to the (gene|transcript)_name in the field9), then it parses column 9 using perl regexps and, using paste, it appends this to the input gtf and creates a ready.gtf. For clarity, the expression /.*?(gene|transcript)_id=([^;]+).*?/ aims at identifying and capture anything that falls within the gene_id or trancript_id field(s) (capture group 1), store the corresponding value (which can be any character except the field separator ";") in a capture group (capture group 2), and later "dump" it using /;\1_name=\2;/ (flanking semicolons are intended). Thus effectively detecting and changing the gene_id and transcript_id for gene_name and transcript name. Appending them at the end is also intended, since the ready.gtf needs BOTH the gene/transcript_id and the gene/transcript_name fields.

The bash code looks like below. I am able to run this well outside snakemake, meaning that it detects gtf files that have and do not have the required info on field9, and it decides accordingly on what to do.

field9=`awk ' $3 == "transcript" ' metadata/parsed.gtf | head -n1 | cut -f9`

pattern='(gene|transcript)_name'

if [[ $field9 =~ $pattern ]]; then
    echo "Detected gene/transcript_name in field9, no need for preparing the gtf"
    perl -pe "s/=/ /g" metadata/parsed.gtf > metadata/ready.gtf

else
    echo "gene_name and/or transcript_name not detected in field9"
    cut -f9 metadata/parsed.gtf | perl -pe 's/.*?(gene|transcript)_id=([^;]+).*?/;\1_name=\2;/g' | perl -pe 's/;;+/;/g' > lane9
    paste -d ' ' metadata/parsed.gtf lane9 | perl -pe "s/=/ /g" > metadata/ready.gtf

And when inside the snakemake rule:

rule preparegtf:
    input:"metadata/{organism}_parsed.gtf"
    output:"metadata/{organism}_ready.gtf"
    params:
        awkexpr = r""" $3 == "transcript" """,
        perlexpr1 = r"""s/.*?(gene|transcript)_id=([^;]+).*?/;\1_name=\2;/g""",
        perlexpr2 = r"""s/;;+/;/g"""
    shell:
        '''
        field9=`awk {params.awkexpr:q} {input} | head -n1 | cut -f9`
        pattern='(gene|transcript)_name'
        if [[ $field9 =~ $pattern ]]; then
            echo "Detected gene_name and/or transcript_name in field9, no need for preparing the gtf"
            perl -pe "s/=/ /g" {input} > {output}
        else
            echo "gene_name and/or transcript_name not detected in field9"
            cut -f9 {input} | \
            perl -pe {params.perlexpr1:q} | perl -pe {params.perlexpr2:q} > lane9
            paste -d ' ' {input} lane9 | perl -pe "s/=/ /g" > {output}
        fi
        '''

I have made sure that the workflow works well regardless of this rule. When I provide the ready.gtf in the expected path, snakemake is able to see it and it does not run the rule above, instead following up with the next ones, and succesfully finishing the pipeline. Likewise the --dry-run works well. Hence I do NOT get any InputNotFound or OutputNotFound errors. Note I also took care, to the best of my knowledge, of having the regexps in the r"""REGEXP""" format to help proper quotation. I am not sure the error might be coming from here.

The snakemake run fails with an undisclosed error provided it runs on strict mode.

Below the log of snakemake:

(snakemake_venv) alberto@nostromo:~/projects/dev/pipe/scbe_pipe$ snakemake --cores 8 --use-conda --rerun-incomplete
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job stats:
job                     count    min threads    max threads
--------------------  -------  -------------  -------------
all                         1              1              1
collect_metadata            1              1              1
digitalexpression           1              1              1
finalstep                   1              1              1
makeintervals               1              1              1
makerefflat                 1              1              1
mergebamalignment           1              1              1
preparegtf                  1              1              1
reducegtf                   1              1              1
sortsam                     1              1              1
star                        1              1              1
starindex                   1              1              1
tagbam_genefunction         1              1              1
tagbam_geneintervals        1              1              1
total                      14              1              1

Select jobs to execute...

[Wed Mar  8 11:14:19 2023]
rule preparegtf:
    input: metadata/testorganism_parsed.gtf
    output: metadata/testorganism_ready.gtf
    jobid: 27
    wildcards: organism=testorganism
    resources: tmpdir=~/tmp

[Wed Mar  8 11:14:19 2023]
Error in rule preparegtf:
    jobid: 27
    output: metadata/testorganism_ready.gtf
    shell:
        
        field9=`awk ' $3 == "transcript" ' metadata/testorganism_parsed.gtf | head -n1 | cut -f9`
        pattern='(gene|transcript)_name'
        if [[ $field9 =~ $pattern ]]; then
            echo "Detected gene_name and/or transcript_name in field9, no need for preparing the gtf"
            perl -pe "s/=/ /g" metadata/testorganism_parsed.gtf > metadata/testorganism_ready.gtf
        else
            echo "gene_name and/or transcript_name not detected in field9"
            cut -f9 metadata/testorganism_parsed.gtf |             perl -pe 's/.*?(gene|transcript)_id=([^;]+).*?/;\1_name=\2;/g' | perl -pe 's/;;+/;/g' > lane9
            paste -d ' ' metadata/testorganism_parsed.gtf lane9 | perl -pe "s/=/ /g" > metadata/testorganism_ready.gtf
        fi
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: ~/projects/dev/pipe/scbe_pipe/.snakemake/log/2023-03-08T111403.605948.snakemake.log

To be honest I am a bit at a loss. Since the bash code works outside snakemake and not inside, I am unable to debug as I would normally do. I would appreciate some feedback and the experience of a more trained eye. I can provide a minimal gtf as an example if that helps. Thank you in advance.

Things I have tried:

After asking ChatGPT, it occurred to me to change the perl expressions for awk, since I am more familiar with quotation for awk, to change echo for printf calls, and to change the backticks for $() format. See below. Snakemake however still fails with an identical error.

rule preparegtf:
    input:"metadata/{organism}_parsed.gtf"
    output:"metadata/{organism}_ready.gtf"
    params:
        awkexpr1 = r""" $3 == "transcript" """,
        awkexpr2 = r"""{{print $9}}""",
        awkexpr3 = r"""{{gsub(/.*?(gene|transcript)_id=([^;]+).*?/;\1_name=\2;"); print $0}}""",
        awkexpr4 = r"""{{gsub(/;;+/,/;/); print $0}}""",
        awkexpr5 = r"""{{gsub(/=/," "); print $0}}"""
    shell:
        '''
        field9=$(awk {params.awkexpr1:q} {input} | awk {params.awkexpr2:q} | head -n1)
        pattern='(gene|transcript)_name'
        if [[ $field9 =~ $pattern ]]; then
            printf "Detected gene_name and/or transcript_name in field9, no need for preparing the gtf\n"
            perl -pe "s/=/ /g" {input} > {output}
        else
            printf "gene_name and/or transcript_name not detected in field9\n"
            awk {params.awkexpr3:q} | \
            perl -pe {params.awkexpr4:q} > lane9
            paste -d ' ' temp1.gtf lane9 | awk {params.awkexpr5:q} > {output}
        fi
        '''

Solution

  • I think the problem is here:

    field9=$(awk {params.awkexpr1:q} {input} | awk {params.awkexpr2:q} | head -n1)
    

    head terminates the pipe before all the input has been read and this causes a SIGPIPE error. Because snakemake runs in strict mode it fails while on the terminal it succeeds, or at least you don't see the error.

    Instead of head -n1 try awk 'NR == 1'. This will read the whole input and do nothing except print the first line and avoid to break the pipe. Besides, it's quite tricky that head will break the pipe only if the input is big enough which means you may see it failing in some cases but not others.