awk snakemake

Awk return success code on exit to snakemake


I'm using an awk script to process a text file. When it encounters a certain string, it exits.

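{
    # Header lines start with ">"
    if ($1 ~ /^>/){
        # Stop at the first header whose first field is six or more
        # characters long (">" included) or starts with ">MT"
        if (($1 ~ /.{5}.+/) || ($1 ~ /^>MT/)) {
            exit
        } else {
            # Rewrite the header as ">chr<name>_<genome>" followed by fields 3 to 5
            print ">chr"substr($1,2)"_"genome, $3, $4, $5
        }
    } else {
        # Sequence lines pass through unchanged
        print
    }
}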

The script works perfectly fine and does what I want when running it from the terminal (using my snakemake micromamba env, so should be same awk version), but when I run it using snakemake, I get this message:

Error in rule filter_raw_genome:
jobid: 2
input: DATA/GENOMES/RAW/homo_sapiens/GRCh38.fa.gz
output: DATA/GENOMES/RAW/homo_sapiens/GRCh38.fa
log: snakemake_logs/filter_raw_genome/homo_sapiens_GRCh38.log (check log file(s) for error details)
shell:
gunzip -c DATA/GENOMES/RAW/homo_sapiens/GRCh38.fa.gz | awk -f SCRIPTS/chrom_filer_spike.awk -v genome=GRCh38 > DATA/GENOMES/RAW/homo_sapiens/GRCh38.fa
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

The GNU awk manual states "If an argument is supplied to exit, its value is used as the exit status code for the awk process. If no argument is supplied, exit causes awk to return a “success” status.", so I'm not sure where the error code snakemake is reading is coming from.

Is there any way for awk to break its processing loop early without generating an error code?

EDIT: log file is empty. If there's a way to make the script more explicit and check if that's indeed what's making it fail, I can try that out.


Solution

  • This was answered in the comments above but I'll summarise for the benefit of anyone reading this later.

    The error is most commonly seen with a command like this:

    gunzip -c {input} | head -n 4
    

    In this case a custom awk script is playing the part of head, but the effect is the same. Because the second command in the pipe exits before consuming all the lines from {input}, the gunzip command may be killed by SIGPIPE on its next write to the closed pipe, and so report a non-zero status (typically 141, i.e. 128 + 13, where 13 is the signal number of SIGPIPE). Normally this has no effect but, as the Snakemake error says, "snakemake uses bash strict mode!".

    "Bash strict mode" is set -euo pipefail. With pipefail, a pipeline reports failure if any command in it exits non-zero, not just the last one, and (due to -e) that failure immediately aborts the shell. Annoyingly, there is no error message and Snakemake does not report the exit status (i.e. 141), so it's not obvious what is wrong.
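The failure can be reproduced outside Snakemake. What follows is a minimal sketch, assuming bash, seq and awk are on the PATH. Here seq stands in for gunzip and an inline awk program with an early exit stands in for the script in the question; both are illustrative substitutions, not the original rule.

```shell
# seq stands in for gunzip and writes far more than a pipe buffer can hold;
# the inline awk program stands in for the real script and exits at line 3.

# Without pipefail: report the exit status of each stage via PIPESTATUS.
statuses=$(bash -c 'seq 1 1000000 2>/dev/null | awk "NR == 3 { exit } { print }" >/dev/null; echo "${PIPESTATUS[*]}"')
producer_status=${statuses% *}   # status of seq (typically 141 = 128 + SIGPIPE)
consumer_status=${statuses#* }   # status of awk (0: a bare exit reports success)

# With pipefail: the pipeline as a whole inherits the producer's failure.
pipefail_status=$(bash -c 'set -o pipefail; seq 1 1000000 2>/dev/null | awk "NR == 3 { exit } { print }" >/dev/null; echo $?')

echo "seq: $producer_status, awk: $consumer_status, pipeline under pipefail: $pipefail_status"
```

The point of the sketch is that awk itself reports success, as the GNU awk manual says; the non-zero status comes from the upstream command.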

    There are several possible fixes:

    1. Alter the awk script to consume all the input rather than having the early exit
    2. Use an awkward construct like (gunzip -c {input} || true) | head -n 4
    3. Disable pipefail for this shell code by preceding the command with set +o pipefail ;
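The three fixes can be compared side by side. This is a sketch under the same assumptions as the commands above: seq stands in for gunzip, a small inline awk program stands in for the real script, and the helper run_strict and the awk variable stop are hypothetical names introduced only for illustration.

```shell
# Run a snippet in a fresh bash with -euo pipefail, discarding its output.
run_strict() {
    bash -euo pipefail -c "$1" >/dev/null 2>&1
}

# Baseline: the early exit makes the producer (seq, standing in for gunzip) fail.
baseline=0
run_strict 'seq 1 1000000 | awk "NR == 3 { exit } { print }"' || baseline=$?

# Fix 1: set a flag instead of exiting, so awk drains the rest of the input.
fix1=0
run_strict 'seq 1 1000000 | awk "stop { next } NR == 3 { stop = 1; next } { print }"' || fix1=$?

# Fix 2: swallow the producer's failure inside a subshell.
fix2=0
run_strict '(seq 1 1000000 || true) | awk "NR == 3 { exit } { print }"' || fix2=$?

# Fix 3: switch pipefail off before running the pipeline.
fix3=0
run_strict 'set +o pipefail; seq 1 1000000 | awk "NR == 3 { exit } { print }"' || fix3=$?

echo "baseline=$baseline fix1=$fix1 fix2=$fix2 fix3=$fix3"
```

Fix 1 produces the same output as the early exit but still reads the whole input, which may be slow for a large genome. Fixes 2 and 3 keep the early exit and only change how the upstream failure is reported.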

    In this case, the last option is the neatest. It might mask an error from gunzip if the input file were ever corrupted, but this can be checked very easily (for example, by running gzip -t on the input).
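One way to do that check is gzip -t, which tests a compressed file's integrity without writing any output. The sketch below assumes gzip, seq, head and mktemp are available. The file names are made up, and truncating the file to 1000 bytes is just a convenient way to simulate corruption.

```shell
tmpdir=$(mktemp -d)

# Build a valid archive, then a damaged copy containing only its first 1000 bytes.
seq 1 100000 | gzip -c > "$tmpdir/good.gz"
head -c 1000 "$tmpdir/good.gz" > "$tmpdir/truncated.gz"

# gzip -t exits with status 0 for an intact file and non-zero for a damaged one.
good_status=0
gzip -t "$tmpdir/good.gz" 2>/dev/null || good_status=$?
bad_status=0
gzip -t "$tmpdir/truncated.gz" 2>/dev/null || bad_status=$?

echo "intact file: $good_status, truncated file: $bad_status"
rm -rf "$tmpdir"
```

In a Snakemake rule, a similar test could run on {input} before set +o pipefail, so a damaged download still fails the job. That placement is a suggestion and not part of the original answer.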

    Other comments suggested that there might be further problems with this particular awk script, but the above is the answer to the original question as confirmed by the author.