snakemake

Snakemake: adapting a pipeline to version 8.18.1 (example of using checkpoints inside an input function)


For a task, I need to implement the logic shown in the script below (script source). In the current version, however, the 'directory' flag is only valid for outputs, not inputs.

How can this be done in the current version?

OUTDIR = "first_directory"
SNDDIR = "second_directory"
THRDIR = "third_directory"


def combine(wildcards):
    # read the first set of outputs
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    FIRSTS, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    # read the second set of outputs
    sn_output = checkpoints.make_more_files.get(**wildcards).output[0]
    SECONDS, = glob_wildcards(os.path.join(sn_output, "{smpl}.txt"))
    return expand(os.path.join(THRDIR, "{first}.{second}.tsv"), first=FIRSTS, second=SECONDS)

rule all:
    input: 
        combine

checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

checkpoint make_more_files:
    output:
        directory(SNDDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

rule make_third_files:
    input:
        directory(OUTDIR),
        directory(SNDDIR),
    output:
        os.path.join(THRDIR, "{first}.{second}.tsv")
    shell:
        """
        touch {output}
        """

Solution: link


Solution

  • You can still use the paths to the directories in the input directive, just without the special directory() flag. That flag is meant for outputs: it signals that it is okay to delete the directory at the start of the run, and that changes in how the directory is displayed should not be treated as a change that requires re-running the rules, as discussed in the documentation here.
    This is the modified version that worked for me with Snakemake version 8.18.1.

    import os  # made explicit because os.path.join is used below
    
    OUTDIR = "first_directory"
    SNDDIR = "second_directory"
    THRDIR = "third_directory"
    
    
    def combine(wildcards):
        # read the first set of outputs
        ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
        FIRSTS, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
        # read the second set of outputs
        sn_output = checkpoints.make_more_files.get(**wildcards).output[0]
        SECONDS, = glob_wildcards(os.path.join(sn_output, "{smpl}.txt"))
        return expand(os.path.join(THRDIR, "{first}.{second}.tsv"), first=FIRSTS, second=SECONDS)
    
    rule all:
        input: 
            OUTDIR,
            SNDDIR,
            combine
    
    checkpoint make_some_files:
        output:
            directory(OUTDIR)
        shell:
            """
            mkdir {output};
            N=$(((RANDOM%5)+1));
            for D in $(seq $N); do
                touch {output}/$RANDOM.txt
            done
            """
    
    checkpoint make_more_files:
        output:
            directory(SNDDIR)
        shell:
            """
            mkdir {output};
            N=$(((RANDOM%5)+1));
            for D in $(seq $N); do
                touch {output}/$RANDOM.txt
            done
            """
    
    rule make_third_files:
        input:
            OUTDIR,
            SNDDIR,
        output:
            os.path.join(THRDIR, "{first}.{second}.tsv")
        shell:
            """
            touch {output}
            """
    

    Besides dropping directory() from the input of make_third_files, the other change is that I added the two checkpoint output directories (OUTDIR and SNDDIR) to the input of the main rule, rule all.

    You can see the setup I used to work this out in a Jupyter session, along with the result, here.
    You can also run it yourself without touching your system, in a temporary Jupyter session served by MyBinder.org: go there, follow the guide at the top to launch a session, then upload that notebook and run all the cells.
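
As a rough illustration of what the combine input function computes once both checkpoints have finished, here is a plain-Python sketch that does not use Snakemake at all. The helpers list_stems and combine_targets are hypothetical names introduced only for this example; they approximate glob_wildcards (which can also match nested paths) and the default all-combinations behaviour of expand, and the file names in the demo are made up.

```python
import itertools
import tempfile
from pathlib import Path


def list_stems(directory):
    """Return the sorted base names (without .txt) of the .txt files in a directory.

    Hypothetical stand-in for glob_wildcards(os.path.join(directory, "{sample}.txt")),
    restricted to files directly inside the directory.
    """
    return sorted(p.stem for p in Path(directory).glob("*.txt"))


def combine_targets(first_dir, second_dir, third_dir):
    """Build one target path per (first, second) pair.

    Hypothetical stand-in for expand(..., first=FIRSTS, second=SECONDS),
    which by default produces every combination of its arguments.
    """
    firsts = list_stems(first_dir)
    seconds = list_stems(second_dir)
    return [
        str(Path(third_dir) / f"{first}.{second}.tsv")
        for first, second in itertools.product(firsts, seconds)
    ]


# Demo with throwaway directories standing in for the two checkpoint outputs.
with tempfile.TemporaryDirectory() as tmp:
    first_dir = Path(tmp) / "first_directory"
    second_dir = Path(tmp) / "second_directory"
    first_dir.mkdir()
    second_dir.mkdir()
    for name in ("11", "22"):
        (first_dir / f"{name}.txt").touch()
    for name in ("7", "8", "9"):
        (second_dir / f"{name}.txt").touch()
    targets = combine_targets(first_dir, second_dir, "third_directory")

# Two files in the first directory and three in the second give six targets.
print(len(targets))
for target in targets:
    print(target)
```

The point of the sketch is that the list of targets depends on which files exist in both directories, which is why the input function has to wait for both checkpoints before rule all can be resolved.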