snakemake

Snakemake checkpoint is not evaluated


I have a checkpoint rule that creates files, and each of those files has to be processed by rule BaseCellCounter. I can't figure out how to make Snakemake understand that bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam" is created by checkpoint SplitBam; linking through rule AggregateSplitBamOutput doesn't help. As a result, the DAG is not built.

all_clones_all_samples = {'Sample_1': ['clone_1'], 'Sample_2': ['clone_1', 'clone_2']}
SAMPLES = ['Sample_1', 'Sample_2']

def get_all_splitbam_file_names(wildcards):
    print('before checkpoint')
    split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
    print('after checkpoint')
    CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
    return expand(f"{split_dir}/{wildcards.scDNA}.{{clone}}.bam",clone=CLONES)

rule all:
    input:
        expand("MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv", 
        scDNA = SAMPLES),
    default_target: True

checkpoint SplitBam:
    input:
        bam = f"{DATA}/{{scDNA}}_scDNA.bam",
    output:
        split = directory("SplitBam/{scDNA}")
    shell:
        "mkdir -p {output.split} && python create_some_files.py {input.bam}"

rule AggregateSplitBamOutput:
    input:
        bam=get_all_splitbam_file_names,
    output:
        touch("SplitBam/{scDNA}.split_done.txt")

rule BaseCellCounter:
    input:
        txt="SplitBam/{scDNA}.split_done.txt",
        bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam",
    output:
        tsv="BaseCellCounter/{scDNA}/{scDNA}.{clone}.tsv",

def MergeCountsInput(wildcards):
    return expand(f"BaseCellCounter/{wildcards.scDNA}/{wildcards.scDNA}.{{clone}}.tsv",
            clone=all_clones_all_samples[wildcards.scDNA])

rule MergeCounts:
    input:
        MergeCountsInput,
    output:
        tsv = "MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv"


Building DAG of jobs...
before checkpoint
MissingInputException in rule BaseCellCounter in file SnakeFile.smk, line 96:
Missing input files for rule BaseCellCounter:
    output: BaseCellCounter/Sample1/Sample_1.Clone_1.tsv,
    wildcards: scDNA=Sample_1, clone=Clone_1
    affected files:
        SplitBam/Sample_1/Sample_1.Clone_1.bam

As you can see in:

def get_all_splitbam_file_names(wildcards):
    print('before checkpoint')
    split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
    print('after checkpoint')
    CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
    return expand(f"{split_dir}/{wildcards.scDNA}.{{clone}}.bam",clone=CLONES)

the print('before checkpoint') statement is executed, but print('after checkpoint') never is. Am I using checkpoints incorrectly?


Solution

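  • I was indeed using checkpoints incorrectly. The logic was flawed: I aggregated too early. You should only aggregate (that is, call checkpoints.SplitBam.get() in an input function) in the rule where the checkpoint-dependent wildcard, here {clone}, first appears as Snakemake works backwards from the target. In this workflow that rule is MergeCounts. The following logic worked: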

    SAMPLES = ['Sample_1', 'Sample_2']
    
    def aggregate(wildcards):
        split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
        CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
        return expand(
            f"BaseCellCounter/{wildcards.scDNA}/{wildcards.scDNA}.{{clone}}.tsv",
            clone=CLONES)
    
    rule all:
        input:
            expand("MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv", 
            scDNA = SAMPLES),
        default_target: True
    
    checkpoint SplitBam:
        input:
            bam = f"{DATA}/{{scDNA}}_scDNA.bam",
        output:
            split = directory("SplitBam/{scDNA}")
        shell:
            "mkdir -p {output.split} && python create_some_files.py {input.bam}"
    
    rule BaseCellCounter:
        input:
            bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam",
        output:
            tsv="BaseCellCounter/{scDNA}/{scDNA}.{clone}.tsv",
    
    rule MergeCounts:
        input:
            aggregate
        output:
            tsv = "MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv"
    

    Also, this way I don't even need all_clones_all_samples, since I don't need to know in advance which clones are present.
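
To make the behaviour of the aggregate input function easier to follow, here is a minimal plain-Python sketch that mimics it without importing Snakemake. Everything in it is an illustrative assumption rather than Snakemake's real implementation: CheckpointNotFinished, get_split_dir and discover_clones are hypothetical stand-ins for the exception that checkpoints.SplitBam.get() raises while the checkpoint has not finished, for the .output["split"] lookup, and for glob_wildcards, and the extra root argument only exists so the sketch can run in a temporary directory. It also suggests why 'after checkpoint' was never printed in the question: the line after get() is not reached until the checkpoint output exists.

```python
import re
import tempfile
from pathlib import Path


class CheckpointNotFinished(Exception):
    """Hypothetical stand-in for the exception raised by checkpoints.<name>.get()."""


def get_split_dir(root, scDNA):
    # Stand-in for checkpoints.SplitBam.get(scDNA=scDNA).output["split"]:
    # it fails until the checkpoint's output directory exists.
    split_dir = Path(root) / "SplitBam" / scDNA
    if not split_dir.is_dir():
        raise CheckpointNotFinished(scDNA)
    return split_dir


def discover_clones(split_dir, scDNA):
    # Stand-in for glob_wildcards(f"{split_dir}/{scDNA}.{{clone}}.bam").clone,
    # restricted to files directly inside split_dir.
    pattern = re.compile(rf"^{re.escape(scDNA)}\.(?P<clone>.+)\.bam$")
    clones = []
    for path in split_dir.iterdir():
        match = pattern.match(path.name)
        if match:
            clones.append(match.group("clone"))
    return sorted(clones)


def aggregate(root, scDNA):
    # Mirrors the input function of MergeCounts: ask for the checkpoint output,
    # discover the clones on disk, then request one BaseCellCounter file per clone.
    split_dir = get_split_dir(root, scDNA)
    clones = discover_clones(split_dir, scDNA)
    return [f"BaseCellCounter/{scDNA}/{scDNA}.{clone}.tsv" for clone in clones]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        try:
            aggregate(root, "Sample_2")
        except CheckpointNotFinished:
            print("checkpoint not finished yet, evaluation is deferred")

        # Simulate what the SplitBam checkpoint would write to disk.
        split_dir = Path(root) / "SplitBam" / "Sample_2"
        split_dir.mkdir(parents=True)
        for clone in ("clone_1", "clone_2"):
            (split_dir / f"Sample_2.{clone}.bam").touch()

        print(aggregate(root, "Sample_2"))
```

The point of the design is that the list of clones is read from disk only after the checkpoint has produced its directory, so the files requested from BaseCellCounter always correspond to BAM files that actually exist.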