I have a checkpoint rule that creates files, and each of those files has to be processed by rule BaseCellCounter. I can't find a way to make Snakemake understand that bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam" is created by checkpoint SplitBam; adding the linking rule AggregateSplitBamOutput doesn't help. As a result, the DAG is not built.
all_clones_all_samples = {'Sample_1': ['clone_1'], 'Sample_2': ['clone_1', 'clone_2']}
SAMPLES = ['Sample_1', 'Sample_2']

def get_all_splitbam_file_names(wildcards):
    print('before checkpoint')
    split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
    print('after checkpoint')
    CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
    return expand(f"{split_dir}/{wildcards.scDNA}.{{clone}}.bam", clone=CLONES)

rule all:
    input:
        expand("MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv",
               scDNA=SAMPLES),
    default_target: True

checkpoint SplitBam:
    input:
        bam = f"{DATA}/{{scDNA}}_scDNA.bam",
    output:
        split = directory("SplitBam/{scDNA}")
    shell:
        "mkdir -p {output.split} && python create_some_files.py {input.bam}"

rule AggregateSplitBamOutput:
    input:
        bam=get_all_splitbam_file_names,
    output:
        touch("SplitBam/{scDNA}.split_done.txt")

rule BaseCellCounter:
    input:
        txt="SplitBam/{scDNA}.split_done.txt",
        bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam",
    output:
        tsv="BaseCellCounter/{scDNA}/{scDNA}.{clone}.tsv",

def MergeCountsInput(wildcards):
    # The sample name is filled in by the f-string; {{clone}} is left for expand().
    return expand(f"BaseCellCounter/{wildcards.scDNA}/{wildcards.scDNA}.{{clone}}.tsv",
                  clone=all_clones_all_samples[wildcards.scDNA])

rule MergeCounts:
    input:
        MergeCountsInput,
    output:
        tsv = "MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv"

Building DAG of jobs...
before checkpoint
MissingInputException in rule BaseCellCounter in file SnakeFile.smk, line 96:
Missing input files for rule BaseCellCounter:
output: BaseCellCounter/Sample_1/Sample_1.Clone_1.tsv,
wildcards: scDNA=Sample_1, clone=Clone_1
affected files:
SplitBam/Sample_1/Sample_1.Clone_1.bam
As you can see in this function:
def get_all_splitbam_file_names(wildcards):
    print('before checkpoint')
    split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
    print('after checkpoint')
    CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
    return expand(f"{split_dir}/{wildcards.scDNA}.{{clone}}.bam", clone=CLONES)
the message from print('before checkpoint') is printed, but the one from print('after checkpoint') is not. Am I using checkpoints incorrectly?
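The missing second print is consistent with how checkpoint lookups are deferred: as far as I understand the Snakemake documentation, checkpoints.<name>.get() raises an exception while the checkpoint has not yet run, and Snakemake catches it and re-evaluates the input function later. The plain-Python sketch below is only a toy model of that control flow, not Snakemake's real internals; the names IncompleteCheckpoint, ToyCheckpoint and evaluate are hypothetical.

```python
class IncompleteCheckpoint(Exception):
    """Hypothetical stand-in for the 'checkpoint has not run yet' signal."""


class ToyCheckpoint:
    """Toy checkpoint that only hands out its output once it has finished."""

    def __init__(self):
        self.finished = False

    def get(self):
        if not self.finished:
            raise IncompleteCheckpoint()
        return "SplitBam/Sample_1"


def input_function(checkpoint, log):
    log.append("before checkpoint")
    split_dir = checkpoint.get()  # raises on the first pass
    log.append("after checkpoint")  # only reached once the checkpoint is done
    return split_dir


def evaluate(checkpoint, log):
    """Call the input function; return None if it has to be retried later."""
    try:
        return input_function(checkpoint, log)
    except IncompleteCheckpoint:
        return None


checkpoint = ToyCheckpoint()
log = []
first = evaluate(checkpoint, log)   # DAG-building pass, checkpoint not run yet
checkpoint.finished = True          # pretend the checkpoint job has completed
second = evaluate(checkpoint, log)  # re-evaluation after the checkpoint
print(first, second, log)
```

On the first pass only the "before" message is recorded, which matches the log above. In the real workflow the DAG build then fails for a separate reason: no rule declares the individual .bam files as outputs.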
I was indeed using checkpoints incorrectly. The logic was flawed: I aggregated too early. You should only aggregate at the point where the wildcard first appears in Snakemake's logic. The following logic worked:
SAMPLES = ['Sample_1', 'Sample_2']

def aggregate(wildcards):
    # Look up the checkpoint by the name it is declared with below (SplitBam).
    split_dir = checkpoints.SplitBam.get(**wildcards).output["split"]
    CLONES = glob_wildcards(f'{split_dir}/{wildcards.scDNA}.{{clone}}.bam').clone
    return expand(
        f"BaseCellCounter/{wildcards.scDNA}/{wildcards.scDNA}.{{clone}}.tsv",
        clone=CLONES)

rule all:
    input:
        expand("MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv",
               scDNA=SAMPLES),
    default_target: True

checkpoint SplitBam:
    input:
        bam = f"{DATA}/{{scDNA}}_scDNA.bam",
    output:
        split = directory("SplitBam/{scDNA}")
    shell:
        "mkdir -p {output.split} && python create_some_files.py {input.bam}"

rule BaseCellCounter:
    input:
        bam="SplitBam/{scDNA}/{scDNA}.{clone}.bam",
    output:
        tsv="BaseCellCounter/{scDNA}/{scDNA}.{clone}.tsv",

rule MergeCounts:
    input:
        aggregate
    output:
        tsv = "MergeCounts/{scDNA}.BaseCellCounts.AllCellTypes.tsv"

Also, this way I don't even need all_clones_all_samples, as I don't need to know in advance which clones are present.
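To make the discovery step concrete, here is a rough plain-Python sketch of what the aggregate function does once the checkpoint has finished: scan the split directory for <sample>.<clone>.bam files and build the matching BaseCellCounter targets. It uses re and pathlib instead of Snakemake's glob_wildcards and expand so it can run on its own; discover_clones and counter_targets are hypothetical helper names, and sorting the clones is my own choice for a stable order.

```python
import re
import tempfile
from pathlib import Path


def discover_clones(split_dir, sample):
    """Return the sorted clone names found as <sample>.<clone>.bam in split_dir."""
    pattern = re.compile(rf"^{re.escape(sample)}\.(?P<clone>.+)\.bam$")
    clones = []
    for path in Path(split_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            clones.append(match.group("clone"))
    return sorted(clones)


def counter_targets(sample, clones):
    """Build one BaseCellCounter output path per discovered clone."""
    return [f"BaseCellCounter/{sample}/{sample}.{clone}.tsv" for clone in clones]


# Simulate the directory the checkpoint would have produced for Sample_2.
with tempfile.TemporaryDirectory() as tmp:
    split_dir = Path(tmp) / "SplitBam" / "Sample_2"
    split_dir.mkdir(parents=True)
    for clone in ("clone_1", "clone_2"):
        (split_dir / f"Sample_2.{clone}.bam").touch()
    (split_dir / "Sample_2.log").touch()  # does not match the pattern, so it is ignored

    clones = discover_clones(split_dir, "Sample_2")
    targets = counter_targets("Sample_2", clones)

print(targets)
```

Because the clone list is read from disk after the split, nothing about the clones has to be hard-coded in the Snakefile.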