snakemake

Snakemake rule does not recognize output from other rule


I have written a Snakefile "prepare_tuples.smk" with prepare_tuples as its head rule. Its input is defined as the output of the rule hadd_tuples. When I run snakemake prepare_tuples -F -c20, I get the following error message:

Missing input files for rule prepare_tuples:
    affected files:
        /ceph/users/jmainusch/data/Bs2MuMu/2024_data_stripped.root
        /ceph/users/jmainusch/simulation/Bs2MuMu/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/simulation/Bs2JpsiPhi/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/simulation/Bu2JpsiK/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/data/Bd2MuMu/2024_data_stripped.root
        /ceph/users/jmainusch/data/Bs2JpsiPhi/2024_data_stripped.root
        /ceph/users/jmainusch/data/Bs2KK/2024_data_stripped.root
        /ceph/users/jmainusch/simulation/Bd2MuMu/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/simulation/Bd2KPi/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/data/Bd2KPi/2024_data_stripped.root
        /ceph/users/jmainusch/simulation/Bd2PiPi/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/data/Bu2JpsiK/2024_data_stripped.root
        /ceph/users/jmainusch/simulation/Bs2KK/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/data/Bd2PiPi/2024_data_stripped.root
        /ceph/users/jmainusch/simulation/Bs2KPi/2024_data_stripped_truth-matched.root
        /ceph/users/jmainusch/data/Bs2KPi/2024_data_stripped.root

On a side note: "prepare_tuples.smk" is part of a bigger Snakefile. I don't expect this to be a problem, but mention it for completeness.

From my understanding, Snakemake should recognize that the desired file is produced in "hadd_tuples" and proceed with its production, but it does not.

configfile: "config/config.yaml"

path = config["repopath"]
inpath = config["cephpath"]
outpath = config["userpath"]
dataset = config["dataset"]

configfile: path + "/1_prepare_tuples/configs/tuples.yaml"

wildcard_constraints:
    src = "data|simulation",
    tuple = "^B.*"

rule prepare_tuples:
    input:
        data = expand(outpath + "data/{tuple}/2024_data_stripped.root", tuple = config["tuples"].keys()),
        sim = expand(outpath + "simulation/{tuple}/2024_data_stripped_truth-matched.root", tuple = config["tuples"].keys()),

rule cut:
    input: 
        script = path + "1_prepare_tuples/scripts/cut.py",
        in_data = inpath + "{src}/{tuple}/2024/{magnet}/data.root",
        cuts = path + "1_prepare_tuples/configs/cuts.yaml",
        truth_vars = path + "1_prepare_tuples/configs/truth-match.yaml",
        control_vars = path + "3_control_plots/configs/channels/{tuple}/plots.yaml",
        BDTS_vars = path + "4_BDTS/configs/variables.yaml",
    params:
        eff_path = "results/efficiencies.yaml",
        tuple_config = path + "/1_prepare_tuples/configs/tuples.yaml",
    output: 
        out_data = outpath + "{src}/{tuple}/{magnet}/2024_data_stripped.root",
    shell: 
        """
        python {input.script} \
        --path {input.in_data} \
        --channel {wildcards.tuple} \
        --tuple_config {params.tuple_config} \
        --source {wildcards.src} \
        --magnet {wildcards.magnet} \
        --outpath {output.out_data} \
        --effpath {params.eff_path} \
        --cuts {input.cuts} \
        --truth_vars {input.truth_vars} \
        --control_vars {input.control_vars} \
        --BDTS_vars {input.BDTS_vars} \
        """
        
rule truthmatch:
    input: 
        script = path + "1_prepare_tuples/scripts/truthmatch.py",
        in_data = outpath + "simulation/{tuple}/{magnet}/2024_data_stripped.root",
        cuts = path + "1_prepare_tuples/configs/truth-match.yaml",
        cut_vars = path + "1_prepare_tuples/configs/cuts.yaml",
        control_vars = path + "3_control_plots/configs/channels/{tuple}/plots.yaml",
        BDTS_vars = path + "4_BDTS/configs/variables.yaml",
    params:
        eff_path = "results/efficiencies.yaml",
        tuple_config = path + "/1_prepare_tuples/configs/tuples.yaml",
    output: 
        out_data = outpath + "simulation/{tuple}/{magnet}/2024_data_stripped_truth-matched.root",
    shell: 
        """
        python {input.script} \
        --path {input.in_data} \
        --channel {wildcards.tuple} \
        --tuple_config {params.tuple_config} \
        --magnet {wildcards.magnet} \
        --outpath {output.out_data} \
        --effpath {params.eff_path} \
        --cuts {input.cuts} \
        --cut_vars {input.cut_vars} \
        --control_vars {input.control_vars} \
        --BDTS_vars {input.BDTS_vars} \
        """

rule hadd_tuples:
    input:
        up = outpath + "{src}/{tuple}/MagUp/2024_data_{mod}.root",
        down = outpath + "{src}/{tuple}/MagDown/2024_data_{mod}.root"
    output:
        outpath + "{src}/{tuple}/2024_data_{mod}.root",
    shell:
        "hadd {output} {input.up} {input.down}"

Solution

  • The issue here is that wildcards in Snakemake can match across the whole of the file path, including '/' separators and literal '.' chars, unless you constrain them otherwise. So I think you want:

    wildcard_constraints:
        src   = "data|simulation",
        tuple = "B[^/.]+",
        mod   = "[^/.]+",
    

    (Using \w+ to match a sequence of word characters is often a good option, but \w only covers letters, digits and underscores, not hyphens, so it would not match a {mod} value such as stripped_truth-matched.)

    This should eliminate the "wildcard periodically repeated" error, which is caused by the fact that your rule hadd_tuples is (unintentionally) recursive. For example, when Snakemake tries to make the file:

    /ceph/users/jmainusch/data/Bs2MuMu/MagUp/2024_data_stripped.root
    

    It currently matches that to:

    /ceph/users/jmainusch/{src=data}/{tuple=Bs2MuMu/MagUp}/2024_data_{mod=stripped}.root
    

    Obviously {tuple=Bs2MuMu/MagUp} is nonsense, but without a wildcard constraint Snakemake will make this substitution. That produces a nonsense input, which in turn gets matched to the output of the same rule, and so on, with the {tuple} wildcard absorbing ever more junk until Snakemake gives up.
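
    The matching described above can be reproduced with plain Python regular expressions. This is only a rough sketch: pattern_to_regex is a hypothetical helper written for this illustration, not Snakemake's own code, and it assumes each {wildcard} becomes a named group that defaults to .+ unless a constraint is supplied.

```python
import re

def pattern_to_regex(pattern, constraints=None):
    """Hypothetical helper: turn an output pattern into a regex, using
    '.+' for every wildcard that has no explicit constraint."""
    constraints = constraints or {}
    parts, pos = [], 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        name = m.group(1)
        parts.append(f"(?P<{name}>{constraints.get(name, '.+')})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts) + "$")

HADD_OUTPUT = "/ceph/users/jmainusch/{src}/{tuple}/2024_data_{mod}.root"
PER_MAGNET = "/ceph/users/jmainusch/data/Bs2MuMu/MagUp/2024_data_stripped.root"
MERGED = "/ceph/users/jmainusch/simulation/Bs2MuMu/2024_data_stripped_truth-matched.root"

# Only {src} constrained: {tuple} is free to swallow "Bs2MuMu/MagUp".
loose = pattern_to_regex(HADD_OUTPUT, {"src": "data|simulation"})
# Constraints suggested in this answer.
strict = pattern_to_regex(
    HADD_OUTPUT,
    {"src": "data|simulation", "tuple": "B[^/.]+", "mod": "[^/.]+"},
)

loose_match = loose.match(PER_MAGNET)
strict_match = strict.match(PER_MAGNET)
merged_match = strict.match(MERGED)

print(loose_match.groupdict())
print(strict_match)
print(merged_match.groupdict())
```

    With the stricter constraints the per-magnet file no longer looks like an output of hadd_tuples, while the merged file still does, with the hyphenated {mod} value intact.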

    One other point... consider setting:

    workdir: config["userpath"]
    

    rather than adding outpath + to all your inputs and outputs. For one thing, this makes it much easier to test a rule like hadd_tuples in isolation, as well as making the code more legible.

    Edited to add note:

    From memory, I think using ^ and $ anchors in your wildcard constraints doesn't work because they only match at the beginning/end of the entire filename, not at the start and end of the wildcard. But don't quote me on that; I've not tested it!
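
    The anchor behaviour can at least be checked in plain Python. The sketch below assumes, without having verified it against Snakemake itself, that the constraint text is pasted verbatim into a named group inside one regex covering the whole path; build is a hypothetical helper for this illustration only.

```python
import re

def build(tuple_constraint):
    """Hypothetical helper: embed a {tuple} constraint in a regex for the
    full hadd_tuples output path (an assumption, not Snakemake's code)."""
    return re.compile(
        r"/ceph/users/jmainusch/(?P<src>data|simulation)/"
        rf"(?P<tuple>{tuple_constraint})/2024_data_(?P<mod>[^/.]+)\.root$"
    )

MERGED = "/ceph/users/jmainusch/data/Bs2MuMu/2024_data_stripped.root"

# "^" can only succeed at position 0 of the whole string, so a constraint
# that starts with it can never match in the middle of a path.
anchored = build(r"^B.*").match(MERGED)
unanchored = build(r"B[^/.]+").match(MERGED)

print(anchored)
print(unanchored.group("tuple"))
```

    If Snakemake does build its regex this way, an anchored constraint such as "^B.*" would stop every rule output from matching, which would be consistent with the "Missing input files" message in the question.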