snakemake

How Do I Combine Parameters Conditionally?


Suppose I have the following input FASTQ files:

 $ ls  -1 inputs/ 
CM0009619-ABB_L01_Read1_Sample_Library_XY102.fastq.gz
CM0009619-ABB_L01_Read1_Sample_Library_XY110.fastq.gz
CM0009619-ABB_L01_Read1_Sample_Library_XY84.fastq.gz
CM0009619-ABB_L01_Read2_Sample_Library_XY102.fastq.gz
CM0009619-ABB_L01_Read2_Sample_Library_XY110.fastq.gz
CM0009619-ABB_L01_Read2_Sample_Library_XY84.fastq.gz
CM0009619-ABB_L02_Read1_Sample_Library_XY84.fastq.gz
CM0009619-ABB_L02_Read1_Sample_Library_XY88.fastq.gz
CM0009619-ABB_L02_Read2_Sample_Library_XY84.fastq.gz
CM0009619-ABB_L02_Read2_Sample_Library_XY88.fastq.gz

There are three variables: the lane identifier (L0*), read identifier (Read*) and the sample identifier (XY*).

All samples have both Read1 and Read2. Most samples belong to just 1 lane, but they can belong to both. Therefore, the combination of {sample} and {lane} must be conditional.

I'm trying to figure this out using a simple test workflow but the solution has eluded me for many hours

SAMPLES=( "XY84", "XY88", "XY102", "XY110" )

$ snakemake --snakefile Snakefile --cores 1
Building DAG of jobs...
MissingInputException in line 10 of /home/UNIXHOME/sholt/smktests/Snakefile:
Missing input files for rule copy_files:
inputs/CM0009619-ABB_L02_Read2_Sample_Library_XY102.fastq.gz
inputs/CM0009619-ABB_L01_Read1_Sample_Library_XY88.fastq.gz
inputs/CM0009619-ABB_L02_Read1_Sample_Library_XY102.fastq.gz
inputs/CM0009619-ABB_L02_Read1_Sample_Library_XY110.fastq.gz
inputs/CM0009619-ABB_L01_Read2_Sample_Library_XY88.fastq.gz
inputs/CM0009619-ABB_L02_Read2_Sample_Library_XY110.fastq.gz

rule copy_files:
  input:
    expand("inputs/CM0009619-ABB_L0{lane}_Read{read}_Sample_Library_{sample}.fastq.gz", sample=SAMPLES, read=[1,2], lane=[1,2] )   
  output:
    expand("outputs/CM0009619-ABB_L0{lane}_Read{read}_Sample_Library_{sample}.fastq.gz", sample=SAMPLES, read=[1,2], lane=[1,2] )
  shell:
    "cp {input} {output}"   

Note that in the 'real' version of this workflow I'm putting together, the samples would be read in from a list file specified in a config file. I've done it using a hardcoded tuple for simplicity.

This workflow gives the error:

$ snakemake --snakefile Snakefile --cores 1
Building DAG of jobs...
MissingInputException in line 10 of /home/UNIXHOME/sholt/smktests/Snakefile:
Missing input files for rule copy_files:
inputs/CM0009619-ABB_L02_Read2_Sample_Library_XY102.fastq.gz
inputs/CM0009619-ABB_L01_Read1_Sample_Library_XY88.fastq.gz
inputs/CM0009619-ABB_L02_Read1_Sample_Library_XY102.fastq.gz
inputs/CM0009619-ABB_L02_Read1_Sample_Library_XY110.fastq.gz
inputs/CM0009619-ABB_L01_Read2_Sample_Library_XY88.fastq.gz
inputs/CM0009619-ABB_L02_Read2_Sample_Library_XY110.fastq.gz

The expansion function for the input, as written, produces invalid combinations of sample name and lane number. I've not managed to find a way to get it to combine only lane and sample IDs which are valid, and ignore the invalid ones.

Is there a simple way to have the lane value depend on the sample value? Is Snakemake inappropriate for this scenario?

I'm using v5.26.1 of Snakemake but could consider a more recent version if it gets me to a solution.


Solution

  • The problem originally presented could be solved with a Snakefile containing the following code:

    rule all:
        input:
            expand("outputs/{file}.fastq.gz", file=glob_wildcards("inputs/{file}.fastq.gz").file)
    
    rule copy_fastq:
        input:
            "inputs/{file}.fastq.gz"
        output:
            "outputs/{file}.fastq.gz"
        shell:
            "cp {input} {output}"
    

    See also here.

    HOWEVER,
    I put in my comments below the original post how I suspect the simple test workflow provided in the post at the time of writing that 'solution' above didn't adequately address testing the full array of issues the poster faces.

    So there's more to think about my example will use the file identifiers to make sub-directories to organize the output....

    See here for a good summary where it says:

    "expand is a convenience utility, for more complex cases it's typically faster to generate the desired list directly using python"

    And since a Snakefile is a superset of Python, you can get your conditional just using an if conditional like Python does.
    Also see here. may also be in order since I suspect you want to use those three variables to make file names? Let's come back to that below. First, I want to address SAMPLES.

    Because a Snakefile is a Python file with more, you can easily put other Python in the top of your Snakefile and it will run first and set things up. For example, I'd suggest adding this instead of drafting SAMPLES by hand:

    import os
    def create_samples_dict(directory):
        samples = {}
        for filename in os.listdir(directory):
            if filename.endswith('.fastq.gz'):
                parts = filename.split('-ABB_L0')
                if len(parts) == 2:
                    lane = int(parts[1][0])
                    sample = parts[1].split('_Sample_Library_')[1].split('.')[0]
                    if sample not in samples:
                        samples[sample] = set()
                    samples[sample].add(lane)
        # Convert sets to tuples or single integers
        for sample, lanes in samples.items():
            samples[sample] = tuple(sorted(lanes)) if len(lanes) > 1 else lanes.pop()
        return samples
    SAMPLES = create_samples_dict('inputs')
    

    That way you can be confident SAMPLES has everything, no matter how big your list of files gets.

    Let's make a list of the files to use as input and make the final files and paths in output to make from with an input function. (It is called an 'input function' because the function is commonly generating input; in this case it will be an input in the 'rule all' rule.) Along the way collect all sorts of information to keep associated with that, those three variables you want. That way you can use it. I'm going to collect all you listed in the original post, but demonstrate making use of some of that information using just the sample identifier for now.

    Now that all is handled, let's get back to that input function. You can make up new file names the files in inputs using an input function, for example see here.

    Putting it all together. (I don't see the need for SAMPLES but I left it in here):

    import os
    import re
    
    # Programmatically make 'SAMPLES'
    def create_samples_dict(directory):
        samples = {}
        for filename in os.listdir(directory):
            if filename.endswith('.fastq.gz'):
                parts = filename.split('-ABB_L0')
                if len(parts) == 2:
                    lane = int(parts[1][0])
                    sample = parts[1].split('_Sample_Library_')[1].split('.')[0]
                    if sample not in samples:
                        samples[sample] = set()
                    samples[sample].add(lane)
        # Convert sets to tuples or single integers
        for sample, lanes in samples.items():
            samples[sample] = tuple(sorted(lanes)) if len(lanes) > 1 else lanes.pop()
        return samples
    SAMPLES = create_samples_dict('inputs')
    
    # Create dictionary of the files in `inputs` with the file names as keys and the three variables in a tuple
    def create_file_info_dict(directory):
        file_info = {}
        pattern = r'CM\d+-ABB_(L0\d)_(Read\d)_Sample_Library_(XY\d+)\.fastq\.gz'
    
        for filename in os.listdir(directory):
            if filename.endswith('.fastq.gz'):
                match = re.match(pattern, filename)
                if match:
                    lane, read, sample = match.groups()
                    file_info[filename] = (lane, read, sample)
    
        return file_info
    
    file_info = create_file_info_dict('inputs')
    
    def get_sample_files(wildcards):
        return [file for file, info in file_info.items() if info[2] == wildcards.sample]
    
    def get_output_files(wildcards):
        return [f"outputs/{info[2]}/{file}" for file, info in file_info.items()]
    
    rule all:
        input:
            get_output_files
    
    def get_input_file(wildcards):
        return os.path.join("inputs", wildcards.file)
    
    rule copy_and_organize:
        input:
            get_input_file
        output:
            "outputs/{sample}/{file}"
        shell:
            "cp {input} {output}"