ubuntu unix command-line-interface bioinformatics snakemake

Multiple variables in input and output using wildcards snakemake

rule all:
  input:
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R1_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library']),
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R2_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library']),
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R3_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library'])

rule rename_atac_files:
    input:
      "tissues/{id}/atac_seq/{srr}/{srr}_1.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_2.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_3.fastq.gz"
    output:
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R1_001.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R2_001.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R3_001.fastq.gz"
    shell:
      """
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_1.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R1_001.fastq.gz
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_2.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R2_001.fastq.gz
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_3.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R3_001.fastq.gz
      """

Error:Missing input files for rule rename_atac_files:

I want to rename all the files in the respective folders such that it is in the format required for cellranger. The problem is, I have multiple variables in the input and output paths (i.e. "id" and "srr). Thus, when I execute the script, I get an error since some jobs look for lets say "tissues/A/atac_seq/SRR123/SRR123_1.fastq.gz" when it only exists in id "B": "tissues/B/atac_seq/SRR123/SRR123_1.fastq.gz".

How can I fix this problem such that snakemake knows to only execute the job for the {id} and {srr} that exists? it would be best if the directory structure is maintained.

Solution

Renaming of input files is not something Snakemake does particularly well. I'd advise fixing the file names first outside of Snakemake, using a regular shell script, then running the Snakemake pipeline after that.

If you have the "prename" command available (apt install rename in Debian/Ubuntu) this can be a one-liner, something like:

$ prename -n -v 's/_(.).fastq.gz/_S1_L001_R\1_001.fastq.gz/' tissues/*/atac_seq/*/*_?.fastq.gz

In your example you have three reads and you want to rename them R1, R2, and R3. This looks a bit funny to me. Isn't one of them an index read - ie. it should be read I1?