ubuntuunixcommand-line-interfacebioinformaticssnakemake

Multiple variables in input and output using wildcards snakemake


rule all:
  input:
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R1_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library']),
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R2_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library']),
      expand("tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R3_001.fastq.gz", id =IDs, srr = df[df['library_type']=='Chromatin Accessibility']['Library'])

rule rename_atac_files:
    input:
      "tissues/{id}/atac_seq/{srr}/{srr}_1.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_2.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_3.fastq.gz"
    output:
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R1_001.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R2_001.fastq.gz",
      "tissues/{id}/atac_seq/{srr}/{srr}_S1_L001_R3_001.fastq.gz"
    shell:
      """
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_1.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R1_001.fastq.gz
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_2.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R2_001.fastq.gz
      mv tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_3.fastq.gz tissues/{wildcards.id}/atac_seq/{wildcards.srr}/{wildcards.srr}_S1_L001_R3_001.fastq.gz
      """

Error:Missing input files for rule rename_atac_files:


     

I want to rename all the files in the respective folders such that it is in the format required for cellranger. The problem is, I have multiple variables in the input and output paths (i.e. "id" and "srr). Thus, when I execute the script, I get an error since some jobs look for lets say "tissues/A/atac_seq/SRR123/SRR123_1.fastq.gz" when it only exists in id "B": "tissues/B/atac_seq/SRR123/SRR123_1.fastq.gz".

How can I fix this problem such that snakemake knows to only execute the job for the {id} and {srr} that exists? it would be best if the directory structure is maintained.


Solution

  • Renaming of input files is not something Snakemake does particularly well. I'd advise fixing the file names first outside of Snakemake, using a regular shell script, then running the Snakemake pipeline after that.

    If you have the "prename" command available (apt install rename in Debian/Ubuntu) this can be a one-liner, something like:

    $ prename -n -v 's/_(.).fastq.gz/_S1_L001_R\1_001.fastq.gz/' tissues/*/atac_seq/*/*_?.fastq.gz
    

    In your example you have three reads and you want to rename them R1, R2, and R3. This looks a bit funny to me. Isn't one of them an index read - ie. it should be read I1?