amazon-s3, aws-fargate, snakemake, aws-batch

Snakemake running as an AWS Batch job or an AWS Fargate task raises MissingInputException for inputs stored on an S3 bucket


We have a Dockerized Snakemake pipeline with the input data stored on an S3 bucket, snakemake-bucket:

Snakefile:

rule bwa_map:
    input:
        "data/genome.fa"
    output:
        "results/mapped/A.bam"
    shell:
        "cat {input} > {output}"

Dockerfile:

FROM snakemake/snakemake:v8.15.2
RUN mamba install -c conda-forge -c bioconda snakemake-storage-plugin-s3
WORKDIR /app
COPY ./workflow ./workflow
ENV PYTHONWARNINGS="ignore:Unverified HTTPS request"
CMD ["snakemake","--default-storage-provider","s3","--default-storage-prefix","s3://snakemake-bucket","results/mapped/A.bam","--cores","1","--verbose","--printshellcmds"]

When we run the container with the following command, it downloads the input file, runs the pipeline and stores the output on the bucket successfully:

docker run -it -e SNAKEMAKE_STORAGE_S3_ACCESS_KEY=**** -e SNAKEMAKE_STORAGE_S3_SECRET_KEY=****  our-snakemake:v0.0.10
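For reference, the same two variables are passed to the task through the job definition's container environment, e.g. roughly as below (abridged; the account ID, role ARN, subnet setup and resource sizes are placeholders, not our actual values):

aws batch register-job-definition \
    --job-definition-name our-snakemake \
    --type container \
    --platform-capabilities FARGATE \
    --container-properties '{
        "image": "our-snakemake:v0.0.10",
        "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"}
        ],
        "environment": [
            {"name": "SNAKEMAKE_STORAGE_S3_ACCESS_KEY", "value": "****"},
            {"name": "SNAKEMAKE_STORAGE_S3_SECRET_KEY", "value": "****"}
        ]
    }'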

However, when we deploy it as an AWS Batch job or an AWS Fargate task, it fails immediately with the following error:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Full Traceback (most recent call last):
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/cli.py", line 2103, in args_to_api
    dag_api.execute_workflow(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 594, in execute_workflow
    workflow.execute(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1081, in execute
    self._build_dag()
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1037, in _build_dag
    async_run(self.dag.init())
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/common/__init__.py", line 94, in async_run
    return asyncio.run(coroutine)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 183, in init
    job = await self.update(
          ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1013, in update
    raise exceptions[0]
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 970, in update
    await self.update_(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1137, in update_
    raise MissingInputException(job, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

MissingInputException in rule bwa_map in file /app/workflow/Snakefile, line 10:
Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

Any ideas would be appreciated.


Solution

  • It seems AWS Fargate (and AWS Batch running on Fargate) sets some environment variables, including AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, and with that variable present boto3 decides it also needs AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in addition to SNAKEMAKE_STORAGE_S3_ACCESS_KEY and SNAKEMAKE_STORAGE_S3_SECRET_KEY. So if you want to run Snakemake on AWS Fargate, you either have to set all four variables, or unset AWS_CONTAINER_CREDENTIALS_RELATIVE_URI in your Docker entrypoint.sh (a sketch follows).
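For the second option, a minimal entrypoint.sh sketch; the Snakemake arguments just mirror the CMD from the Dockerfile above, so adjust them to your workflow:

entrypoint.sh:

#!/usr/bin/env bash
set -euo pipefail

# Fargate/Batch injects this variable; as described above, unsetting it
# avoids boto3 demanding the extra AWS_* credentials.
unset AWS_CONTAINER_CREDENTIALS_RELATIVE_URI

exec snakemake \
    --default-storage-provider s3 \
    --default-storage-prefix s3://snakemake-bucket \
    results/mapped/A.bam \
    --cores 1 --verbose --printshellcmds

In the Dockerfile, replace the CMD line with:

COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

For the first option, instead of passing the same keys twice, the entrypoint could map the plugin variables onto the standard boto3 ones, e.g. export AWS_ACCESS_KEY_ID="$SNAKEMAKE_STORAGE_S3_ACCESS_KEY" and export AWS_SECRET_ACCESS_KEY="$SNAKEMAKE_STORAGE_S3_SECRET_KEY" (assuming those are ordinary IAM user credentials).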