Question
What should I use for --index-url and --trusted-host to install Python packages from AWS CodeArtifact in an AWS Glue (Spark) job, when using the method described in the section "Setting up a CodeArtifact mirror of pypi attached to your VPC" of this AWS Glue Developer Guide page?
Background
I am attempting to run an AWS Glue Spark job that needs to import private packages. The packages are available in a repo in AWS CodeArtifact. I am able to install the packages from an EC2 instance; this question is about installing them for use in AWS Glue.
I was able to install the packages by specifying the following job parameters:
--additional-python-modules set to the package name
--python-modules-installer-option set to --index-url=https://aws:<codeartifact token>@<domain name>-<domain-owner>.d.codeartifact.<region>.amazonaws.com/pypi/<repo name>/simple/
The problem is that the CodeArtifact token expires every 12 hours. I can get it to work by manually generating the token and replacing it in the job parameter, but that's not ideal.
This AWS post suggests using AWS Step Functions to first get the token, use that to formulate the index URL, then trigger the job and pass it the index URL. That seems incredibly clunky!
So then I came across that page in the AWS Glue Developer guide which suggests setting up a VPC endpoint for CodeArtifact. But it's not clear where to get the URLs (which look like S3 URLs) used to specify --index-url.
Has anybody solved this problem?
NOTE: while this question is specifically about what to use for --index-url, I'm also open to other approaches to solve the underlying problem of how to install packages from CodeArtifact for use in an AWS Glue job.
CodeArtifact token expires every 12 hours
Correct: CodeArtifact tokens have a minimum validity of 15 minutes and a maximum validity of 12 hours. There's no workaround for that limit.
This AWS post suggests using AWS Step Functions to first get the token, use that to formulate the index URL, then trigger the job and pass it the index URL. That seems incredibly clunky!
You've discovered the main pain point of using AWS CodeArtifact: there's no auto-refresh capability by default.
Other community workarounds, which don't apply to AWS Glue, include custom Java Gradle plugins that auto-refresh the token, scripts running on a cron schedule, and writing the token to a file and checking its last-modified time before pulling from the repo.
Currently, there isn't a native integration between AWS CodeArtifact & AWS Glue that manages the token lifecycle in the background.
If you want to use CodeArtifact, you must generate a new token at most every 12 hours.
With no native integration, the solution mentioned in the blog is a viable one.
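For a sense of what that flow involves, here is a minimal Python sketch of the same three steps (get a token, build the index URL, start the job with it). It is not the blog's implementation. The helper names `build_index_url` and `start_job_with_fresh_token` are hypothetical, and the function expects boto3-style `codeartifact` and `glue` clients to be passed in; the calls used on them (`get_authorization_token`, `get_repository_endpoint`, `start_job_run`) are standard boto3 operations.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def build_index_url(repository_endpoint: str, token: str) -> str:
    """Embed the token as the password of user 'aws' and append 'simple/'."""
    parts = urlsplit(repository_endpoint)
    netloc = f"aws:{quote(token, safe='')}@{parts.netloc}"
    path = parts.path.rstrip("/") + "/simple/"
    return urlunsplit((parts.scheme, netloc, path, "", ""))


def start_job_with_fresh_token(codeartifact, glue, *, domain, domain_owner,
                               repository, job_name, modules,
                               duration_seconds=43200):
    """Fetch a new token, then start the Glue job with a matching index URL.

    `codeartifact` and `glue` are assumed to be boto3 clients, e.g.
    boto3.client("codeartifact") and boto3.client("glue").
    """
    # 43200 seconds is the 12-hour maximum; 900 seconds is the minimum.
    token = codeartifact.get_authorization_token(
        domain=domain,
        domainOwner=domain_owner,
        durationSeconds=duration_seconds,
    )["authorizationToken"]
    endpoint = codeartifact.get_repository_endpoint(
        domain=domain,
        domainOwner=domain_owner,
        repository=repository,
        format="pypi",
    )["repositoryEndpoint"]
    arguments = {
        "--additional-python-modules": modules,
        "--python-modules-installer-option":
            "--index-url=" + build_index_url(endpoint, token),
    }
    # Arguments passed here override the job's default arguments for this run.
    return glue.start_job_run(JobName=job_name, Arguments=arguments)
```

Because the token is generated immediately before each run, a short `duration_seconds` is usually enough; it only has to outlive the package installation at job start-up. The same function could run from a Lambda, a scheduler, or a Step Functions task, whichever already triggers the job.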
So then I came across that page in the AWS Glue Developer guide which suggests setting up a VPC endpoint for CodeArtifact. But it's not clear where to get the URLs (which look like S3 URLs) used to specify --index-url.
AWS knows about the lack of a native integration, hence the "Setting up an Amazon S3 bucket to host a targeted PyPI/simple repo" section in the docs, although the docs are currently really, really, really bad.
In this scenario, you're not using CodeArtifact.
You're using s3pypi to essentially host a Python Package Repository in an S3 bucket. Note there isn't a sync option here between CodeArtifact and the S3 bucket; you have to manually keep the bucket updated with the latest versions of your packages.
--index-url is the 'base URL of the Python Package Index'.
--trusted-host is used to 'mark this host or host:port pair as trusted, even though it does not have valid (certificate) or any HTTPS'.
In AWS terms, --index-url needs to be set to the S3 static website endpoint, and --trusted-host needs to be set to that same endpoint without the scheme (i.e. with the leading http:// or https:// removed).
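As a rough illustration of how the two values relate, the sketch below derives both pip options from a single website endpoint URL. The function name `pip_options_for_s3_website` and the bucket name in the usage note are hypothetical, and it assumes the package index is served from the root of the website endpoint.

```python
from urllib.parse import urlsplit


def pip_options_for_s3_website(endpoint: str) -> str:
    """Build the pip options string from an S3 static website endpoint URL."""
    parts = urlsplit(endpoint)
    if not parts.scheme or not parts.hostname:
        raise ValueError("expected a full URL including the scheme, e.g. http://host/")
    # --index-url keeps the scheme; --trusted-host is the bare host name.
    index_url = endpoint if endpoint.endswith("/") else endpoint + "/"
    return f"--index-url={index_url} --trusted-host={parts.hostname}"
```

For a hypothetical bucket, `pip_options_for_s3_website("http://my-bucket.s3-website-us-east-1.amazonaws.com")` yields a single string that can be used as the value of the --python-modules-installer-option job parameter, with --additional-python-modules still set to the package name.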
And since Glue jobs can run in a VPC, referring to an answer like this should allow you to lock the package repository down to just your private VPC.