I am building an Argo workflow to run a machine learning hyperparameter optimisation with Optuna, based on an original workflow that I found in this Medium post. The issue is that this workflow relies on Argo's steps.<STEPNAME>.ip variable, but I can't seem to make it work as I expect:
{{steps.create-postgres.ip}}
I assume it's a very simple syntax or indentation error, since I did not see anyone else reporting the same issue, but I just can't make any sense of it at the moment.
I built this tiny workflow to reproduce the issue:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyperparameter-tuning-test
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/controller-instanceid: my-ci-controller
    app: "ltr"
spec:
  serviceAccountName: argo
  entrypoint: hyperparameter-tuning
  templates:
    - name: hyperparameter-tuning
      steps:
        - - name: create-postgres
            template: create-postgres
        - - name: create-study-optuna
            template: create-study-optuna
            arguments:
              parameters:
                - name: postgres-ip
                  value: "{{steps.create-postgres.ip}}"
    - name: create-postgres
      daemon: true
      container:
        image: postgres
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 100m
            memory: 1Gi
        env:
          - name: POSTGRES_USER
            value: user
          - name: POSTGRES_PASSWORD
            value: pw
        readinessProbe:
          exec:
            command:
              - /bin/sh
              - -c
              - exec pg_isready -h 127.0.0.1 -p 5432
    - name: create-study-optuna
      inputs:
        parameters:
          - name: postgres-ip
      script:
        image: optuna/optuna:py3.10
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 200m
            memory: 1Gi
        command: [bash]
        source: |
          python -c '
          import optuna
          optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
          '
The error I get is:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "{steps.create-postgres.ip}" to address: Name or service not known
This clearly indicates that the variable was not expanded correctly.
I saw that Argo Workflows' official getting-started guide relies on the Killercoda platform. I personally used my company's cluster, which already has everything set up for Argo Workflows, but just in case, you should be able to run this workflow on this free cluster: https://killercoda.com/argoproj/course/argo-workflows/getting-started
My issue was in fact a disk write permission problem in the database's container, which prevented the database from actually starting. Strangely, the step was still marked as successful.
Because it failed, the pod never really had an IP address, so Argo could not expand the value for the next pod and passed the literal string {{steps.<STEPNAME>.ip}} instead. I personally find that Argo did not handle this case correctly at all: the step's pod does not have an IP, yet the step is reported as successful and the unexpanded placeholder is passed on as-is. This is the main answer to this Stack Overflow question.
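As a small safeguard against this, you could make the consuming step fail fast when the parameter still looks like an unexpanded placeholder. This guard is my own addition, not part of the original workflow, and the quoting of the Python one-liner is adjusted so that the shell variable expands; it is only a sketch of the idea:

- name: create-study-optuna
  inputs:
    parameters:
      - name: postgres-ip
  script:
    image: optuna/optuna:py3.10
    command: [bash]
    source: |
      ip="{{inputs.parameters.postgres-ip}}"
      # Hypothetical guard: if Argo left the placeholder unexpanded, stop here
      # with a clear message instead of failing later inside psycopg2.
      if [[ "$ip" == *"steps."* ]]; then
        echo "postgres-ip was not expanded by Argo: $ip" >&2
        exit 1
      fi
      python -c "
      import optuna
      optuna.create_study(study_name='example_study', storage='postgresql://user:pw@${ip}:5432/postgres', direction='minimize')
      "

With something like this, the failure shows up directly in the create-study-optuna step's log instead of a cryptic hostname error from psycopg2.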
For the sake of completeness, I also provide the solution for the write-access problem, even if it's not 100% related to the original question, along with the changes I had to make to the readiness probe.
To fix the issue if you are on a read-only filesystem like me, mount writable emptyDir volumes over the paths PostgreSQL needs to write to:
# volumeMounts goes inside the postgres container definition
volumeMounts:
  - name: tmp
    mountPath: "/tmp"
  - name: var-lib
    mountPath: "/var/lib/postgresql"
  - name: var-run
    mountPath: "/var/run/postgresql"
# volumes goes at the template level, next to the container definition
volumes:
  - name: tmp
    emptyDir: {}
  - name: var-lib
    emptyDir: {}
  - name: var-run
    emptyDir: {}
Then make sure that the container's user is granted write access to these folders:
securityContext:
  fsGroup: 999
  runAsGroup: 999
  runAsUser: 999
As for the probe, the syntax proposed in the original workflow did not work on my stack, so you might have to change it to something like this instead:
readinessProbe:
  exec:
    command:
      - "pg_isready"
      - "-h"
      - "127.0.0.1"
      - "-p"
      - "5432"
  timeoutSeconds: 30
As you can see, I also had to add an explicit timeout, but this might just be related to my own cluster's performance.
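If the database needs even more time to come up on a slow cluster, the standard Kubernetes probe timing fields can be tuned as well. The values below are purely illustrative, not something I needed myself:

readinessProbe:
  exec:
    command: ["pg_isready", "-h", "127.0.0.1", "-p", "5432"]
  timeoutSeconds: 30
  initialDelaySeconds: 5   # wait a bit before the first check
  periodSeconds: 10        # re-check every 10 seconds
  failureThreshold: 6      # tolerate up to 6 failed checks before giving up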
Here is the final Argo template:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyperparameter-tuning-test
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/controller-instanceid: my-ci-controller
    app: "ltr"
spec:
  workflowMetadata:
    labels:
      workflows.argoproj.io/archive-strategy: always
  podGC:
    strategy: OnPodCompletion
  archiveLogs: true
  serviceAccountName: argo
  entrypoint: hyperparameter-tuning
  templates:
    - name: hyperparameter-tuning
      steps:
        - - name: create-postgres
            template: create-postgres
        - - name: create-study-optuna
            template: create-study-optuna
            arguments:
              parameters:
                - name: postgres-ip
                  value: "{{steps.create-postgres.ip}}"
    - name: create-postgres
      daemon: true
      container:
        image: postgres
        securityContext:  # grant write access to the volumes
          fsGroup: 999    # 999, usually
          runAsGroup: 999
          runAsUser: 999  # to be sure, check: `docker run --rm -it postgres cat /etc/passwd`
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 100m
            memory: 1Gi
        env:
          - name: POSTGRES_USER
            value: user
          - name: POSTGRES_PASSWORD
            value: pw
        readinessProbe:
          exec:
            command:
              - "pg_isready"
              - "-h"
              - "127.0.0.1"
              - "-p"
              - "5432"
          timeoutSeconds: 30
        volumeMounts:
          - name: tmp
            mountPath: "/tmp"
          - name: var-lib
            mountPath: "/var/lib/postgresql"
          - name: var-run
            mountPath: "/var/run/postgresql"
      volumes:
        - name: tmp
          emptyDir: {}
        - name: var-lib
          emptyDir: {}
        - name: var-run
          emptyDir: {}
    - name: create-study-optuna
      inputs:
        parameters:
          - name: postgres-ip
      script:
        image: optuna/optuna:py3.10
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 200m
            memory: 1Gi
        command: [bash]
        source: |
          python -c '
          import optuna
          optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
          '
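If you want to try it out, one way to run the stored WorkflowTemplate is to reference it from a plain Workflow. This is just a minimal example, assuming the template above has been created in your namespace:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hyperparameter-tuning-test-
spec:
  workflowTemplateRef:
    name: hyperparameter-tuning-test   # the WorkflowTemplate defined above

The CLI equivalent is argo submit --from workflowtemplate/hyperparameter-tuning-test; depending on your setup you may also need to add the controller-instanceid label shown above.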
Hope this helps!