I am building an Argo workflow to run a machine learning hyperparameter optimisation with Optuna, based on an original workflow that I found in this Medium post. The issue is that this workflow relies on Argo's steps.<STEPNAME>.ip variable, but I can't seem to make it work as I expect:
{{steps.create-postgres.ip}}
I assume it's a very simple syntax or indentation error, since I did not see anyone else reporting the same issue, but I just can't make any sense of it at the moment.
I built this tiny workflow to reproduce the issue:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyperparameter-tuning-test
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/controller-instanceid: my-ci-controller
    app: "ltr"
spec:
  serviceAccountName: argo
  entrypoint: hyperparameter-tuning
  templates:
    - name: hyperparameter-tuning
      steps:
        - - name: create-postgres
            template: create-postgres
        - - name: create-study-optuna
            template: create-study-optuna
            arguments:
              parameters:
                - name: postgres-ip
                  value: "{{steps.create-postgres.ip}}"
    - name: create-postgres
      daemon: true
      container:
        image: postgres
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 100m
            memory: 1Gi
        env:
          - name: POSTGRES_USER
            value: user
          - name: POSTGRES_PASSWORD
            value: pw
        readinessProbe:
          exec:
            command:
              - /bin/sh
              - -c
              - exec pg_isready -h 127.0.0.1 -p 5432
    - name: create-study-optuna
      inputs:
        parameters:
          - name: postgres-ip
      script:
        image: optuna/optuna:py3.10
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 200m
            memory: 1Gi
        command: [bash]
        source: |
          python -c '
          import optuna
          optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
          '
The error I get is:
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "{steps.create-postgres.ip}" to address: Name or service not known
This clearly indicates that the variable was not expanded correctly.
I saw that Argo Workflows' official getting-started guide relies on the Killercoda platform. I personally used my company's cluster, which already has everything set up for Argo Workflows, but just in case, you should be able to run this workflow on this free cluster: https://killercoda.com/argoproj/course/argo-workflows/getting-started
My issue was in fact a disk write permission problem in the database's container, which prevented the database from actually starting. Strangely, the step was still marked as successful.
Because it failed, the pod never really had an IP address, so Argo could not expand the value for the next pod and passed the literal string {{steps.<STEPNAME>.ip}} instead. I personally find that Argo did not handle this case correctly at all: the step's pod does not have an IP, yet the step is reported as successful and the unexpanded placeholder is passed on as-is. This is the main answer to this Stack Overflow question.
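As a small safeguard against this, you could make the consuming step fail fast when the parameter still looks like an unexpanded placeholder. This guard is my own addition, not part of the original workflow, and the quoting of the Python one-liner is adjusted so that the shell variable expands; it is only a sketch of the idea:

- name: create-study-optuna
  inputs:
    parameters:
      - name: postgres-ip
  script:
    image: optuna/optuna:py3.10
    command: [bash]
    source: |
      ip="{{inputs.parameters.postgres-ip}}"
      # Hypothetical guard: if Argo left the placeholder unexpanded, stop here
      # with a clear message instead of failing later inside psycopg2.
      if [[ "$ip" == *"steps."* ]]; then
        echo "postgres-ip was not expanded by Argo: $ip" >&2
        exit 1
      fi
      python -c "
      import optuna
      optuna.create_study(study_name='example_study', storage='postgresql://user:pw@${ip}:5432/postgres', direction='minimize')
      "

With something like this, the failure shows up directly in the create-study-optuna step's log instead of a cryptic hostname error from psycopg2.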
For the sake of completeness, I also provide the solution for the write-access problem, even if it's not 100% related to the original question, along with the changes I had to make to the readiness probe.
To fix the issue if you are on a read-only filesystem like me, mount writable emptyDir volumes over the paths PostgreSQL needs to write to:
# volumeMounts goes inside the postgres container definition
volumeMounts:
  - name: tmp
    mountPath: "/tmp"
  - name: var-lib
    mountPath: "/var/lib/postgresql"
  - name: var-run
    mountPath: "/var/run/postgresql"
# volumes goes at the template level, next to the container definition
volumes:
  - name: tmp
    emptyDir: {}
  - name: var-lib
    emptyDir: {}
  - name: var-run
    emptyDir: {}
Then make sure that the container's user is granted write access to these folders:
securityContext:
  fsGroup: 999
  runAsGroup: 999
  runAsUser: 999
As for the probe, the syntax proposed in the original workflow did not work on my stack, so you might have to change it to something like this instead:
readinessProbe:
  exec:
    command:
      - "pg_isready"
      - "-h"
      - "127.0.0.1"
      - "-p"
      - "5432"
  timeoutSeconds: 30
As you can see, I also had to add an explicit timeout, but this might just be related to my own cluster's performance.
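If the database needs even more time to come up on a slow cluster, the standard Kubernetes probe timing fields can be tuned as well. The values below are purely illustrative, not something I needed myself:

readinessProbe:
  exec:
    command: ["pg_isready", "-h", "127.0.0.1", "-p", "5432"]
  timeoutSeconds: 30
  initialDelaySeconds: 5   # wait a bit before the first check
  periodSeconds: 10        # re-check every 10 seconds
  failureThreshold: 6      # tolerate up to 6 failed checks before giving up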
Here is the final Argo template:
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyperparameter-tuning-test
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/controller-instanceid: my-ci-controller
    app: "ltr"
spec:
  workflowMetadata:
    labels:
      workflows.argoproj.io/archive-strategy: always
  podGC:
    strategy: OnPodCompletion
  archiveLogs: true
  serviceAccountName: argo
  entrypoint: hyperparameter-tuning
  templates:
    - name: hyperparameter-tuning
      steps:
        - - name: create-postgres
            template: create-postgres
        - - name: create-study-optuna
            template: create-study-optuna
            arguments:
              parameters:
                - name: postgres-ip
                  value: "{{steps.create-postgres.ip}}"
    - name: create-postgres
      daemon: true
      container:
        image: postgres
        securityContext:  # grant write access to the volumes
          fsGroup: 999    # 999, usually
          runAsGroup: 999
          runAsUser: 999  # to be sure, check: `docker run --rm -it postgres cat /etc/passwd`
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 100m
            memory: 1Gi
        env:
          - name: POSTGRES_USER
            value: user
          - name: POSTGRES_PASSWORD
            value: pw
        readinessProbe:
          exec:
            command:
              - "pg_isready"
              - "-h"
              - "127.0.0.1"
              - "-p"
              - "5432"
          timeoutSeconds: 30
        volumeMounts:
          - name: tmp
            mountPath: "/tmp"
          - name: var-lib
            mountPath: "/var/lib/postgresql"
          - name: var-run
            mountPath: "/var/run/postgresql"
      volumes:
        - name: tmp
          emptyDir: {}
        - name: var-lib
          emptyDir: {}
        - name: var-run
          emptyDir: {}
    - name: create-study-optuna
      inputs:
        parameters:
          - name: postgres-ip
      script:
        image: optuna/optuna:py3.10
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 200m
            memory: 1Gi
        command: [bash]
        source: |
          python -c '
          import optuna
          optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
          '
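If you want to try it out, one way to run the stored WorkflowTemplate is to reference it from a plain Workflow. This is just a minimal example, assuming the template above has been created in your namespace:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hyperparameter-tuning-test-
spec:
  workflowTemplateRef:
    name: hyperparameter-tuning-test   # the WorkflowTemplate defined above

The CLI equivalent is argo submit --from workflowtemplate/hyperparameter-tuning-test; depending on your setup you may also need to add the controller-instanceid label shown above.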
Hope this helps!