Tags: kubernetes, argo-workflows, optuna

Argo Workflows template variable for a step's IP is not resolving


I am building an Argo workflow to run a machine learning hyperparameter optimisation with Optuna, based on this original workflow, which I found reading this Medium post. The issue is that the workflow relies on Argo's steps.<STEPNAME>.ip variable, but I can't seem to make it work as I expect.

I guess it's a very simple syntax or indentation error, as I did not see anyone else having the same issue, but I just can't make any sense of it at the moment.

I built this tiny workflow to reproduce the issue:

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: hyperparameter-tuning-test
  labels:
    workflows.argoproj.io/archive-strategy: always
    workflows.argoproj.io/controller-instanceid: my-ci-controller
    app: "ltr"
spec:
  serviceAccountName: argo
  entrypoint: hyperparameter-tuning
  templates:
    - name: hyperparameter-tuning
      steps:
        - - name: create-postgres
            template: create-postgres

        - - name: create-study-optuna
            template: create-study-optuna
            arguments:
              parameters:
                - name: postgres-ip
                  value: "{{steps.create-postgres.ip}}"

    - name: create-postgres
      daemon: true
      container:
        image: postgres
        resources:
          requests:
            cpu: 100m
            memory: 1Gi
          limits:
            cpu: 100m
            memory: 1Gi
        env:
          - name: POSTGRES_USER
            value: user
          - name: POSTGRES_PASSWORD
            value: pw
        readinessProbe:
          exec:
            command:
              - /bin/sh
              - -c
              - exec pg_isready -h 127.0.0.1 -p 5432

    - name: create-study-optuna
      inputs:
        parameters:
          - name: postgres-ip
      script:
        image: optuna/optuna:py3.10
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 200m
            memory: 1Gi
        command: [bash]
        source: |
          python -c '
          import optuna
          optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
          '
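
For reference, I submit the template with the Argo CLI, roughly like this (the exact flags depend on your setup, e.g. the controller instance id):

argo submit --from workflowtemplate/hyperparameter-tuning-test --watch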

The error I get is:

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "{steps.create-postgres.ip}" to address: Name or service not known

This clearly indicates that the variable was not expanded.

I saw that Argo Workflows' official getting started guide relies on the Killercoda platform. I personally used my company's cluster, which has everything already set up for Argo Workflows. But just in case, you might be able to run this workflow in this free cluster: https://killercoda.com/argoproj/course/argo-workflows/getting-started


Solution

  • My issue was in fact a disk write permission issue in the database's container, which prevented the database from actually starting. Weirdly, the step was still marked as successful.
    Because the pod failed, it never really had a usable IP address, so Argo could not expand the value for the next pod and passed the raw string {{steps.<STEPNAME>.ip}} through instead.
    I personally find that Argo did not handle this case well: the daemon step should arguably have been reported as failed rather than successful.
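
    The quickest way to see what is going on is to inspect the daemon pod directly, for example (the pod name is specific to your run):

        kubectl logs <postgres-pod-name>              # shows the permission error from the postgres entrypoint
        kubectl get pod <postgres-pod-name> -o wide   # check the STATUS and IP columns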

    This is the main answer to this Stack Overflow question.

    For the sake of completeness, I also provide the solution for the write access, even if it's not 100% related to the original question, as well as the issues I had with the readiness probe.

    To fix the issue if you are on a read-only filesystem like me:

            volumeMounts:                         # under the template's container
              - name: tmp
                mountPath: "/tmp"
              - name: var-lib
                mountPath: "/var/lib/postgresql"  # data directory
              - name: var-run
                mountPath: "/var/run/postgresql"  # unix socket and lock files
          volumes:                                # at the template level
            - name: tmp
              emptyDir: {}
            - name: var-lib
              emptyDir: {}
            - name: var-run
              emptyDir: {}
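
    With these mounts in place, the daemon pod's logs should show the database actually coming up:

        kubectl logs <postgres-pod-name>
        # ...
        # LOG:  database system is ready to accept connections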
    

    Then make sure that the container's user is granted write access to these folders. Note that fsGroup only exists in the pod-level security context, so in an Argo template this block goes on the template itself, next to daemon and container:

    
          securityContext:
            fsGroup: 999
            runAsGroup: 999
            runAsUser: 999
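
    The 999 here is the uid/gid of the postgres user in the official image; if in doubt, you can check it locally:

        docker run --rm postgres id postgres
        # uid=999(postgres) gid=999(postgres) groups=999(postgres),...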
    

    About the probe: it seems the syntax from my original workflow did not work on my stack, so you might have to change it to something like this instead:

            readinessProbe:
              exec:
                command:
                  - "pg_isready"
                  - "-h"
                  - "127.0.0.1"
                  - "-p"
                  - "5432"
              timeoutSeconds: 30
    

    As you can see, I also had to add an explicit timeout, but this might just be related to my own cluster's performance.
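
    If you want to sanity-check the probe command by hand, you can run it against the live pod:

        kubectl exec <postgres-pod-name> -- pg_isready -h 127.0.0.1 -p 5432
        # 127.0.0.1:5432 - accepting connections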

    Here is the final Argo template:

    apiVersion: argoproj.io/v1alpha1
    kind: WorkflowTemplate
    metadata:
      name: hyperparameter-tuning-test
      labels:
        workflows.argoproj.io/archive-strategy: always
        workflows.argoproj.io/controller-instanceid: my-ci-controller
        app: "ltr"
    spec:
      workflowMetadata:
        labels:
          workflows.argoproj.io/archive-strategy: always
      podGC:
        strategy: OnPodCompletion
      archiveLogs: true
      serviceAccountName: argo
      entrypoint: hyperparameter-tuning
      templates:
        - name: hyperparameter-tuning
          steps:
            - - name: create-postgres
                template: create-postgres
    
            - - name: create-study-optuna
                template: create-study-optuna
                arguments:
                  parameters:
                    - name: postgres-ip
                      value: "{{steps.create-postgres.ip}}"
    
        - name: create-postgres
          daemon: true
          securityContext: # pod-level security context: grants write access to the volumes
            fsGroup: 999   # 999 is the postgres uid/gid in the official image
            runAsGroup: 999
            runAsUser: 999 # (to be sure, check: `docker run --rm -it postgres cat /etc/passwd`)
          container:
            image: postgres
            resources:
              requests:
                cpu: 100m
                memory: 1Gi
              limits:
                cpu: 100m
                memory: 1Gi
            env:
              - name: POSTGRES_USER
                value: user
              - name: POSTGRES_PASSWORD
                value: pw
            readinessProbe:
              exec:
                command:
                  - "pg_isready"
                  - "-h"
                  - "127.0.0.1"
                  - "-p"
                  - "5432"
              timeoutSeconds: 30
            volumeMounts:
              - name: tmp
                mountPath: "/tmp"
              - name: var-lib
                mountPath: "/var/lib/postgresql"
              - name: var-run
                mountPath: "/var/run/postgresql"
          volumes:
            - name: tmp
              emptyDir: {}
            - name: var-lib
              emptyDir: {}
            - name: var-run
              emptyDir: {}
    
        - name: create-study-optuna
          inputs:
            parameters:
              - name: postgres-ip
          script:
            image: optuna/optuna:py3.10
            resources:
              requests:
                cpu: 200m
                memory: 1Gi
              limits:
                cpu: 200m
                memory: 1Gi
            command: [bash]
            source: |
              python -c '
              import optuna
              optuna.create_study(study_name="example_study", storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres", direction="minimize")
              '
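
    From there, the actual optimisation would happen in additional worker steps that connect to the same storage. That is outside the scope of this question, but as a minimal sketch (the objective below is just a hypothetical placeholder, and I assume the same postgres-ip input parameter), a worker's script source could look like:

        python -c '
        import optuna

        def objective(trial):
            # hypothetical placeholder: replace with your real training and metric
            x = trial.suggest_float("x", -10, 10)
            return (x - 2) ** 2

        # attach to the study created earlier through the same postgres storage
        study = optuna.load_study(
            study_name="example_study",
            storage="postgresql://user:pw@{{inputs.parameters.postgres-ip}}:5432/postgres",
        )
        study.optimize(objective, n_trials=20)
        '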
    

    Hope this helps!