Tags: apache-spark, kubernetes, kerberos, argo-workflows, argo

KerberosAuthException when running PySpark in Argo Workflows


I have searched the web extensively and tried to debug this issue, unfortunately in vain.

I have created a simple dockerized PySpark application which I am trying to run in Argo Workflows. The application just creates a DataFrame and prints it, nothing more, and it runs properly when I deploy it manually in the Kubernetes cluster. However, when I run the same Docker image in the same namespace and cluster using Argo Workflows, I get the KerberosAuthException below. Can anyone point me to what I should do? My application does not use Kerberos at all.

hello-world-hp2z4: py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
hello-world-hp2z4: : org.apache.hadoop.security.KerberosAuthException: failure to login: using ticket cache file: FILE:/tmp/krb5cc_0 javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
hello-world-hp2z4:      at jdk.security.auth/com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:67)
hello-world-hp2z4:      at jdk.security.auth/com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:134)
hello-world-hp2z4:      at java.base/javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
hello-world-hp2z4:      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:679)
hello-world-hp2z4:      at java.base/javax.security.auth.login.LoginContext$4.run(LoginContext.java:677)
hello-world-hp2z4:      at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
hello-world-hp2z4:      at java.base/javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:677)

As I said, this only happens when I run it via Argo. Otherwise the application runs perfectly as a standalone pod in Kubernetes. Any help is appreciated!

Pod description when manually running through k8s (successful):

Name:                 test-spark-pod
Namespace:            posas-accsecana-argowf-qa
Priority:             600000000
Priority Class Name:  application-default
Service Account:      default
Node:                 kworker-be-intg-iz1-bs017/10.242.8.5
Start Time:           Tue, 12 Nov 2024 10:21:42 +0100
Labels:               <none>
Annotations:          cni.projectcalico.org/containerID: a0957c48cfb01b4d155a2fa1a2ac52b269b1858085d8fae82cc05acba4bcf70b
                      cni.projectcalico.org/podIP: 100.67.81.137/32
                      cni.projectcalico.org/podIPs: 100.67.81.137/32
                      kubernetes.io/limit-ranger:
                        LimitRanger plugin set: cpu, ephemeral-storage, memory request for container test-spark-container01; cpu, ephemeral-storage, memory limit ...
Status:               Running
SeccompProfile:       RuntimeDefault
IP:                   100.67.81.137
IPs:
  IP:  100.67.81.137
Containers:
  test-spark-container01:
    Container ID:   containerd://bddd04b8311e340e8eef70747ddf12f0028553960d54d0f6a9608540e25eb124
    Image:          docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest
    Image ID:       docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos@sha256:546428e6d40b9cee30e017da38c922a2e67390ab63161ed3dfa4f19000977b21
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Nov 2024 10:33:56 +0100
      Finished:     Tue, 12 Nov 2024 10:34:06 +0100
    Ready:          False
    Restart Count:  7
    Limits:
      cpu:                2
      ephemeral-storage:  10Gi
      memory:             13Gi
    Requests:
      cpu:                200m
      ephemeral-storage:  300Mi
      memory:             1Gi
    Environment:          <none>
    Mounts:
      /app/tmp/spark from spark-tmp-volume (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-442gc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  spark-tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-442gc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  17m                   default-scheduler  Successfully assigned posas-accsecana-argowf-qa/test-spark-pod to kworker-be-intg-iz1-bs017
  Normal   Pulled     17m                   kubelet            Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 69ms (69ms including waiting)
  Normal   Pulled     16m                   kubelet            Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 77ms (77ms including waiting)
  Normal   Created    16m (x4 over 17m)     kubelet            Created container test-spark-container01
  Normal   Started    16m (x4 over 17m)     kubelet            Started container test-spark-container01
  Normal   Pulled     16m                   kubelet            Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 56ms (56ms including waiting)
  Normal   Pulling    15m (x5 over 17m)     kubelet            Pulling image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest"
  Normal   Pulled     15m (x2 over 17m)     kubelet            Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 70ms (70ms including waiting)
  Warning  BackOff    2m22s (x63 over 17m)  kubelet            Back-off restarting failed container test-spark-container01 in pod test-spark-pod_posas-accsecana-argowf-qa(f0fedf94-1a04-449e-a298-449bb356292b)

Pod description when running through Argo Workflows (exception):

Name:                 hello-world-bzzkf
Namespace:            posas-accsecana-argowf-qa
Priority:             600000000
Priority Class Name:  application-default
Service Account:      default
Node:                 kworker-be-intg-iz1-bs017/10.242.8.5
Start Time:           Tue, 12 Nov 2024 09:47:50 +0100
Labels:               mam_brand=any
                      mam_dc=bs
                      mam_stage=qa
                      workflows.argoproj.io/completed=true
                      workflows.argoproj.io/controller-instanceid=posas-accsecana-argowf-qa
                      workflows.argoproj.io/workflow=hello-world-bzzkf
Annotations:          cni.projectcalico.org/containerID: 79c1431e821c7bc1166c10eed57f85a7242d59de0b49d735ad3f19efafc98649
                      cni.projectcalico.org/podIP: 
                      cni.projectcalico.org/podIPs: 
                      kubectl.kubernetes.io/default-container: main
                      kubernetes.io/limit-ranger:
                        LimitRanger plugin set: ephemeral-storage request for container wait; ephemeral-storage limit for container wait; cpu, ephemeral-storage, ...
                      workflows.argoproj.io/node-id: hello-world-bzzkf
                      workflows.argoproj.io/node-name: hello-world-bzzkf
Status:               Failed
SeccompProfile:       RuntimeDefault
IP:                   100.67.81.91
IPs:
  IP:           100.67.81.91
Controlled By:  Workflow/hello-world-bzzkf
Init Containers:
  init:
    Container ID:  containerd://37acc8242db4c6bf9143b06b003760a40bd2f4165e15929f78569bf75cde4ece
    Image:         cr.mam.dev/internal/mf/commons/argoexec:latest
    Image ID:      cr.mam.dev/internal/mf/commons/argoexec@sha256:20a7f519ee4d825e5ae4d2693e7fb69f6f16f64fcab605b6400b86afb1a78362
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Nov 2024 09:47:52 +0100
      Finished:     Tue, 12 Nov 2024 09:47:52 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  10Gi
      memory:             512Mi
    Requests:
      cpu:                500m
      ephemeral-storage:  300Mi
      memory:             512Mi
    Environment:
      ARGO_POD_NAME:                      hello-world-bzzkf (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 hello-world-bzzkf
      ARGO_WORKFLOW_UID:                  eb02ba92-2b43-4c64-9f1e-85747bb27a34
      ARGO_INSTANCE_ID:                   posas-accsecana-argowf-qa
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
      ARGO_NODE_ID:                       hello-world-bzzkf
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
Containers:
  wait:
    Container ID:  containerd://b742467b6b267cc6c82c2a02b821d980491a7113c6cebaaeeb4243cf9fd9f480
    Image:         cr.mam.dev/internal/mf/commons/argoexec:latest
    Image ID:      cr.mam.dev/internal/mf/commons/argoexec@sha256:20a7f519ee4d825e5ae4d2693e7fb69f6f16f64fcab605b6400b86afb1a78362
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Error
      Message:      pods "hello-world-bzzkf" is forbidden: User "system:serviceaccount:posas-accsecana-argowf-qa:default" cannot patch resource "pods" in API group "" in the namespace "posas-accsecana-argowf-qa"
      Exit Code:    1
      Started:      Tue, 12 Nov 2024 09:47:53 +0100
      Finished:     Tue, 12 Nov 2024 09:47:58 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  10Gi
      memory:             512Mi
    Requests:
      cpu:                500m
      ephemeral-storage:  300Mi
      memory:             512Mi
    Environment:
      ARGO_POD_NAME:                      hello-world-bzzkf (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 hello-world-bzzkf
      ARGO_WORKFLOW_UID:                  eb02ba92-2b43-4c64-9f1e-85747bb27a34
      ARGO_INSTANCE_ID:                   posas-accsecana-argowf-qa
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
      ARGO_NODE_ID:                       hello-world-bzzkf
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /mainctrfs/app/tmp/spark from spark-tmp-volume (rw)
      /mainctrfs/tmp from tmp-volume (rw)
      /tmp from tmp-dir-argo (rw,path="0")
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
  main:
    Container ID:  containerd://bf8da42bcf0b4a210e5fbb8206d2a11660641c2bf7c06f1adebe00c0d04e122b
    Image:         docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest
    Image ID:      docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos@sha256:546428e6d40b9cee30e017da38c922a2e67390ab63161ed3dfa4f19000977b21
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      python3
      pyspark_script.py
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 12 Nov 2024 09:47:54 +0100
      Finished:     Tue, 12 Nov 2024 09:47:57 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                2
      ephemeral-storage:  10Gi
      memory:             13Gi
    Requests:
      cpu:                200m
      ephemeral-storage:  300Mi
      memory:             1Gi
    Environment:
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      {"name":"whalesay","inputs":{},"outputs":{},"metadata":{"labels":{"mam_brand":"any","mam_dc":"bs","mam_stage":"qa"}},"container":{"name":"","image":"docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest","command":["python3","pyspark_script.py"],"resources":{},"volumeMounts":[{"name":"tmp-volume","mountPath":"/tmp"},{"name":"spark-tmp-volume","mountPath":"/app/tmp/spark"}]}}
      ARGO_NODE_ID:                       hello-world-bzzkf
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /app/tmp/spark from spark-tmp-volume (rw)
      /tmp from tmp-volume (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-pqw7g (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-dir-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  spark-tmp-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-pqw7g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m57s  default-scheduler  Successfully assigned posas-accsecana-argowf-qa/hello-world-bzzkf to kworker-be-intg-iz1-bs017
  Normal  Pulling    2m57s  kubelet            Pulling image "cr.mam.dev/internal/mf/commons/argoexec:latest"
  Normal  Pulled     2m56s  kubelet            Successfully pulled image "cr.mam.dev/internal/mf/commons/argoexec:latest" in 299ms (299ms including waiting)
  Normal  Created    2m56s  kubelet            Created container init
  Normal  Started    2m56s  kubelet            Started container init
  Normal  Pulling    2m55s  kubelet            Pulling image "cr.mam.dev/internal/mf/commons/argoexec:latest"
  Normal  Pulled     2m55s  kubelet            Successfully pulled image "cr.mam.dev/internal/mf/commons/argoexec:latest" in 98ms (98ms including waiting)
  Normal  Created    2m55s  kubelet            Created container wait
  Normal  Started    2m55s  kubelet            Started container wait
  Normal  Pulling    2m55s  kubelet            Pulling image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest"
  Normal  Pulled     2m54s  kubelet            Successfully pulled image "docker.mamdev.server.lan/internal/csu/ana/pyspark-kerberos:latest" in 93ms (93ms including waiting)
  Normal  Created    2m54s  kubelet            Created container main
  Normal  Started    2m54s  kubelet            Started container main

Python code:

# pyspark_script.py
import os
from pyspark.sql import SparkSession
print("Starting PySpark application...")
print(os.environ['JAVA_HOME'])

# Create a Spark session
spark = SparkSession.builder \
    .appName('pyspark-kerberos') \
    .master('local[2]') \
    .config('spark.executor.instances', 2) \
    .config('spark.executor.cores', 2) \
    .config('spark.executor.memory', '5g') \
    .config("spark.jars.ivy", "/app/tmp/spark")\
    .getOrCreate()

# spark.sparkContext.setLogLevel("DEBUG")

# Sample DataFrame
data = [("Alice", 29), ("Bob", 31), ("Cathy", 27)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Print the DataFrame
df.show()

# Stop the Spark session
spark.stop()

Solution

  • I was able to solve this problem, so I am posting the answer here too:

    The problem was indeed with the Unix principal when Spark runs in Docker containers. However, I had already tried adding a username in the Docker image, along with other suggestions from Stack Overflow, and nothing seemed to work.

    I got a hint from reading this: it looked like the Argo container was not able to provide a username to the Spark container. I therefore added the following to the Argo YAML, under the template's container section:

     securityContext:
       runAsUser: 1000
       runAsGroup: 3000
    

    and voilà! It got past the KerberosAuthException.
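
    If you want to check whether you are hitting the same thing, one option is to print the UID the container process runs as and see whether it resolves to a user name. This is only a minimal sketch, assuming the failure comes from a UID that has no passwd entry inside the container (which is consistent with the "invalid null input: name" in the stack trace, but not confirmed here). The helper name resolve_username is made up for illustration.

```python
# check_user.py (hypothetical diagnostic script, not part of the original app)
import os
import pwd


def resolve_username(uid):
    """Return the passwd user name for uid, or None if there is no entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd entry for this UID inside the container.
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = resolve_username(uid)
    print(f"uid={uid} gid={os.getgid()} user={name!r}")
    if name is None:
        # Assumption: a JVM-side user lookup would also come back empty here.
        print("No passwd entry for this UID; the user name cannot be resolved.")
```

    Running this as the command in both the plain pod and the Argo template should show whether the two environments start the container under different UIDs.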