apache-spark | kubernetes | pyspark | google-kubernetes-engine

Apache Spark on k8s (GKE) - files copied to /opt/spark/work-dir not showing up in deployment


I have Apache Spark deployed on Kubernetes (GKE), and I've created a Docker image with the required files copied to /opt/spark/work-dir.

When I log on to the Docker image, I can see the files.

However, after deploying the Docker image, I don't see the contents in /opt/spark/work-dir.

Here are the details:

Dockerfile

# Use an official Apache Spark base image
FROM apache/spark:3.5.0

# Switch to root user to install additional dependencies
USER root

# Set the working directory
WORKDIR /opt/spark/work-dir

# Install necessary tools

# Set environment variables
ENV SPARK_HOME /opt/spark
ENV PATH $PATH:$SPARK_HOME/bin

RUN mkdir -p /opt/spark/custom-dir

# Copy your application files
COPY main.py .

COPY streams.zip .
COPY utils.zip .

COPY gcs-connector-hadoop3-latest.jar /opt/spark/jars/

# Copy configuration files
COPY log4j-driver.properties /opt/spark/conf/
COPY log4j-executor.properties /opt/spark/conf/

COPY params.cfg .
COPY params_password.cfg .

COPY kafka-certs/*.p12 .

COPY jars/* .

# Set correct permissions
RUN chown -R spark:spark /opt/spark/work-dir /opt/spark/custom-dir /opt/spark/conf /opt/spark/jars && \
    chmod -R 755 /opt/spark/work-dir /opt/spark/conf /opt/spark/jars /opt/spark/custom-dir

# Switch back to spark user
USER spark

Logging in to the Docker image to check the files:

(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ docker run -it --entrypoint /bin/bash --platform linux/amd64 us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2
spark@7c5a91a4c77c:/opt/spark/work-dir$ pwd
/opt/spark/work-dir
spark@7c5a91a4c77c:/opt/spark/work-dir$ ls -lrt
total 3112
-rwxr-xr-x 1 spark spark  206041 Sep 24 22:21 spark-avro_2.12-3.5.0.jar
-rwxr-xr-x 1 spark spark   17299 Sep 24 22:27 main.py
-rwxr-xr-x 1 spark spark    2954 Sep 24 22:30 alarm-compression-user-test.p12
-rwxr-xr-x 1 spark spark    2908 Sep 24 22:30 alarmblock-user-test.p12
-rwxr-xr-x 1 spark spark    2902 Sep 24 22:30 anomaly-test-user.p12
-rwxr-xr-x 1 spark spark    2926 Sep 24 22:30 appstat-agg-user-test.p12
-rwxr-xr-x 1 spark spark    2934 Sep 24 22:30 appstat-anomaly-user-test.p12
-rwxr-xr-x 1 spark spark    2904 Sep 24 22:30 appstats-user-test.p12
-rwxr-xr-x 1 spark spark    2904 Sep 24 22:30 insights-user-test.p12
-rwxr-xr-x 1 spark spark    2904 Sep 24 22:30 intfutil-user-test.p12
-rwxr-xr-x 1 spark spark    2900 Sep 24 22:30 issues-test-user.p12
-rwxr-xr-x 1 spark spark    2952 Sep 24 22:30 versa-alarmblock-test-user.p12
-rwxr-xr-x 1 spark spark    2930 Sep 24 22:30 versa-appstat-test-user.p12
-rwxr-xr-x 1 spark spark    2934 Sep 24 22:30 versa-bandwidth-test-user.p12
-rwxr-xr-x 1 spark spark    1702 Sep 24 22:30 vkp-test-tf-ca.p12
-rwxr-xr-x 1 spark spark    1702 Sep 24 22:32 versa-kafka-poc-tf-ca.p12
-rwxr-xr-x 1 spark spark    2904 Sep 24 22:34 syslog-vani-prefix.p12
-rwxr-xr-x 1 spark spark    6313 Sep 24 22:36 params.cfg
-rwxr-xr-x 1 spark spark    2550 Sep 24 22:37 params_password.cfg
-rwxr-xr-x 1 spark spark   69057 Sep 25 19:16 streams.zip
-rwxr-xr-x 1 spark spark    7064 Sep 25 19:16 utils.zip
-rwxr-xr-x 1 spark spark  553254 Sep 25 22:04 mongo-spark-connector_2.12-3.0.2.jar
-rwxr-xr-x 1 spark spark  491942 Sep 25 22:04 bson-4.0.5.jar
-rwxr-xr-x 1 spark spark  136522 Sep 25 22:04 mongodb-driver-sync-4.0.5.jar
-rwxr-xr-x 1 spark spark 1613403 Sep 25 22:05 mongodb-driver-core-4.0.5.jar

YAML for deploying the image:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: structured-streaming-main-{{ now | unixEpoch }}
  namespace: {{ .Values.namespace }}
spec:
  type: Python
  mode: cluster
  image: "us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2"
  imagePullPolicy: Always
  imagePullSecrets:
    - {{ .Values.imagePullSecret }}
  mainApplicationFile: "{{ .Values.mainApplicationFile }}"
  sparkVersion: "{{ .Values.sparkVersion }}"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: {{ .Values.driver.cores }}
    coreLimit: "{{ .Values.driver.coreLimit }}"
    memory: "{{ .Values.driver.memory }}"
    labels:
      version: "{{ .Values.sparkVersion }}"
    serviceAccount: spark
    volumeMounts:
      - name: gcs-key
        mountPath: /etc/secrets
        readOnly: true
      - name: work-dir
        mountPath: /opt/spark/work-dir
    initContainers:
      - name: init-check
        image: "us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2"
        command: ["sh", "-c", "ls -lrt /opt/spark/work-dir/"]
        volumeMounts:
          - name: work-dir
            mountPath: /opt/spark/work-dir
  executor:
    cores: {{ .Values.executor.cores }}
    instances: {{ .Values.executor.instances }}
    memory: "{{ .Values.executor.memory }}"
    labels:
      version: "{{ .Values.sparkVersion }}"
    volumeMounts:
      - name: gcs-key
        mountPath: /etc/secrets
        readOnly: true
      - name: work-dir
        mountPath: /opt/spark/work-dir
  volumes:
    - name: gcs-key
      secret:
        secretName: {{ .Values.gcsKeySecret }}
    # - name: custom-dir
    - name: work-dir
  deps:
    jars:
      - local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar
    pyFiles:
      - local:///opt/spark/work-dir/streams.zip
      - local:///opt/spark/work-dir/utils.zip
  pythonVersion: "{{ .Values.pythonVersion }}"
  sparkConf:
    "spark.kubernetes.namespace": "spark-operator"
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark"
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "spark.hadoop.google.cloud.auth.service.account.enable": "true"
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/etc/secrets/spark-gcs-key.json"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "{{ .Values.sparkEventLogDir }}"
    "spark.hadoop.fs.gs.auth.type": "SERVICE_ACCOUNT_JSON_KEYFILE"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/etc/secrets/spark-gcs-key.json"

Error I get:

(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ kc logs -f structured-streaming-main-1727333392-driver -n spark-operator -c init-check
total 0

(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ kc logs -f  svc/structured-streaming-main-1727333392-ui-svc -n spark-operator
Files local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar from /opt/spark/jars/gcs-connector-hadoop3-latest.jar to /opt/spark/work-dir/gcs-connector-hadoop3-latest.jar
Files local:///opt/spark/work-dir/streams.zip from /opt/spark/work-dir/streams.zip to /opt/spark/work-dir/streams.zip
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/streams.zip
    at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixCopyFile.copy(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystemProvider.copy(Unknown Source)
    at java.base/java.nio.file.Files.copy(Unknown Source)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:441)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:429)
    at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$19(SparkSubmit.scala:459)

How do I debug/fix this?

Thanks in advance!


Solution

  • Copying the files to a different location, e.g. /opt/spark/custom-dir, resolved the issue. /opt/spark/work-dir is the working directory Spark uses, and the work-dir volume in the spec is mounted over it, so anything baked into the image at that path gets blanked out when the pod is created.
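
For reference, a minimal sketch of the change, assuming the application files are baked into the image under /opt/spark/custom-dir (the directory the Dockerfile above already creates) and the SparkApplication is repointed there; the file names are taken from the question, everything else is illustrative:

# Sketch, not the original Dockerfile: copy application files outside
# /opt/spark/work-dir, since that path is shadowed by the work-dir volume mount
RUN mkdir -p /opt/spark/custom-dir
COPY main.py streams.zip utils.zip /opt/spark/custom-dir/
COPY params.cfg params_password.cfg /opt/spark/custom-dir/
COPY kafka-certs/*.p12 /opt/spark/custom-dir/
COPY jars/* /opt/spark/custom-dir/
RUN chown -R spark:spark /opt/spark/custom-dir && chmod -R 755 /opt/spark/custom-dir

The deps section of the SparkApplication would then reference the new location (the jar under /opt/spark/jars is unaffected, since that path is not mounted over):

  deps:
    jars:
      - local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar
    pyFiles:
      # sketch: assumes the zips were copied to /opt/spark/custom-dir in the image
      - local:///opt/spark/custom-dir/streams.zip
      - local:///opt/spark/custom-dir/utils.zip

mainApplicationFile would similarly point at local:///opt/spark/custom-dir/main.py, and the init-check container's ls can be pointed at /opt/spark/custom-dir/ to confirm the files survive pod creation.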