I have Apache Spark deployed on Kubernetes (GKE), and I've built a Docker image with the required files copied to /opt/spark/work-dir.
When I run the image locally and log in, I can see the files.
However, when the image is deployed, the contents of /opt/spark/work-dir are missing.
Here are the details:
Dockerfile
# Use an official Apache Spark base image
FROM apache/spark:3.5.0
# Switch to root user to install additional dependencies
USER root
# Set the working directory
WORKDIR /opt/spark/work-dir
# Install necessary tools
# Set environment variables
ENV SPARK_HOME /opt/spark
ENV PATH $PATH:$SPARK_HOME/bin
RUN mkdir -p /opt/spark/custom-dir
# Copy your application files
COPY main.py .
COPY streams.zip .
COPY utils.zip .
COPY gcs-connector-hadoop3-latest.jar /opt/spark/jars/
# Copy configuration files
COPY log4j-driver.properties /opt/spark/conf/
COPY log4j-executor.properties /opt/spark/conf/
COPY params.cfg .
COPY params_password.cfg .
COPY kafka-certs/*.p12 .
COPY jars/* .
# Set correct permissions
RUN chown -R spark:spark /opt/spark/work-dir /opt/spark/custom-dir /opt/spark/conf /opt/spark/jars && \
chmod -R 755 /opt/spark/work-dir /opt/spark/conf /opt/spark/jars /opt/spark/custom-dir
# Switch back to spark user
USER spark
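For completeness, the image is built for linux/amd64 and pushed to Artifact Registry with the tag used in the run/deploy steps below; the exact build/push commands are my assumption:

# assumed build/push commands (platform matches the local docker run below)
docker build --platform linux/amd64 -t us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2 .
docker push us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2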
Logging in to the Docker image to check the files:
(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ docker run -it --entrypoint /bin/bash --platform linux/amd64 us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2
spark@7c5a91a4c77c:/opt/spark/work-dir$ pwd
/opt/spark/work-dir
spark@7c5a91a4c77c:/opt/spark/work-dir$ ls -lrt
total 3112
-rwxr-xr-x 1 spark spark 206041 Sep 24 22:21 spark-avro_2.12-3.5.0.jar
-rwxr-xr-x 1 spark spark 17299 Sep 24 22:27 main.py
-rwxr-xr-x 1 spark spark 2954 Sep 24 22:30 alarm-compression-user-test.p12
-rwxr-xr-x 1 spark spark 2908 Sep 24 22:30 alarmblock-user-test.p12
-rwxr-xr-x 1 spark spark 2902 Sep 24 22:30 anomaly-test-user.p12
-rwxr-xr-x 1 spark spark 2926 Sep 24 22:30 appstat-agg-user-test.p12
-rwxr-xr-x 1 spark spark 2934 Sep 24 22:30 appstat-anomaly-user-test.p12
-rwxr-xr-x 1 spark spark 2904 Sep 24 22:30 appstats-user-test.p12
-rwxr-xr-x 1 spark spark 2904 Sep 24 22:30 insights-user-test.p12
-rwxr-xr-x 1 spark spark 2904 Sep 24 22:30 intfutil-user-test.p12
-rwxr-xr-x 1 spark spark 2900 Sep 24 22:30 issues-test-user.p12
-rwxr-xr-x 1 spark spark 2952 Sep 24 22:30 versa-alarmblock-test-user.p12
-rwxr-xr-x 1 spark spark 2930 Sep 24 22:30 versa-appstat-test-user.p12
-rwxr-xr-x 1 spark spark 2934 Sep 24 22:30 versa-bandwidth-test-user.p12
-rwxr-xr-x 1 spark spark 1702 Sep 24 22:30 vkp-test-tf-ca.p12
-rwxr-xr-x 1 spark spark 1702 Sep 24 22:32 versa-kafka-poc-tf-ca.p12
-rwxr-xr-x 1 spark spark 2904 Sep 24 22:34 syslog-vani-prefix.p12
-rwxr-xr-x 1 spark spark 6313 Sep 24 22:36 params.cfg
-rwxr-xr-x 1 spark spark 2550 Sep 24 22:37 params_password.cfg
-rwxr-xr-x 1 spark spark 69057 Sep 25 19:16 streams.zip
-rwxr-xr-x 1 spark spark 7064 Sep 25 19:16 utils.zip
-rwxr-xr-x 1 spark spark 553254 Sep 25 22:04 mongo-spark-connector_2.12-3.0.2.jar
-rwxr-xr-x 1 spark spark 491942 Sep 25 22:04 bson-4.0.5.jar
-rwxr-xr-x 1 spark spark 136522 Sep 25 22:04 mongodb-driver-sync-4.0.5.jar
-rwxr-xr-x 1 spark spark 1613403 Sep 25 22:05 mongodb-driver-core-4.0.5.jar
YAML for deploying the image:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: structured-streaming-main-{{ now | unixEpoch }}
  namespace: {{ .Values.namespace }}
spec:
  type: Python
  mode: cluster
  image: "us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2"
  imagePullPolicy: Always
  imagePullSecrets:
    - {{ .Values.imagePullSecret }}
  mainApplicationFile: "{{ .Values.mainApplicationFile }}"
  sparkVersion: "{{ .Values.sparkVersion }}"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: {{ .Values.driver.cores }}
    coreLimit: "{{ .Values.driver.coreLimit }}"
    memory: "{{ .Values.driver.memory }}"
    labels:
      version: "{{ .Values.sparkVersion }}"
    serviceAccount: spark
    volumeMounts:
      - name: gcs-key
        mountPath: /etc/secrets
        readOnly: true
      - name: work-dir
        mountPath: /opt/spark/work-dir
    initContainers:
      - name: init-check
        image: "us-east1-docker.pkg.dev/versa-kafka-poc/spark-job-repo/ss-main-vkp:0.0.2"
        command: ["sh", "-c", "ls -lrt /opt/spark/work-dir/"]
        volumeMounts:
          - name: work-dir
            mountPath: /opt/spark/work-dir
  executor:
    cores: {{ .Values.executor.cores }}
    instances: {{ .Values.executor.instances }}
    memory: "{{ .Values.executor.memory }}"
    labels:
      version: "{{ .Values.sparkVersion }}"
    volumeMounts:
      - name: gcs-key
        mountPath: /etc/secrets
        readOnly: true
      - name: work-dir
        mountPath: /opt/spark/work-dir
  volumes:
    - name: gcs-key
      secret:
        secretName: {{ .Values.gcsKeySecret }}
    # - name: custom-dir
    - name: work-dir
  deps:
    jars:
      - local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar
    pyFiles:
      - local:///opt/spark/work-dir/streams.zip
      - local:///opt/spark/work-dir/utils.zip
  pythonVersion: "{{ .Values.pythonVersion }}"
  sparkConf:
    "spark.kubernetes.namespace": "spark-operator"
    "spark.kubernetes.authenticate.driver.serviceAccountName": "spark"
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "spark.hadoop.fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "spark.hadoop.google.cloud.auth.service.account.enable": "true"
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/etc/secrets/spark-gcs-key.json"
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "{{ .Values.sparkEventLogDir }}"
    "spark.hadoop.fs.gs.auth.type": "SERVICE_ACCOUNT_JSON_KEYFILE"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/etc/secrets/spark-gcs-key.json"
Error I get:
(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ kc logs -f structured-streaming-main-1727333392-driver -n spark-operator -c init-check
total 0
(base) Karans-MacBook-Pro:spark-k8s-operator karanalang$ kc logs -f svc/structured-streaming-main-1727333392-ui-svc -n spark-operator
Files local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar from /opt/spark/jars/gcs-connector-hadoop3-latest.jar to /opt/spark/work-dir/gcs-connector-hadoop3-latest.jar
Files local:///opt/spark/work-dir/streams.zip from /opt/spark/work-dir/streams.zip to /opt/spark/work-dir/streams.zip
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/streams.zip
at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
at java.base/sun.nio.fs.UnixCopyFile.copy(Unknown Source)
at java.base/sun.nio.fs.UnixFileSystemProvider.copy(Unknown Source)
at java.base/java.nio.file.Files.copy(Unknown Source)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$14(SparkSubmit.scala:441)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.deploy.SparkSubmit.downloadResourcesToCurrentDirectory$1(SparkSubmit.scala:429)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$19(SparkSubmit.scala:459)
How do I debug/fix this?
tia!
Update: copying the files to a different location, e.g. /opt/spark/custom-dir, resolved the issue. /opt/spark/work-dir is the working directory Spark uses (and the work-dir volume is mounted over it), so it gets blanked out when the pod is created and the files baked into the image at that path are hidden.
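For anyone hitting the same thing, here is a minimal sketch of the change; everything not shown stays as above, and the literal mainApplicationFile path is an assumption since mine comes from Helm values:

# Dockerfile: bake the application files into a directory that is NOT Spark's working dir
RUN mkdir -p /opt/spark/custom-dir
COPY main.py streams.zip utils.zip params.cfg params_password.cfg /opt/spark/custom-dir/
COPY kafka-certs/*.p12 /opt/spark/custom-dir/
COPY jars/* /opt/spark/custom-dir/
RUN chown -R spark:spark /opt/spark/custom-dir && chmod -R 755 /opt/spark/custom-dir

# SparkApplication: reference the new location; spark-submit then copies these
# resources from /opt/spark/custom-dir into the (emptied) work dir at startup
  mainApplicationFile: "local:///opt/spark/custom-dir/main.py"
  deps:
    jars:
      - local:///opt/spark/jars/gcs-connector-hadoop3-latest.jar
    pyFiles:
      - local:///opt/spark/custom-dir/streams.zip
      - local:///opt/spark/custom-dir/utils.zip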