I need help deploying my custom Docker image onto a Databricks cluster, specifically knowing whether I need to match the exact /databricks folder structure, or do something else, to avoid this error: java.lang.SecurityException: SHA1 digest error for META-INF/services/java.sql.Driver.
The last call in the com.databricks namespace in my stack trace is com.databricks.backend.daemon.driver.DriverDaemon$.preloadJdbcDrivers(DriverDaemon.scala:943). My Dockerfile is quoted at the end.
I've spent some time creating a Docker image for developing with PySpark/Delta locally across our team's Macs and Windows machines, and it works. I followed the exact specification of Databricks Runtime 16.3.
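For reference, this is roughly how I sanity-check the image against the Runtime 16.3 system environment locally (my-dbr163-dev is a placeholder for whatever the image is tagged as):

docker run --rm my-dbr163-dev python3 --version   # expect Python 3.12.x
docker run --rm my-dbr163-dev java -version       # expect the Zulu build that is first on the PATH
docker run --rm my-dbr163-dev python3 -c "import pyspark; print(pyspark.__version__)"   # expect 3.5.2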
My next step was to have that same custom image be the environment our code runs in on Databricks as well, by adding it to a general-purpose compute cluster's configuration under Advanced options > Docker. When I start this cluster, I get the following error:
INFO DriverDaemon$: Skipping class preloading because it is not enabled in conf
INFO log: Logging initialized @21122ms to shaded.v9_4.org.eclipse.jetty.util.log.Slf4jLog
INFO DriverDaemon$: Sent a notification to chauffeur about startup exception, took 6031
ERROR DriverDaemon$: XXX Fatal uncaught exception. Terminating driver.
java.lang.SecurityException: SHA1 digest error for META-INF/services/java.sql.Driver
when the DriverDaemon is invoked through the command:
/opt/java8/bin/java [LOTS OF OPTIONS] \
-cp /databricks/hadoop-safety-jars/*:/databricks/spark/dbconf/jets3t/:/databricks/spark/dbconf/log4j/driver:/databricks/hive/conf:/databricks/spark/dbconf/hadoop:/databricks/jars/* \
com.databricks.backend.daemon.driver.DriverDaemon
My image, of course, installs neither Spark nor Hadoop under /databricks. Since Databricks discards whatever CMD or ENTRYPOINT I set in the Dockerfile, I assumed it would either pick up my Spark install regardless of location (as long as it's on the PATH) or inject its own Spark into /databricks. I couldn't find docs on what happens at cluster startup, and if my custom image has to follow this draconian folder structure, why do the Databricks Docker docs even offer building your own image as an alternative?
I tried inspecting the /databricks folder from within a cluster with no Docker on it, just to see which JARs need that SHA check, and these are the JARs that contain a META-INF/services/java.sql.Driver entry (see the sketch after the list for how I searched):
/databricks/jars/----ws_3_5--mvn--hadoop3--org.postgresql--postgresql--org.postgresql__postgresql__42.6.0.jar
/databricks/jars/----ws_3_5--third_party--mssql-jdbc--mssql-jdbc--789028999--com.microsoft.sqlserver__mssql-jdbc__11.2.3.jre8.jar
/databricks/jars/----ws_3_5--third_party--mssql--mssql-hive-2.3__hadoop-3.2_2.12--1153968230--com.microsoft.sqlserver__mssql-jdbc__11.2.2.jre8.jar
/databricks/jars/----ws_3_5--mvn--hadoop3--org.apache.derby--derby--org.apache.derby__derby__10.14.2.0.jar
/databricks/jars/----ws_3_5--mvn--hadoop3--org.apache.hive--hive-jdbc--org.apache.hive__hive-jdbc__2.3.9.jar
/databricks/jars/----ws_3_5--third_party--snowflake-jdbc--net.snowflake__snowflake-jdbc__shaded---414110472--net.snowflake__snowflake-jdbc__3.16.1.jar
/databricks/jars/----ws_3_5--mvn--hadoop3--org.xerial--sqlite-jdbc--org.xerial__sqlite-jdbc__3.42.0.0.jar
/databricks/jars/----ws_3_5--third_party--mariadb-java-client--org.mariadb.jdbc__mariadb-java-client__2.7.9.jar
/databricks/jars/----ws_3_5--third_party--bigquery-jdbc--bigquery-driver-shaded---846918551--GoogleBigQueryJDBC42.jar
/databricks/jars/----ws_3_5--third_party--spark-jdbc--databricks-jdbc-driver-shaded---1191706110--DatabricksJDBC.jar
/databricks/jars/spark-excel_2.12-3.5.0_0.20.3.jar
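This is roughly the loop I used from a %sh notebook cell on that cluster to produce the list (assuming unzip is available on the node, which it was in my case):

for jar in /databricks/jars/*.jar; do
  # keep only the JARs that ship a JDBC driver service file
  if unzip -l "$jar" 2>/dev/null | grep -q 'META-INF/services/java.sql.Driver'; then
    echo "$jar"
  fi
done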
My Dockerfile is this:
FROM ubuntu:noble-20241015
ARG DEBIAN_FRONTEND=noninteractive
# Install Python 3.12 and the system utilities/build dependencies we need,
# some of which are fetched over https and hence require ca-certificates;
# symlink python into /usr/local/bin for convenience (it is on the PATH);
# then delete the apt cache at /var/cache/apt/archives to reduce image size,
# and also get rid of the apt lists at /var/lib/apt/lists/* for the same reason.
RUN apt-get update && \
apt-get install -y --no-install-recommends \
python3.12 python3.12-dev \
ca-certificates bash iproute2 coreutils procps sudo curl \
build-essential pkg-config cmake \
# dbus-python==1.3.2 dependency:
dbus libdbus-glib-1-dev \
# psycopg2==2.9.3 dependency:
libpq-dev \
# PyGObject==3.48.2 dependency, when installing pycairo==1.28.0:
libcairo2-dev gobject-introspection libgirepository1.0-dev && \
ln -s /usr/bin/python3.12 /usr/local/bin/python && \
ln -s /usr/bin/python3.12 /usr/local/bin/python3 && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# https://cdn.azul.com/zulu/bin/zulu8.33.0.1-ca-jdk8.0.192-linux_x64.tar.gz
# https://cdn.azul.com/zulu-embedded/bin/zulu8.33.0.135-jdk1.8.0_192-linux_aarch64.tar.gz
# https://cdn.azul.com/zulu/bin/zulu17.54.21-ca-jdk17.0.13-linux_x64.tar.gz
# https://cdn.azul.com/zulu/bin/zulu17.54.21-ca-jdk17.0.13-linux_aarch64.tar.gz
# Multi-architecture build for Java 8 and 17;
ARG TARGETARCH
RUN set -eux; \
case "${TARGETARCH}" in \
amd64) \
ARCH=x64; \
Z8_ID="zulu8.33.0.1-ca-jdk8.0.192"; \
EMBEDDED=""; \
Z17_ID="zulu17.54.21-ca-jdk17.0.13"; \
;; \
arm64) \
ARCH=aarch64; \
Z8_ID="zulu8.33.0.135-jdk1.8.0_192"; \
EMBEDDED="-embedded"; \
Z17_ID="zulu17.54.21-ca-jdk17.0.13"; \
;; \
*) \
echo "Unsupported arch ${TARGETARCH}"; exit 1;; \
esac && \
for keyval in "${Z8_ID}:java8" "${Z17_ID}:java17"; do \
IFS=':'; set -- $keyval; IFS=' '; \
BUILD_ID=$1; DEST=$2; \
URL="https://cdn.azul.com/zulu${EMBEDDED}/bin/${BUILD_ID}-linux_${ARCH}.tar.gz"; \
EMBEDDED=""; \
curl -fsSL -o /tmp/jdk.tgz "$URL"; \
mkdir -p /opt/${DEST}; \
tar -xzf /tmp/jdk.tgz --strip-components=1 -C /opt/${DEST}; \
rm -f /tmp/jdk.tgz*; \
done
# Register Java 8 (priority 200) and Java 17 (priority 100) as alternatives,
# including all matching slave links so every tool stays in sync.
RUN update-alternatives --install /usr/bin/java java /opt/java8/bin/java 200 \
--slave /usr/bin/javac javac /opt/java8/bin/javac \
--slave /usr/bin/jar jar /opt/java8/bin/jar \
--slave /usr/bin/javadoc javadoc /opt/java8/bin/javadoc \
--slave /usr/bin/jcmd jcmd /opt/java8/bin/jcmd \
--slave /usr/bin/jmap jmap /opt/java8/bin/jmap \
--slave /usr/bin/jps jps /opt/java8/bin/jps \
--slave /usr/bin/jstack jstack /opt/java8/bin/jstack \
--slave /usr/bin/keytool keytool /opt/java8/bin/keytool && \
update-alternatives --install /usr/bin/java java /opt/java17/bin/java 100 \
--slave /usr/bin/javac javac /opt/java17/bin/javac \
--slave /usr/bin/jar jar /opt/java17/bin/jar \
--slave /usr/bin/javadoc javadoc /opt/java17/bin/javadoc \
--slave /usr/bin/jshell jshell /opt/java17/bin/jshell \
--slave /usr/bin/jcmd jcmd /opt/java17/bin/jcmd \
--slave /usr/bin/jmap jmap /opt/java17/bin/jmap \
--slave /usr/bin/jps jps /opt/java17/bin/jps \
--slave /usr/bin/jstack jstack /opt/java17/bin/jstack \
--slave /usr/bin/keytool keytool /opt/java17/bin/keytool
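# Make Java 8 the default JVM by exporting JAVA_HOME and putting its bin directory first on the PATH: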
ENV JAVA_HOME=/opt/java8
ENV PATH="${JAVA_HOME}/bin:${PATH}"
# Make sure the pip we use is the one specified by Runtime 16.3 (24.2);
# bypass the system pip's PEP 668 restriction by not installing python3-pip through apt-get,
# and instead using the get-pip.py script here:
RUN set -eux; \
curl -fsSL https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py && \
python3 /tmp/get-pip.py pip==24.2.0 --break-system-packages --ignore-installed && \
rm -f /tmp/get-pip.py && \
pip --version
# Install all the required packages for the Runtime 16.3;
COPY requirements.txt /usr/local/venvs/requirements.txt
RUN python3 -m pip install --break-system-packages --ignore-installed -r /usr/local/venvs/requirements.txt
# https://archive.apache.org/dist/spark/spark-${SPARK_VER}/spark-${SPARK_VER}-bin-without-hadoop.tgz
# https://archive.apache.org/dist/spark/spark-3.5.2/spark-3.5.2-bin-without-hadoop.tgz
# Download Spark 3.5.2 WITHOUT Hadoop, verify its checksum to guard against tampering,
# and install PySpark into the local Python environment;
# • downloads & verifies the .sha512 checksum
# • unpacks under /opt/spark-<ver> -> /opt/spark symlink
# • removes tar & checksum in-layer to keep the image slim
ARG SPARK_VER=3.5.2
RUN set -eux; \
cd /opt; \
URL="https://archive.apache.org/dist/spark/spark-${SPARK_VER}/spark-${SPARK_VER}-bin-without-hadoop.tgz"; \
curl -fsSL -O "$URL"; \
curl -fsSL -O "${URL}.sha512"; \
sha512sum --check "spark-${SPARK_VER}-bin-without-hadoop.tgz.sha512"; \
tar -xzf "spark-${SPARK_VER}-bin-without-hadoop.tgz"; \
mv spark-${SPARK_VER}-bin-without-hadoop spark-${SPARK_VER}; \
ln -s spark-${SPARK_VER} spark; \
rm -f /opt/*.tgz* && \
# build a PySpark sdist & install it (matching the JVM bits)
cd /opt/spark/python; \
python3 setup.py -q sdist && \
python3 -m pip install --no-cache-dir --break-system-packages dist/pyspark-${SPARK_VER}.tar.gz && \
rm -rf build dist
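# Spark environment variables: SPARK_HOME for the symlinked install, SPARK_JAVA_HOME meant to point Spark at Java 17: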
ENV SPARK_HOME=/opt/spark
ENV SPARK_JAVA_HOME=/opt/java17
# Multi-architecture build for Hadoop 3.3.6,
# and setting environment variables so we can use Hadoop:
# • downloads & verifies the .sha512 checksum
# • unpacks under /opt/hadoop-<ver> -> /opt/hadoop symlink
# • removes tar & checksum in-layer to keep the image slim
ARG HADOOP_VER=3.3.6
RUN set -eux && \
cd /opt && \
case "$TARGETARCH" in \
amd64) \
FILE="hadoop-${HADOOP_VER}.tar.gz" ;; \
arm64) \
FILE="hadoop-${HADOOP_VER}-aarch64.tar.gz" ;; \
*) \
echo "Unsupported arch $TARGETARCH" >&2; exit 1 ;; \
esac && \
URL="https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VER}/${FILE}" && \
curl -fsSL --http1.1 -O "$URL" && \
curl -fsSL "${URL}.sha512" \
| sed "s|hadoop-${HADOOP_VER}-RC1\.tar\.gz|${FILE}|" \
> "${FILE}.sha512" && \
sha512sum --check "$FILE.sha512" && \
tar -xzf "$FILE" && \
ln -s "hadoop-${HADOOP_VER}" hadoop && \
rm -f "/opt/${FILE}" "/opt/${FILE}.sha512"
ENV HADOOP_HOME=/opt/hadoop
ENV PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
# We run 'hadoop classpath' in a shell to capture its output,
# and set it in the shell script to be read at Spark runtime;
# Alongside the classpath, we specify the use of optional jars for Azure connection;
# This also prepares the Spark environment to use Delta Lake:
RUN set -eux && \
SPARK_DIST_CP="$(hadoop classpath)" && \
echo "export SPARK_DIST_CLASSPATH=\"${SPARK_DIST_CP}:${HADOOP_HOME}/share/hadoop/tools/lib/*\"" \
>> /opt/spark/conf/spark-env.sh && \
echo 'spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension' \
>> /opt/spark/conf/spark-defaults.conf && \
echo 'spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog' \
>> /opt/spark/conf/spark-defaults.conf
# We also set the env variable to read the Spark config we just set:
ENV SPARK_CONF_DIR=/opt/spark/conf
# Download Delta Lake JARs compatible with Spark 3.5.2:
RUN set -eux && \
cd /opt/spark/jars && \
curl -fsSL -O https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.3.0/delta-spark_2.12-3.3.0.jar && \
curl -fsSL -O https://repo1.maven.org/maven2/io/delta/delta-storage/3.3.0/delta-storage-3.3.0.jar
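For completeness, this is the kind of smoke test I run against the finished image to confirm the local Delta setup works end to end (again, my-dbr163-dev is a placeholder tag):

docker run --rm my-dbr163-dev python3 -c "
from pyspark.sql import SparkSession
# spark-defaults.conf baked into the image already registers the Delta extension and catalog
spark = SparkSession.builder.master('local[1]').getOrCreate()
spark.range(3).write.format('delta').mode('overwrite').save('/tmp/delta-smoke')
print(spark.read.format('delta').load('/tmp/delta-smoke').count())
"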
Hopefully this question doesn't fall under the following caveat in the guide to asking good questions:
Questions on professional server, networking, or related infrastructure administration are off-topic for Stack Overflow unless they directly involve programming or programming tools.
It seems the issue lies with the Java 8 version used: the u192 build, which matches the stated minimum requirement, cannot deal with the longer SHA digests on these JARs. I switched to u382 and the verification step passed.
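For anyone landing here with the same error, the change on the image side is just swapping the Java 8 build ID in the case statement of the Dockerfile above. The Zulu identifier for 8u382 below is what I believe the Azul CDN uses, so double-check it against their download listing before relying on it:

amd64) \
ARCH=x64; \
# 8u382 instead of 8u192; confirm this exact build ID on cdn.azul.com
Z8_ID="zulu8.72.0.17-ca-jdk8.0.382"; \
EMBEDDED=""; \
Z17_ID="zulu17.54.21-ca-jdk17.0.13"; \
;; \

The arm64 branch needs the matching aarch64 build of 8u382, which (unlike 8u192) should live under the plain zulu path rather than zulu-embedded; again, verify on the CDN.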