jdbc, apache-spark-sql, hdfs, delta-lake, dbeaver

Setting up a DBeaver 25.0.1 connection to a Delta Lake v2.4 Parquet table on Hadoop 3.3.4 filesystem


I am trying to create a new connection from DBeaver to a Delta Lake Parquet table on the HDFS filesystem; I successfully created the table with a Spark/Hadoop/Scala/io.delta application.

(NOTE: I've read that this can be done without a Thrift server, using just a JDBC driver setup and no database name. If that is not true, please let me know.)

First, I set up a driver via the DBeaver driver manager, adding the needed libraries:

DBeaver details: Version 25.0.1.202503240840

Using "Find Class", the driver class is returned in the dialog screen: org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper, from spark-sql_2.13-3.4.3.jar.

Further settings are:

Other cluster environment details are:

The error that I receive is:

Error in driver initialization 'delta-lake-driver'

Message from the dialog screen:

Can't create driver instance (class 'org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper').
Error creating driver 'delta-lake-driver' instance.
Most likely required jar files are missing.
You should configure jars in driver settings.
Reason: can't load driver class 'org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper'
    org/apache/spark/SparkSQLFeatureNotSupportedException
      org.apache.spark.SparkSQLFeatureNotSupportedException

I haven't seen any examples of DBeaver with open-source Apache Spark. Currently, I have working connections to MySQL on all 3 nodes in my cluster. The JAR versions seem to be correct, since writing from the Apache Spark jobs works with these versions.

Has anyone tried something similar using open-source Apache Spark, Delta Lake and DBeaver?

I was expecting that the driver manager had all the needed JARs available. Creating the table and reading it back from a Spark job was very easy: just extend build.sbt and use "delta" as the format in the write/read command (a sketch of that job is below). So I thought it would be even easier to just connect to that table with DBeaver for some general querying.

I was expecting the library list mentioned above to be more than enough. There isn't more detail than just "Most likely required jar files are missing". I could add all the JARs that the dependency tree reports, but that list is rather huge.
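
For context, here is a minimal sketch of the kind of Spark job that writes and reads such a table; the dependency coordinates, HDFS path and column name below are illustrative placeholders, not my actual setup:

    // build.sbt addition (the Delta 2.4 line targets Spark 3.4):
    //   libraryDependencies += "io.delta" %% "delta-core" % "2.4.0"
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("delta-write-read")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // Write a small DataFrame as a Delta table on HDFS (placeholder path)
    spark.range(0, 100).toDF("id")
      .write.format("delta").mode("overwrite")
      .save("hdfs:///data/example_delta_table")

    // Read it back within the same Spark application
    spark.read.format("delta")
      .load("hdfs:///data/example_delta_table")
      .show()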


Solution

  • I will go ahead with the Thrift server setup.

    ...making it possible for external SQL clients (like DBeaver) to connect to Spark SQL (which is itself the SQL parser and engine, as databases have). The Thrift server then acts as a communication gateway (multiple clients/sessions) to the Spark layer. Spark SQL does not provide that on its own, because Spark jobs are not external clients but run directly within that environment, unlike external SQL clients. A sketch of that setup follows below.

    tx
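
    As a sketch of that setup (host, port, paths and view name here are placeholders, and this is only one way to do it): the Thrift server can also be started from inside a Spark application with HiveThriftServer2.startWithContext, so the Delta table can be registered as a view in the same session that DBeaver will query.

        // Sketch: expose a Delta table on HDFS to JDBC clients via the Spark Thrift Server.
        // Requires spark-hive-thriftserver (and the Delta jars) on the classpath.
        import org.apache.spark.sql.SparkSession
        import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

        val spark = SparkSession.builder()
          .appName("delta-thrift-gateway")
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
          .enableHiveSupport()   // typical for the Thrift/HiveServer2 setup
          .getOrCreate()

        // Register the existing Delta table (placeholder path) as a view for JDBC sessions
        spark.read.format("delta")
          .load("hdfs:///data/example_delta_table")
          .createOrReplaceTempView("example_delta_table")

        // Start the HiveServer2-compatible Thrift endpoint (default port 10000)
        HiveThriftServer2.startWithContext(spark.sqlContext)

    The standalone alternative is the $SPARK_HOME/sbin/start-thriftserver.sh script. Either way, DBeaver then connects with its built-in Apache Hive/Spark driver to a URL like jdbc:hive2://<host>:10000, instead of loading the Spark jars as a JDBC driver directly.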