Tags: jdbc, apache-spark-sql, hdfs, delta-lake, dbeaver

Setting up a DBeaver 25.0.1 connection to a Delta Lake v2.4 Parquet table on Hadoop 3.3.4 filesystem


I am trying to create a new connection from DBeaver to a Delta Lake Parquet table located on the HDFS filesystem, which I successfully created with a Spark/Hadoop/Scala/io.delta application.

(NOTE: I've read that this can be done without a Thrift server, with just a JDBC driver setup and no database name. If this is not true, please let me know.)

First I set up a driver via the DBeaver Driver Manager, adding the needed libraries:

DBeaver details : Version 25.0.1.202503240840

Using [Find Class], the driver class returned in the dialog screen is org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper from spark-sql_2.13-3.4.3.jar.

Further settings are:

Other cluster environment details are:

The error that I receive is:

Error in driver initialization 'delta-lake-driver'

Message from the dialog screen:

Can't create driver instance (class 'org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper').
Error creating driver 'delta-lake-driver' instance.
Most likely required jar files are missing.
You should configure jars in driver settings.
Reason: can't load driver class 'org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper'
    org/apache/spark/SparkSQLFeatureNotSupportedException
      org.apache.spark.SparkSQLFeatureNotSupportedException
      org.apache.spark.SparkSQLFeatureNotSupportedException

I haven't seen any examples for DBeaver with open-source Apache Spark. Currently, I have working connections with MySQL on all 3 nodes in my cluster. The JAR versions seem to be correct, since writing from the Apache Spark jobs works with these versions.

Has anyone tried something similar using open-source Apache Spark, Delta Lake and DBeaver?

I was expecting the driver manager to have all the needed JARs available. Creating the table and reading it back from a Spark job was very easy, just by extending the build.sbt and changing the write command to use "delta" as the format, so I thought it would be even easier to just connect to that table from DBeaver for some general querying.
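
For reference, the write side of my job was roughly along these lines (a simplified sketch; the HDFS path and object name here are illustrative, not the exact ones from my application):

    // build.sbt addition (sketch): libraryDependencies += "io.delta" %% "delta-core" % "2.4.0"
    import org.apache.spark.sql.SparkSession

    object WriteDeltaTable {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("write-delta-table")
          // Delta Lake session extension and catalog
          .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
          .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
          .getOrCreate()

        import spark.implicits._
        val df = Seq(0, 1, 2, 3, 4).toDF("id")

        // Write the DataFrame as a Delta table (illustrative path)
        df.write.format("delta").save("hdfs://sparkmaster:9000/delta-lake/example-table")
      }
    }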

I was expecting the library list mentioned above to be more than enough. There isn't more detail than just "Most likely required jar files are missing". I could add all the JARs that the dependency tree reports, but that list is rather huge.


Solution

  • Here is my solution for connecting DBeaver to Delta Lake tables and storing the related metadata (in MySQL).

    (My environment is a Spark standalone cluster with 4 worker nodes, using MySQL for the metastore, Hadoop 3.3.4, Spark 3.4.3, Scala 2.13 and Delta Lake 2.4.)

    For creating and accessing a Delta Lake table I have used the following setup:

    DBeaver connection details:
    JDBC URL : jdbc:hive2://sparkworker:10000/default;user=hduser;password=<password>

    Driver settings:
    [Libraries]
    "hive-jdbc-3.1.3-standalone.jar"
    "io.delta_delta-core_2.13-2.4.0.jar" <- removed, included at thriftserver start
    "io.delta-storage-2.4.0.jar"         <- removed, included at thriftserver start
    [Class Name] "org.apache.hive.jdbc.HiveDriver" (should be found when pressing the [Find Class] button)
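
    To sanity-check the same URL and driver outside DBeaver, a minimal JDBC client along these lines can be used (a sketch; it assumes the hive-jdbc standalone JAR above is on the classpath, and the host/user/password from the connection details):

        // Minimal Scala sketch: connect to the Spark Thrift Server with the
        // same Hive JDBC driver DBeaver uses, and list the databases.
        import java.sql.DriverManager

        object ThriftServerCheck {
          def main(args: Array[String]): Unit = {
            // Explicitly load the Hive JDBC driver class
            Class.forName("org.apache.hive.jdbc.HiveDriver")
            val url = "jdbc:hive2://sparkworker:10000/default"
            val conn = DriverManager.getConnection(url, "hduser", "<password>")
            try {
              val rs = conn.createStatement().executeQuery("SHOW DATABASES")
              while (rs.next()) println(rs.getString(1))
            } finally conn.close()
          }
        }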
    

    Prerequisites are a running Thrift server and a Hive metastore
    (I've used a MySQL database as the metastore, storing all the DDL, and I am running it on a Spark worker named sparkworker).
    I've also made the Delta core and storage JARs available at my Spark library location:

    /usr/local/spark/lib/io.delta_delta-core_2.13-2.4.0.jar
    /usr/local/spark/lib/io.delta-storage-2.4.0.jar
    

    Before starting the Thrift server I first run the Hive metastore command, and then start the Thrift server itself:

    hive --service metastore &
    
        start-thriftserver.sh \
          --master spark://sparkmaster:7077 \
          --conf spark.executor.cores=1 \
          --conf spark.executor.memory=1g \
          --conf spark.sql.thriftServer.numExecutors=1 \
          --conf hive.metastore.uris=thrift://sparkworker:9083 \
          --conf spark.sql.warehouse.dir=hdfs://sparkmaster:9000/spark-warehouse \
          --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
          --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
          --jars /home/hduser/share/io.delta_delta-core_2.13-2.4.0.jar
    

    Check whether everything looks fine in the log file:
    /usr/local/spark/logs/spark-hduser-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-sparkworker.out

    I was able to create and query the following Delta Lake tables from DBeaver:

        CREATE TABLE delta.`hdfs://sparkmaster:9000/delta-lake/dbeaver-delta-simple-tablename` 
        USING DELTA AS 
        SELECT col1 as id FROM VALUES (0), (1), (2), (3), (4)
        ;
    
        SELECT *
        FROM delta.`hdfs://sparkmaster:9000/delta-lake/dbeaver-delta-simple-tablename` 
        ;
    
        CREATE TABLE spark_created_database.dbeaver_delta_simple_tablename
        USING DELTA
        LOCATION 'hdfs://sparkmaster:9000/delta-lake/dbeaver-delta-simple-tablename'
        ;
        SELECT *
        FROM spark_created_database.dbeaver_delta_simple_tablename
        ;
    
        DESCRIBE FORMATTED spark_created_database.dbeaver_delta_simple_tablename
        ;
    

    In the second CREATE statement I have explicitly given a table name; this triggers the metastore to save the table metadata.
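
    The registered table can also be read back from a plain Spark job (a sketch; the metastore URI, Delta settings and paths follow the Thrift server start command above, and spark-hive must be on the classpath for Hive support):

        // Scala sketch: read the metastore-registered Delta table from a Spark job.
        import org.apache.spark.sql.SparkSession

        object ReadRegisteredDeltaTable {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .appName("read-registered-delta-table")
              .config("hive.metastore.uris", "thrift://sparkworker:9083")
              .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
              .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
              .enableHiveSupport()
              .getOrCreate()

            // By registered name, via the metastore ...
            spark.table("spark_created_database.dbeaver_delta_simple_tablename").show()

            // ... or directly by path, bypassing the metastore.
            spark.read.format("delta")
              .load("hdfs://sparkmaster:9000/delta-lake/dbeaver-delta-simple-tablename")
              .show()
          }
        }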

    (Screenshot: metastore at MySQL)

    (Screenshot: creating the tables, with and without a table name; metastore at MySQL example)

    Location at Hadoop:

    (Screenshot: table location on HDFS)

    While the Thrift server (job) is online, it can be monitored here:

    (Screenshot: Thrift server job found running)

    The queries executed from the external SQL clients (beeline, DBeaver, ...) will appear on the SQL / DataFrame tab page.

    (Screenshot: executed queries on the SQL / DataFrame tab page)

    It's a matter of using compatible versions, proper grants, a running Hive metastore, Thrift server and Spark setup, and making the needed JARs available per context (connection, Spark, ...).

    I assume that because I am using the default Spark resource manager with this Spark standalone cluster, any job submitted while the Thrift/JDBC/ODBC server is running will get status WAITING. As soon as stop-thriftserver.sh is executed, these waiting jobs start running. YARN or a more advanced resource manager should be able to solve this (a named/dedicated driver node, a Thrift server with just a few cores; currently all cores are taken by the Thrift server even though I set it to one executor core, having 12 threads...).
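
    A possible mitigation (a sketch based on my understanding of Spark standalone scheduling, not something I have verified on this cluster) is to cap the total cores an application can claim with spark.cores.max, since spark.executor.cores only limits cores per executor, not the application total:

        // Sketch: capping the cores a Spark application takes on a standalone cluster.
        // The same setting could presumably be passed to start-thriftserver.sh
        // as --conf spark.cores.max=2 (untested assumption on my side).
        import org.apache.spark.sql.SparkSession

        object CappedCoresApp {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .master("spark://sparkmaster:7077")
              .appName("capped-cores-app")
              .config("spark.cores.max", "2")      // total cores across the cluster for this app
              .config("spark.executor.cores", "1") // cores per executor only
              .config("spark.executor.memory", "1g")
              .getOrCreate()

            spark.range(5).show() // trivial action just to run something
            spark.stop()
          }
        }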

    Nevertheless, I am able to create and query Delta Lake tables from DBeaver :-)