scala apache-spark datasource data-processing griddb

How can I use GridDB with Apache Spark?


I have a large amount of data stored in GridDB and want to process it using Apache Spark. However, I'm unsure how to connect GridDB to Spark or use GridDB as a data source.

Here's what I have so far:

val spark = SparkSession.builder().appName("GridDB-Spark").getOrCreate()
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/my_container")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "my_table")
  .option("user", "my_username")
  .option("password", "my_password")
  .load()

This code connects to a Postgres database, but I need to modify it to work with GridDB instead. I am struggling with the following points:

  1. What do I need to connect to my GridDB database and use it as a data source in Spark?
  2. Are there any best practices or recommendations for using GridDB with Spark?

Solution

    1. Download the GridDB JDBC driver from the URL below; choose the version that matches your GridDB server: https://central.sonatype.com/artifact/com.github.griddb/gridstore-jdbc/5.1.0/versions

    2. Copy the jar file into the $SPARK_HOME/jars directory
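A sketch of this step (the jar filename is an assumption; use the version you actually downloaded in step 1). As an alternative to copying into $SPARK_HOME/jars, the driver can also be attached per job with spark-submit's --jars flag:

```shell
# Put the GridDB JDBC driver on Spark's classpath
# (filename assumed; adjust to the version you downloaded).
cp gridstore-jdbc-5.1.0.jar "$SPARK_HOME/jars/"

# Or, instead of copying, attach it per job:
# spark-submit --jars gridstore-jdbc-5.1.0.jar spark-griddb.py
```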

    3. Prepare a local GridDB environment. This step is optional; you can use your own environment instead:

    docker run -d --network="host" griddb/griddb
    docker run --network="host" griddb/jdbc
    
    4. Prepare a spark-griddb.py file that queries the data over JDBC (I'm using Python, but the Scala version is very similar). If you are using your own environment, replace the IP address (127.0.0.1), port (20001), cluster name (dockerGridDB), table name (SampleJDBC_Select), username (admin), and password (admin) accordingly:
    import pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession \
            .builder \
            .appName("GriddbJDBC") \
            .getOrCreate()
    
    # GridDB JDBC URL form: jdbc:gs:///<cluster>/<database>?notificationMember=<host>:<port>
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001") \
        .option("driver", "com.toshiba.mwcloud.gs.sql.Driver") \
        .option("query", "select * from SampleJDBC_Select") \
        .option("user", "admin") \
        .option("password", "admin") \
        .load()
    
    df.show()
    
    5. Test it
    # spark-submit spark-griddb.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    23/03/14 17:28:09 INFO SparkContext: Running Spark version 3.2.1
    23/03/14 17:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
    23/03/14 17:28:10 INFO ResourceUtils: No custom resources configured for spark.driver.
    23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
    23/03/14 17:28:10 INFO SparkContext: Submitted application: GriddbJDBC
    .
    .
    .
    23/03/14 17:28:14 INFO CodeGenerator: Code generated in 12.26218 ms
    +---+-----+
    | id|value|
    +---+-----+
    |  0|test0|
    |  1|test1|
    |  2|test2|
    |  3|test3|
    |  4|test4|
    +---+-----+
    .
    .
    .
    23/03/14 17:28:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9c34d37-861a-472b-a111-cc86c0d73747
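On your second question: for large tables, a common best practice is a partitioned JDBC read, so the scan is split across parallel tasks instead of funneling everything through a single connection. Below is a minimal sketch using Spark's standard JDBC partitioning options; the partition column ("id") and its bounds are assumptions about your table and must match your actual data:

```python
# Partitioned JDBC read options for a large GridDB table.
# Note: with partitioning you must use "dbtable" (the "query" option
# cannot be combined with partitionColumn).
jdbc_options = {
    "url": "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001",
    "driver": "com.toshiba.mwcloud.gs.sql.Driver",
    "dbtable": "SampleJDBC_Select",
    "user": "admin",
    "password": "admin",
    "partitionColumn": "id",   # numeric/date column the scan is split on (assumed)
    "lowerBound": "0",         # minimum value of the partition column (assumed)
    "upperBound": "1000000",   # maximum value of the partition column (assumed)
    "numPartitions": "8",      # number of parallel read tasks
}

# With a SparkSession in scope, the read itself would be:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Rows outside [lowerBound, upperBound] are still read (the bounds only decide how the range is sliced into partitions), so stale bounds skew partition sizes rather than drop data. Also prefer pushing filters down in SQL (via the "query" option for non-partitioned reads, or a WHERE-filtered view) so GridDB, not Spark, discards unneeded rows.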