I have a large amount of data stored in GridDB and want to process it using Apache Spark. However, I'm unsure how to connect GridDB to Spark or use GridDB as a data source.
Here's what I have so far:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GridDB-Spark").getOrCreate()
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/my_container")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "my_table")
  .option("user", "my_username")
  .option("password", "my_password")
  .load()
This code connects to a PostgreSQL database, but I need to modify it so that Spark reads from GridDB instead. How can I use GridDB as a JDBC data source in Spark?
You can use GridDB as a Spark data source through GridDB's JDBC driver. The setup steps are:
1. Download the GridDB JDBC driver (gridstore-jdbc) from Maven Central, choosing the version that matches your GridDB server: https://central.sonatype.com/artifact/com.github.griddb/gridstore-jdbc/5.1.0/versions
2. Copy the jar file into the $SPARK_HOME/jars directory (alternatively, the driver can be supplied when the Spark session is created; see the sketch after these steps).
3. Prepare a local GridDB environment. This step is optional; you can use your own environment instead:
docker run -d --network="host" griddb/griddb
docker run --network="host" griddb/jdbc
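If you would rather not copy the jar into $SPARK_HOME/jars, Spark can also be told where the driver lives when the session is built. This is only a sketch: the Maven coordinate below is taken from the download link above (com.github.griddb:gridstore-jdbc:5.1.0), and the local path in the commented-out variant is a placeholder, not a path from this setup.

from pyspark.sql import SparkSession

# Let Spark resolve the GridDB JDBC driver from Maven Central at startup;
# adjust the version to match your GridDB server.
spark = SparkSession.builder \
    .appName("GriddbJDBC") \
    .config("spark.jars.packages", "com.github.griddb:gridstore-jdbc:5.1.0") \
    .getOrCreate()

# Or point Spark at a jar you downloaded yourself (placeholder path):
# spark = SparkSession.builder \
#     .appName("GriddbJDBC") \
#     .config("spark.jars", "/path/to/gridstore-jdbc-5.1.0.jar") \
#     .getOrCreate()

Either way, the driver class com.toshiba.mwcloud.gs.sql.Driver ends up on the classpath, which is all the JDBC reader below needs.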
With the driver in place, reading a GridDB container comes down to a small PySpark script (spark-griddb.py):

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("GriddbJDBC") \
    .getOrCreate()

# The jdbc:gs:/// URL targets the "dockerGridDB" cluster's "public" database,
# reached through notificationMember 127.0.0.1:20001.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001") \
    .option("driver", "com.toshiba.mwcloud.gs.sql.Driver") \
    .option("query", "select * from SampleJDBC_Select") \
    .option("user", "admin") \
    .option("password", "admin") \
    .load()

df.show()
Run it with spark-submit spark-griddb.py; the output looks like this:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
23/03/14 17:28:09 INFO SparkContext: Running Spark version 3.2.1
23/03/14 17:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
23/03/14 17:28:10 INFO ResourceUtils: No custom resources configured for spark.driver.
23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
23/03/14 17:28:10 INFO SparkContext: Submitted application: GriddbJDBC
.
.
.
23/03/14 17:28:14 INFO CodeGenerator: Code generated in 12.26218 ms
+---+-----+
| id|value|
+---+-----+
| 0|test0|
| 1|test1|
| 2|test2|
| 3|test3|
| 4|test4|
+---+-----+
.
.
.
23/03/14 17:28:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9c34d37-861a-472b-a111-cc86c0d73747
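Once this works, the JDBC options from the original Postgres snippet carry over: you can use dbtable instead of query, plus Spark's standard partitioning options to read in parallel. The sketch below is an assumption-laden example: it reuses the SampleJDBC_Select table, assumes its id column is numeric (as in the output above), and assumes GridDB's SQL layer accepts the range predicates Spark generates for each partition.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GriddbJDBCPartitioned").getOrCreate()

# dbtable + partitionColumn/lowerBound/upperBound/numPartitions make Spark
# issue one range query per partition. The bounds only set the partition
# stride; rows outside [lowerBound, upperBound] are still read.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001") \
    .option("driver", "com.toshiba.mwcloud.gs.sql.Driver") \
    .option("dbtable", "SampleJDBC_Select") \
    .option("user", "admin") \
    .option("password", "admin") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "0") \
    .option("upperBound", "5") \
    .option("numPartitions", "2") \
    .load()

print(df.rdd.getNumPartitions())  # expect 2
df.show()

Writing back with df.write.format("jdbc") follows the same option pattern, but whether it succeeds depends on the GridDB JDBC driver accepting the CREATE TABLE/INSERT statements Spark generates, so test it against your cluster first.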