scala apache-spark datasource data-processing griddb

How can I use GridDB with Apache Spark?


I have a large amount of data stored in GridDB and want to process it using Apache Spark. However, I'm unsure how to connect GridDB to Spark or use GridDB as a data source.

Here's what I have so far:

val spark = SparkSession.builder().appName("GridDB-Spark").getOrCreate()
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/my_container")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "my_table")
  .option("user", "my_username")
  .option("password", "my_password")
  .load()

This code connects to a Postgres database, but I need to modify it to work with GridDB instead. I am struggling with the following points:

  1. What do I need to connect to my GridDB database and use it as a data source in Spark?
  2. Are there any best practices or recommendations for using GridDB with Spark?

Solution

    1. Download the GridDB JDBC driver from the URL below; choose the version that matches your GridDB server: https://central.sonatype.com/artifact/com.github.griddb/gridstore-jdbc/5.1.0/versions

    2. Copy the jar file into the $SPARK_HOME/jars directory
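A sketch of this step (the jar filename is an assumption; use the version you actually downloaded in step 1). As an alternative to copying into $SPARK_HOME/jars, the driver can also be attached per job with spark-submit's --jars flag:

```shell
# Put the GridDB JDBC driver on Spark's classpath
# (filename assumed; adjust to the version you downloaded).
cp gridstore-jdbc-5.1.0.jar "$SPARK_HOME/jars/"

# Or, instead of copying, attach it per job:
# spark-submit --jars gridstore-jdbc-5.1.0.jar spark-griddb.py
```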

    3. Prepare a local GridDB environment. This step is optional; you can use your own environment instead:

    docker run -d --network="host" griddb/griddb
    docker run --network="host" griddb/jdbc
    
    4. Prepare a spark-griddb.py file that queries the data over JDBC (I'm using Python, but the Scala version is very similar). If you are using your own environment, replace the IP address (127.0.0.1), port (20001), cluster name (dockerGridDB), table name (SampleJDBC_Select), username (admin), and password (admin) accordingly:
    import pyspark
    from pyspark.sql import SparkSession
    spark = SparkSession \
            .builder \
            .appName("GriddbJDBC") \
            .getOrCreate()
    
    # GridDB JDBC URL form: jdbc:gs:///<cluster>/<database>?notificationMember=<host>:<port>
    df = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001") \
        .option("driver", "com.toshiba.mwcloud.gs.sql.Driver") \
        .option("query", "select * from SampleJDBC_Select") \
        .option("user", "admin") \
        .option("password", "admin") \
        .load()
    
    df.show()
    
    5. Test it
    # spark-submit spark-griddb.py
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    23/03/14 17:28:09 INFO SparkContext: Running Spark version 3.2.1
    23/03/14 17:28:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
    23/03/14 17:28:10 INFO ResourceUtils: No custom resources configured for spark.driver.
    23/03/14 17:28:10 INFO ResourceUtils: ==============================================================
    23/03/14 17:28:10 INFO SparkContext: Submitted application: GriddbJDBC
    .
    .
    .
    23/03/14 17:28:14 INFO CodeGenerator: Code generated in 12.26218 ms
    +---+-----+
    | id|value|
    +---+-----+
    |  0|test0|
    |  1|test1|
    |  2|test2|
    |  3|test3|
    |  4|test4|
    +---+-----+
    .
    .
    .
    23/03/14 17:28:14 INFO ShutdownHookManager: Deleting directory /tmp/spark-c9c34d37-861a-472b-a111-cc86c0d73747
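On your second question: for large tables, a common best practice is a partitioned JDBC read, so the scan is split across parallel tasks instead of funneling everything through a single connection. Below is a minimal sketch using Spark's standard JDBC partitioning options; the partition column ("id") and its bounds are assumptions about your table and must match your actual data:

```python
# Partitioned JDBC read options for a large GridDB table.
# Note: with partitioning you must use "dbtable" (the "query" option
# cannot be combined with partitionColumn).
jdbc_options = {
    "url": "jdbc:gs:///dockerGridDB/public?notificationMember=127.0.0.1:20001",
    "driver": "com.toshiba.mwcloud.gs.sql.Driver",
    "dbtable": "SampleJDBC_Select",
    "user": "admin",
    "password": "admin",
    "partitionColumn": "id",   # numeric/date column the scan is split on (assumed)
    "lowerBound": "0",         # minimum value of the partition column (assumed)
    "upperBound": "1000000",   # maximum value of the partition column (assumed)
    "numPartitions": "8",      # number of parallel read tasks
}

# With a SparkSession in scope, the read itself would be:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

Rows outside [lowerBound, upperBound] are still read (the bounds only decide how the range is sliced into partitions), so stale bounds skew partition sizes rather than drop data. Also prefer pushing filters down in SQL (via the "query" option for non-partitioned reads, or a WHERE-filtered view) so GridDB, not Spark, discards unneeded rows.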