sql-serverscalaapache-sparkspark-jdbc

Calculate lower and upper bounds for partition Spark JDBC


I read the data from MS SQL server using Spark-jdbc with Scala and I would like to partition this data by the specified column. I do not want to set lower and upper bounds for the partition column manually. Can I read some kind of maximum and minimum value in this field and set it as upper/lower bounds? Also, using this query I want to read all the data from the database. For now the mechanism for querying looks like this:

def jdbcOptions() = Map[String,String](
    "driver" -> "db.driver",
    "url" -> "db.url",
    "user" -> "db.user",
    "password" -> "db.password",
    "customSchema" -> "db.custom_schema",
    "dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
    "partitionColumn" -> "db.partitionColumn",
    "lowerBound" -> "1",
    "upperBound" -> "30",
    "numPartitions" -> "5"
}

    val dataDF = sparkSession
      .read
      .format("jdbc")
      .options(jdbcOptions())
      .load()

Solution

  • In the case that dayColumn is a number or date field you can retrieve the boundaries with the next code:

    def jdbcBoundOptions() = Map[String,String]{
        "driver" -> "db.driver",
        "url" -> "db.url",
        "user" -> "db.user",
        "password" -> "db.password",
        "customSchema" -> "db.custom_schema",
        "dbtable" -> "(select max(db.partitionColumn), min(db.partitionColumn) from TestAllData where dayColumn > 'dayValue') as subq",
        "numPartitions" -> "1"
    }
    
    val boundRow = sparkSession
        .read
        .format("jdbc")
        .options(jdbcBoundOptions())
        .load()
        .first()
    
    val maxDay = boundRow.getInt(0)
    val mimDay = boundRow.getInt(1)
    

    Note that numPartitions must be 1 and we do not need to specify partitioning details in this case as described in Spark documentation.

    Finally you can use the retrieved boundaries for the original query:

    def jdbcOptions() = Map[String,String]{
        "driver" -> "db.driver",
        "url" -> "db.url",
        "user" -> "db.user",
        "password" -> "db.password",
        "customSchema" -> "db.custom_schema",
        "dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
        "partitionColumn" -> "db.partitionColumn",
        "lowerBound" -> minDay.toString,
        "upperBound" -> maxDay.toString,
        "numPartitions" -> "5"
    }