I read the data from MS SQL server using Spark-jdbc with Scala and I would like to partition this data by the specified column. I do not want to set lower and upper bounds for the partition column manually. Can I read some kind of maximum and minimum value in this field and set it as upper/lower bounds? Also, using this query I want to read all the data from the database. For now the mechanism for querying looks like this:
def jdbcOptions() = Map[String,String](
"driver" -> "db.driver",
"url" -> "db.url",
"user" -> "db.user",
"password" -> "db.password",
"customSchema" -> "db.custom_schema",
"dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
"partitionColumn" -> "db.partitionColumn",
"lowerBound" -> "1",
"upperBound" -> "30",
"numPartitions" -> "5"
}
val dataDF = sparkSession
.read
.format("jdbc")
.options(jdbcOptions())
.load()
In the case that dayColumn
is a number or date field you can retrieve the boundaries with the next code:
def jdbcBoundOptions() = Map[String,String]{
"driver" -> "db.driver",
"url" -> "db.url",
"user" -> "db.user",
"password" -> "db.password",
"customSchema" -> "db.custom_schema",
"dbtable" -> "(select max(db.partitionColumn), min(db.partitionColumn) from TestAllData where dayColumn > 'dayValue') as subq",
"numPartitions" -> "1"
}
val boundRow = sparkSession
.read
.format("jdbc")
.options(jdbcBoundOptions())
.load()
.first()
val maxDay = boundRow.getInt(0)
val mimDay = boundRow.getInt(1)
Note that numPartitions
must be 1 and we do not need to specify partitioning details in this case as described in Spark documentation.
Finally you can use the retrieved boundaries for the original query:
def jdbcOptions() = Map[String,String]{
"driver" -> "db.driver",
"url" -> "db.url",
"user" -> "db.user",
"password" -> "db.password",
"customSchema" -> "db.custom_schema",
"dbtable" -> "(select * from TestAllData where dayColumn > 'dayValue') as subq",
"partitionColumn" -> "db.partitionColumn",
"lowerBound" -> minDay.toString,
"upperBound" -> maxDay.toString,
"numPartitions" -> "5"
}