apache-spark, spark-jdbc

Spark JDBC "batch size" effect on insert


I wanted to know what effect the batchsize option has on an insert operation using Spark JDBC. Does it mean a single bulk insert command, or a batch of individual insert commands that gets committed at the end?

Could someone clarify, as this is not clearly explained in the documentation?


Solution

  • According to the source code, the batchsize option is used for the executeBatch method of PreparedStatement, which submits a batch of commands to the database for execution in one round trip.

    The key code:

    // Simplified from Spark's JdbcUtils.savePartition
    val stmt = conn.prepareStatement(insertStmt)
    var rowCount = 0
    while (iterator.hasNext) {
      // (the row's values are bound to stmt here; elided)
      stmt.addBatch()
      rowCount += 1
      if (rowCount % batchSize == 0) {
        stmt.executeBatch()  // flush a full batch to the database
        rowCount = 0
      }
    }

    if (rowCount > 0) {
      stmt.executeBatch()  // flush the final, partial batch
    }
    

    Back to your question: it is true that the rows are sent as

    a batch of insert commands

    But "gets committed at the end" is wrong, because it is possible for only part of those inserts to execute successfully; Spark imposes no extra transaction requirements here. Incidentally, Spark adopts the database's default isolation level if one is not specified.
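The flush-every-batchSize pattern in the Scala snippet above can be sketched outside of Spark. This minimal Python simulation (the `write_partition` name and list-based "batches" are illustrative stand-ins, not Spark's API) shows how 2,500 rows with a batch size of 1,000 produce three separate executeBatch-style flushes rather than one bulk insert:

```python
def write_partition(rows, batch_size):
    """Mimic Spark's loop: buffer rows, flush every batch_size rows."""
    batches = []  # each entry stands in for one stmt.executeBatch() call
    buffer = []   # stands in for the PreparedStatement's pending batch
    for row in rows:
        buffer.append(row)               # stmt.addBatch()
        if len(buffer) == batch_size:
            batches.append(len(buffer))  # stmt.executeBatch()
            buffer = []
    if buffer:                           # flush the final, partial batch
        batches.append(len(buffer))
    return batches

print(write_partition(range(2500), 1000))  # [1000, 1000, 500]
```

In Spark itself the batch size is supplied on the writer, e.g. `.option("batchsize", 1000)`; each flush is an independent submission, which is why a failure mid-way can leave earlier batches already applied.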