java hibernate batching

Hibernate: Why should I manually flush() even if I set batch_size in configuration file?


I'm learning to use Hibernate 5.2.10 with Java. I started with a few online tutorials but ran into the following question.

When using batching, all the tutorials I have seen first set the hibernate.jdbc.batch_size in the configuration file. After that the code is similar to this:

Session session = sessionFactory.openSession(); // sessionFactory is an existing SessionFactory instance
Transaction tx = session.beginTransaction();
for ( int i=0; i<1000000; i++ )
{
    Student student = new Student(.....);
    session.save(student);
    if( i % 50 == 0 ) // Same as the JDBC batch size
    {
        //flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();

Why should I call flush() and clear() manually? Shouldn't Hibernate do this automatically, since I have already set hibernate.jdbc.batch_size in the configuration file?

For me it seems like I'm batching my operations manually, so why do I have to set the value of hibernate.jdbc.batch_size then?


Solution

  • Specifying a JDBC batch_size value in the configuration and manually controlling the flush/clear of the persistence context are two independent strategies that serve very different purposes.

    The primary goal of pairing flush() with clear() is to minimize the memory consumed on the Java application side by the PersistenceContext as you save your student records. Remember that when you use a stateful Session, as your example does, Hibernate keeps an attached/managed copy of every saved entity in memory. Flushing those entities to the database and then clearing the context at regular intervals avoids running out of memory or degrading performance.
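
    As a rough illustration of the memory argument, here is a toy model in plain Java. It is not Hibernate code: ToyPersistenceContext and peakManaged are hypothetical names, and a HashMap merely stands in for the session's collection of managed entities. It only shows how the number of retained objects behaves with and without a periodic clear.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    /** Toy stand-in for a persistence context; NOT Hibernate's implementation. */
    final class ToyPersistenceContext {
        private final Map<Long, Object> managed = new HashMap<>();
        private int peakSize = 0;

        /** Keeps a reference to the entity, the way a stateful session retains managed entities. */
        void save(long id, Object entity) {
            managed.put(id, entity);
            peakSize = Math.max(peakSize, managed.size());
        }

        /** Drops all references so the entities become eligible for garbage collection. */
        void clear() {
            managed.clear();
        }

        int peakSize() {
            return peakSize;
        }

        /**
         * Saves `total` placeholder entities, clearing after every `clearEvery` saves
         * (0 means never clear), and returns the largest number held at any one time.
         */
        static int peakManaged(int total, int clearEvery) {
            ToyPersistenceContext ctx = new ToyPersistenceContext();
            for (int i = 1; i <= total; i++) {
                ctx.save(i, new Object());
                if (clearEvery > 0 && i % clearEvery == 0) {
                    ctx.clear();
                }
            }
            return ctx.peakSize();
        }
    }
    ```

    Under this model, saving 10,000 entities without clearing leaves all 10,000 referenced at once, while clearing after every 50 saves never holds more than 50.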

    The JDBC batch_size setting itself controls how many statements are grouped together before the driver sends them to the database, which improves performance. Let's take a slightly modified example:

    Session session = sessionFactory.openSession();
    try {
      session.getTransaction().begin();
      for ( int i = 0; i < 10000; ++i ) {
        Student student = new Student();
        ...        
        session.save( student );
      }
      session.getTransaction().commit();
    }
    catch( Throwable t ) {
      if ( session.getTransaction().getStatus() == TransactionStatus.ACTIVE ) {
        session.getTransaction().rollback();
      }
      throw t;
    }
    finally {
      session.close();
    }
    

    As you can see, we're not using flush() or clear() here.

    What happens here is that when Hibernate performs the flush at commit time, the driver sends the inserts to the database in groups of batch_size rather than one at a time. So instead of 10,000 individual round trips, a batch_size of 250 would need only 40 batches (10,000 / 250).
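
    The arithmetic can be sketched in a few lines of Java. BatchMath and batchCount are illustrative names, not part of Hibernate or JDBC, and the model assumes every batch is full except possibly the last one.

    ```java
    /** Illustrative helper, not part of Hibernate or JDBC. */
    final class BatchMath {
        /** Number of batches needed for the given number of statements (ceiling division). */
        static long batchCount(long statements, int batchSize) {
            if (statements < 0 || batchSize <= 0) {
                throw new IllegalArgumentException("statements must be >= 0 and batchSize must be > 0");
            }
            return (statements + batchSize - 1) / batchSize;
        }
    }
    ```

    For the example above, BatchMath.batchCount(10_000, 250) evaluates to 40, and one extra insert (10,001 rows) would add a 41st, partially filled batch.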

    What is important to recognize is that some factors disable insert batching, such as identity-based identifier generation (IDENTITY or AUTO_INCREMENT columns). Why?

    Because in order to store the entity in the PersistenceContext, Hibernate must know the entity's ID, and with IDENTITY-based identifier generation the only way to obtain that value is to execute each insert and read the generated value back from the database. Therefore, the inserts cannot be batched.

    This is precisely why people doing bulk insert operations often observe poor performance: they don't realize the impact that their chosen identifier generation strategy can have.

    It's best to use some type of cached sequence generator or an application-assigned identifier instead when you want to optimize batch loading.
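
    To get a feel for why the identifier strategy matters, the following back-of-the-envelope model counts database round trips for the two cases. It is a simplification under stated assumptions, not a measurement: RoundTripModel and its methods are hypothetical names, IDENTITY is modelled as one round trip per insert, and the sequence case assumes one sequence call per allocationSize identifiers plus one round trip per full or partial batch. Real numbers depend on the driver, the database, and how the generator caches values.

    ```java
    /** Simplified round-trip model; hypothetical names, not a Hibernate API. */
    final class RoundTripModel {
        private static long ceilDiv(long a, long b) {
            return (a + b - 1) / b;
        }

        /** IDENTITY: each insert runs on its own so the generated key can be read back. */
        static long identityRoundTrips(long rows) {
            if (rows < 0) {
                throw new IllegalArgumentException("rows must be >= 0");
            }
            return rows;
        }

        /**
         * Cached sequence: identifiers are fetched in blocks of `allocationSize`,
         * so they are known before the insert and the inserts can be sent in batches.
         */
        static long cachedSequenceRoundTrips(long rows, int allocationSize, int batchSize) {
            if (rows < 0 || allocationSize <= 0 || batchSize <= 0) {
                throw new IllegalArgumentException("rows must be >= 0; sizes must be > 0");
            }
            long sequenceCalls = ceilDiv(rows, allocationSize); // assumed: one call per block of IDs
            long batches = ceilDiv(rows, batchSize);            // assumed: one round trip per batch
            return sequenceCalls + batches;
        }
    }
    ```

    With 10,000 rows, an allocation size of 50 and a batch size of 250, this model gives 240 round trips for the cached sequence against 10,000 for IDENTITY.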

    Now going back to your example using flush() and clear(), the same caveat about the identifier generation strategy applies. If you want those operations to be sent to the database in batches, be mindful of the identifier strategy you're using for Student.