Bulk insert or update with Hibernate?

7

31

I need to consume a rather large amount of data from a daily CSV file. The CSV contains around 120K records. This slows to a crawl when using Hibernate: it seems Hibernate issues a SELECT before every single INSERT (or UPDATE) when using saveOrUpdate(), so for every instance being persisted a SELECT is issued before the actual INSERT or UPDATE. I can understand why it does this, but it's terribly inefficient for bulk processing, and I'm looking for alternatives.

I'm confident that the performance issue lies with the way I'm using Hibernate for this, since I got another version working with native SQL (that parses the CSV in the exact same manner) and it literally runs circles around this new version.

So, to the actual question: does a Hibernate alternative to MySQL's "INSERT ... ON DUPLICATE KEY UPDATE" syntax exist?

Or, if I choose to do native SQL for this, can I execute native SQL within a Hibernate transaction? Meaning, will it support commit/rollback?

Hinds answered 8/9, 2011 at 14:14 Comment(2)
What do you mean by "hibernate is doing a SELECT before every single insert (or update) when using saveOrUpdate()"? Could you post the code you are using to save the data? By the way, 120K records is a lot of data!Originality
Just found an article about batch processing in hibernateSuppressive
C
43

There are many possible bottlenecks in bulk operations. The best approach depends heavily on what your data looks like. Have a look at the Hibernate Manual section on batch processing.

At a minimum, make sure you are using the following pattern (copied from the manual):

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

for ( int i=0; i<100000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);
    if ( i % 20 == 0 ) { //20, same as the JDBC batch size
        //flush a batch of inserts and release memory:
        session.flush();
        session.clear();
    }
}

tx.commit();
session.close();

If you are mapping a flat file to a very complex object graph you may have to get more creative, but the basic principle is that you have to find a balance between pushing good-sized chunks of data to the database with each flush/commit and avoiding exploding the size of the session-level cache.

Lastly, if you don't need Hibernate to handle any collections or cascading for your data to be correctly inserted, consider using a StatelessSession.
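
As a rough sketch of the stateless variant (mirroring the Customer example above; a StatelessSession bypasses the first-level cache, cascades, and collection handling, so there is nothing to flush or clear):

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

for ( int i=0; i<100000; i++ ) {
    Customer customer = new Customer(.....);
    // each insert() hits the database immediately; no persistence context builds up
    session.insert(customer);
}

tx.commit();
session.close();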

Corella answered 8/9, 2011 at 15:10 Comment(5)
I am flushing and clearing my session, I don't have memory issues with the code. I have issues with the extra select! :P I read through the manual, I can't find anything. The data is super simple, no cascading needed. I just need to get rid of the, for this task, redundant select that gets called 120K times :PHinds
@Hinds what is the approximate percentage of new entities in this operation (i.e. what percentage of the selects are actually unnecessary)? Are you using versioning?Corella
The actual percentage will vary from day to day. However, none of the selects should really be necessary. Most databases today (even "toy" ones like SQLite) provide functionality which will let you automatically update a record if it already exists (without having to poll it first to find out whether it exists :))Hinds
You're correct that the functionality exists in virtually any DB you'd be likely to use with Hibernate. But, as best as I can tell Hibernate makes using it impossible. If you don't want to bypass Hibernate and execute the SQL directly via JDBC, your only option is to find other ways to speed your overall import process (e.g. concurrency, tuning of connection pool settings, disable 2nd-level cache, etc.).Corella
Could you add the version for the stateless method?Vierra
O
5

From Hibernate Batch Processing, for updates I used the following:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults employeeCursor = session.createQuery("FROM Employee")
                                          .scroll();
int count = 0;

while ( employeeCursor.next() ) {
    Employee employee = (Employee) employeeCursor.get(0);
    employee.updateEmployee();
    session.update(employee);
    if ( ++count % 50 == 0 ) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();

But for inserts I would go with jcwayne's answer.

Ormuz answered 26/3, 2014 at 8:43 Comment(0)
C
4

According to an answer to a similar question, it can be done by configuring Hibernate to insert objects using a custom stored procedure which uses your database's upsert functionality. It's not pretty, though.
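
As a rough illustration only (the DailyRecord entity, table, and column names below are hypothetical and assume MySQL; the exact approach depends on the linked answer), overriding the generated INSERT with Hibernate's @SQLInsert annotation might look something like this:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.hibernate.annotations.SQLInsert;

@Entity
@Table(name = "daily_record")
// The parameter order in the custom SQL must match the order in which Hibernate
// binds the columns (verify with SQL logging); entity and column names are made up here.
@SQLInsert(sql =
    "INSERT INTO daily_record (amount, name, id) VALUES (?, ?, ?) " +
    "ON DUPLICATE KEY UPDATE amount = VALUES(amount), name = VALUES(name)")
public class DailyRecord {

    @Id
    private Long id;

    private String name;

    private long amount;

    // getters and setters omitted
}

With a mapping like this, save() runs the custom upsert statement instead of the generated INSERT.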

Contreras answered 8/9, 2011 at 17:40 Comment(0)
S
4

High-throughput data import

If you only want to import data without doing any processing or transformation, then a tool like PostgreSQL COPY is the fastest way to import data.

Batch processing

However, if you need to do transformation, data aggregation, or correlation/merging between the existing data and the incoming data, then you need application-level batch processing.

In this case, you want to flush-clear-commit regularly:

int entityCount = 50;
int batchSize = 25;
 
EntityManager entityManager = entityManagerFactory()
    .createEntityManager();
     
EntityTransaction entityTransaction = entityManager
    .getTransaction();
 
try {
    entityTransaction.begin();
 
    for (int i = 0; i < entityCount; i++) {
        if (i > 0 && i % batchSize == 0) {
            entityTransaction.commit();
            entityTransaction.begin();
 
            entityManager.clear();
        }
 
        Post post = new Post(
            String.format("Post %d", i + 1)
        );
         
        entityManager.persist(post);
    }
 
    entityTransaction.commit();
} catch (RuntimeException e) {
    if (entityTransaction.isActive()) {
        entityTransaction.rollback();
    }
    throw e;
} finally {
    entityManager.close();
}

Also, make sure you enable JDBC batching using the following configuration properties:

<property
    name="hibernate.jdbc.batch_size"
    value="25"
/>
 
<property
    name="hibernate.order_inserts"  
    value="true"
/>
 
<property
    name="hibernate.order_updates"  
    value="true"
/>

Bulk processing

Bulk processing is suitable when all rows match pre-defined filtering criteria, so you can use a single UPDATE to change all records.

However, using bulk updates that modify millions of records can increase the size of the redo log or end up taking lots of locks on database systems that still use 2PL (Two-Phase Locking), like SQL Server.

So, while the bulk update is the most efficient way to change many records, you have to pay attention to how many records are to be changed to avoid a long-running transaction.

Also, you can combine bulk updates with optimistic locking so that other OLTP transactions won't lose the updates done by the bulk processing transaction.
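
As a rough sketch (the archived and createdOn fields and the manual version bump below are assumptions, not part of the example above), such a bulk update could look like this:

// One JPQL UPDATE touches every matching row; bumping the version column manually
// lets concurrent optimistic-locking transactions detect that the rows have changed.
int updatedRows = entityManager.createQuery(
        "UPDATE Post p " +
        "SET p.archived = true, p.version = p.version + 1 " +
        "WHERE p.createdOn < :cutoff")
    .setParameter("cutoff", LocalDateTime.now().minusMonths(6))
    .executeUpdate();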

Schuler answered 13/7, 2018 at 11:30 Comment(0)
S
1

If you use the sequence or native generator, Hibernate will issue a SELECT to get the id:

<id name="id" column="ID">
    <generator class="native" />
</id>

You should use the hilo or seqhilo generator instead:

<id name="id" type="long" column="id">  
    <generator class="seqhilo">
        <param name="sequence">SEQ_NAME</param>
        <param name="max_lo">100</param>
    </generator>
</id>

Semidome answered 8/5, 2013 at 12:12 Comment(0)
H
0

The "extra" select is to generate the unique identifier for your data.

Switch to HiLo sequence generation and you can reduce the sequence round trips to the database by a factor of the allocation size. Please note, there will be gaps in the primary keys unless you adjust your sequence value for the HiLo generator.
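
As a rough sketch (the generator and sequence names are made up), the annotation-based equivalent with an allocation size of 100 asks the database for a new sequence value only once per block of 100 ids; Hibernate hands out the values in between from memory via its optimizer (hilo or pooled, depending on version and configuration):

@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "record_gen")
@SequenceGenerator(name = "record_gen", sequenceName = "record_seq", allocationSize = 100)
private Long id;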

Hydroelectric answered 27/4, 2012 at 12:39 Comment(0)
I
0

To batch insert via Hibernate's underlying JDBC connection (using the same connection pool):

ArrayList<Long> resultIds = session.doReturningWork(connection -> {
    try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO MyTable(..) VALUES (?, ?, ?)",
        Statement.RETURN_GENERATED_KEYS
    )) {
        for (..each inserted row..) {
            ps.setLong(1, myAbcId);
            ps.setString(2, myString);
            ps.setTimestamp(3, new java.sql.Timestamp(millisecondsSinceEpoch));
            ps.addBatch();
        }
        ps.executeBatch();

        ResultSet resultSet = ps.getGeneratedKeys();
        ArrayList<Long> ids = new ArrayList<>();
        while (resultSet.next()) {
            ids.add(resultSet.getLong(1));
        }

        return ids;
    }
});

This returns all of the batch-inserted ids at once (ids are returned in insert order).

This works best with the rewriteBatchedStatements=true parameter on the MySQL JDBC connection URL, which rewrites the batch into a single multi-row INSERT ... VALUES (..), (..), ... rather than separate INSERT statements, e.g. jdbc:mysql://127.0.0.1/MyDatabaseName?...&rewriteBatchedStatements=true

For me, a 7K-row insert went from 2 minutes to 1 second using batch inserts (I think this was over a high-latency DB connection across the Atlantic). This article reports 332% faster: https://java-persistence-performance.blogspot.com/2013/05/batch-writing-and-dynamic-vs.html.

Source: https://groups.google.com/g/jdbi/c/ZDqnfhK758g?pli=1 "I can do it with JDBC" - Plap

To verify batch-inserts were truly happening, I enabled the mysql "general query log" on my local test db:

Enable: SET global general_log = 1; SET global log_output = 'table';
View: SELECT * FROM mysql.general_log;
Disable: SET global general_log = 0; TRUNCATE TABLE mysql.general_log;

For large numbers of inserts, you could try batches of 10k rows at once.

Related: MySQL and JDBC with rewriteBatchedStatements=true

Incommode answered 21/5 at 23:6 Comment(0)
