Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
            .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
    storeInFile(results.get()[0]);

The problem is the above will try to load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.

So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).

I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...

UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.

UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:

select * from person where id > <offset> and <other_conditions> limit 1

Since I have other conditions, even with them all in an index, it's still not as fast as I'd like it to be... so I'm still open to other suggestions.

Sofko answered 13/5, 2010 at 11:21 Comment(1)
You may be able to partition your data so you don't have to read as much at a time, ref: #8326245Excruciation
33

Using setFirstResult and setMaxResults is your only option that I'm aware of.

Traditionally a scrollable result set would only transfer rows to the client on an as-required basis. Unfortunately the MySQL Connector/J actually fakes it: it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip-feed it to you (evidenced by your out-of-memory problems). You had the right idea; it's just a shortcoming of the MySQL Java driver.

I found no way to get around this, so I went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.

Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.

EDIT:

Your UPDATE 2 is the best you're going to get unless you break out of the MySQL Connector/J. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index, this should be a somewhat cheap operation. I'd modify it slightly: grab a batch at a time, and use the highest id of that batch to grab the next batch.

Note: this will only work if other_conditions use equality (no range conditions allowed) and the last column of the index is id.

select * 
from person 
where id > <max_id_of_last_batch> and <other_conditions> 
order by id asc  
limit <batch_size>
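
A minimal sketch of that batching loop in plain JDBC, since you mentioned being willing to drop to JDBC (the connection URL, column names, batch size and storeInFile are illustrative assumptions, and the other conditions are omitted):

import java.sql.*;

public class BatchedExport {
    public static void main(String[] args) throws SQLException {
        // URL, credentials, table and column names are illustrative.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "pass")) {
            final int batchSize = 10000;
            long lastId = 0;              // assumes id is a positive, ascending key
            int rows;
            do {
                rows = 0;
                try (PreparedStatement ps = conn.prepareStatement(
                        "select id, name from person where id > ? order by id asc limit ?")) {
                    ps.setLong(1, lastId);
                    ps.setInt(2, batchSize);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");       // highest id of this batch
                            storeInFile(rs.getString("name"));
                            rows++;
                        }
                    }
                }
            } while (rows == batchSize);   // a short batch means we reached the end
        }
    }

    static void storeInFile(String value) {
        // write the value to your output file
    }
}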
Taxeme answered 13/5, 2010 at 11:56 Comment(5)
Using a StatelessSession is an especially nice tip!Anya
setFirstResult and setMaxResults are not a viable option. I was right in my guess that it would be unusably slow. Maybe that works for tiny tables, but very quickly it just takes way too long. You can test this in the MySQL console by simply running "select * from anything limit 1 offset 3000000". That might take 30 minutes...Sofko
Running "select * from geoplanet_locations limit 1 offset 1900000;" against the YAHOO Geoplanet dataset (5 mil rows) returns in 1.34 seconds. If you have enough RAM to keep the index in RAM, then I think your 30-minute numbers are way off. Funnily enough, "select * from geoplanet_locations where id > 56047142 limit 10;" returns in essentially no time (the regular client just returns 0.00).Taxeme
@Taxeme How did you find out that the MySQL connector fakes the scrolling? Is it written somewhere? I am interested because I'd like to use the scroll feature with NHibernate, and I am using the MySQL connector for .NET; I'd like to check whether the MySQL .NET connector also fakes it, or whether it depends on the version.Typhoid
Anyone know if MySQL connector faking the scroll is still the case?Ormolu
22

You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:

"The [JDBC] documentation says:

To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:

Query query = session.createQuery("SELECT person FROM Person person");
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
    Object row = results.get();
    // process row then release reference
    // you may need to evict() as well
}
results.close();

This allows you to stream over the result set; however, Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
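
For illustration, one common way to keep the Session's first-level cache bounded is to clear it every N rows; a minimal sketch replacing the while loop above (the Person entity, the interval of 1000 and storeInFile are illustrative, borrowed from the question):

int count = 0;
while (results.next()) {
    Person person = (Person) results.get()[0];   // get() returns an Object[] per row
    storeInFile(person);
    if (++count % 1000 == 0) {
        session.clear();   // drop entities accumulated in the first-level cache
    }
}
results.close();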

Manriquez answered 31/7, 2012 at 18:8 Comment(3)
Why would you call Session#flush() with a read-only session? Are you sure you did not mean Session#evict(row) or Session#clear(), which would help keep the level-1 cache size under control?Gluteal
(for followers, the code example used to mention flush but now mentions evict or clear)Excruciation
I used this with Postgres and I left out the setFetchSize. If you leave it in, it errors out.Lombardi
21

Set the fetch size in the query to an optimal value, as given below.

Also, when caching is not required, it may be better to use StatelessSession.

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize( 1000 ) // <<--- !!!!
        .setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
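
If you go the StatelessSession route, a sketch of that variant (sessionFactory and storeInFile are assumed from the surrounding code; the fetch size is illustrative):

StatelessSession session = sessionFactory.openStatelessSession();
try {
    ScrollableResults results = session.createQuery("SELECT person FROM Person person")
            .setReadOnly(true)
            .setFetchSize(1000)
            .scroll(ScrollMode.FORWARD_ONLY);
    while (results.next()) {
        storeInFile(results.get()[0]);   // no first-level cache to clear here
    }
    results.close();
} finally {
    session.close();
}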
Gerdagerdeen answered 25/1, 2011 at 11:44 Comment(3)
This is the way to go. See javaquirks.blogspot.dk/2007/12/mysql-streaming-result-set.html for additional reference.Crescentic
So are you guys saying that for MySQL you use Integer.MIN_VALUE, but for Oracle or others you should set the fetch size to a reasonable number?Lichee
This solution is not database-dependent. The same works for any database.Gerdagerdeen
9

FetchSize must be Integer.MIN_VALUE, otherwise it won't work.

This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

Quench answered 18/1, 2012 at 19:39 Comment(0)
3

Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:

Streaming large result sets with MySQL

Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
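
For reference, a minimal plain-JDBC version of the streaming setup that link describes (the URL, credentials and columns are placeholder assumptions):

import java.sql.*;

public class StreamPersons {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "pass");
             Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Integer.MIN_VALUE tells Connector/J to stream rows instead of buffering them all
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery("select id, name from person")) {
                while (rs.next()) {
                    // process each row as it arrives
                    System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}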

Enswathe answered 10/12, 2010 at 18:47 Comment(0)
1

With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html

Here's an example:

SELECT * from Person
LIMIT 200, 100

This would return rows 201 through 300 of the Person table.

You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
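
A rough sketch of that loop with Hibernate (the entity name, ordering column and batch size are illustrative; session and storeInFile are assumed from the question):

int batchSize = 100;
long total = (Long) session.createQuery("select count(p) from Person p").uniqueResult();
for (int offset = 0; offset < total; offset += batchSize) {
    List<?> batch = session.createQuery("from Person p order by p.id")
            .setFirstResult(offset)
            .setMaxResults(batchSize)
            .list();
    for (Object row : batch) {
        storeInFile(row);
    }
    session.clear();   // keep the first-level cache from growing across batches
}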

The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.

Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.

Abubekr answered 13/5, 2010 at 11:45 Comment(2)
That doesn't work either... Try doing a select (batch or otherwise) where the offset is in the millions, it will take a very long time. I'm willing to bypass Hibernate, any suggestions for doing this without Hibernate?Sofko
Try this article for a solution to the LIMIT performance: facebook.com/note.php?note_id=206034210932Abubekr
1

The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session after you are done writing each object to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.

Herwig answered 15/7, 2010 at 13:19 Comment(2)
the problem is hibernate doesn't even return from the query until all rows are retrieved, so I couldn't even evict() anything until it's all loaded anyway.Sofko
Sorry, I missed that in the question. If it is really a problem with the MySQL driver, then there are probably no other options than splitting the query yourself into multiple queries, as was already posted. I'm using ScrollableResults with the jTDS driver for MSSQL, and that helped to prevent OutOfMemoryErrors when processing large datasets from a database, so the idea itself is probably not wrong.Herwig
1

I propose not just sample code, but a query template based on Hibernate that does this workaround for you (pagination, scrolling and clearing the Hibernate session).

It can also easily be adapted to use an EntityManager.

Foster answered 7/4, 2013 at 7:57 Comment(0)
0

I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to, based on JDBC's dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE), and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure, or test with a SQL query. Also play around with setFetchSize to optimize the number of trips to the server.
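
For what it's worth, that metadata check is just the following in plain JDBC (assuming you already have a Connection in hand):

DatabaseMetaData dmd = connection.getMetaData();
boolean scrollable = dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE);
System.out.println("Driver claims scroll-insensitive result sets: " + scrollable);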

Inexorable answered 13/5, 2010 at 18:25 Comment(0)
0

I recently worked on a problem like this and wrote a blog post about how to face it. It is very similar; I hope it is helpful to someone. I use a lazy-list approach with partial acquisition: I replaced the LIMIT/OFFSET pagination of the query with manual pagination. In my example, the select returns 10 million records; I fetch them and insert them into a temporary table:

create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
 join table2 t2 on (t2.fieldpk = t1.fieldpk)
 join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;

After that, I can paginate without counting each row, using the assigned sequence instead:

select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000

From the Java side, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get() method can use a data-access interface to fetch the next chunk of data and release the memory heap:

@Override
public E get(int index) {
    // If the requested index falls outside the current buffer, fetch the next chunk.
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        lastIndexRoulette = index;
        bufferParcial.removeAll(bufferParcial);
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;   // no more data to acquire
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}

On the other side, the data-access interface uses the query to paginate and implements a method to iterate progressively, 25,000 records at a time, until everything is processed.

Results for this approach can be seen here: http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

Thyme answered 28/10, 2013 at 16:18 Comment(1)
Note that link-only answers are discouraged, SO answers should be the end-point of a search for a solution (vs. yet another stopover of references, which tend to get stale over time). Please consider adding a stand-alone synopsis here, keeping the link as a reference.Viburnum
0

Another option if you're "running out of RAM" is to just request, say, one column instead of the entire object: How to use hibernate criteria to return only one element of an object instead the entire object? (saves a lot of CPU processing time to boot).
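
A small sketch of that idea with a single-property projection (the property name and storeInFile are illustrative):

// Selecting one scalar property keeps each row tiny compared to a full entity.
ScrollableResults results = session.createQuery("select p.name from Person p")
        .setReadOnly(true)
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    storeInFile(results.get()[0]);   // just the projected column
}
results.close();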

Excruciation answered 16/4, 2015 at 21:15 Comment(0)
0

For me it worked properly only when setting useCursors=true; otherwise the scrollable result set ignores the fetch size. In my case the fetch size was 5000, but the scrollable result set fetched millions of records at once, causing excessive memory usage. The underlying DB is MSSQLServer.

jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true

Buntline answered 27/4, 2018 at 4:56 Comment(1)
It's useful to know it works for you, however the original question was about MySQL, so you might suggest checking whether the MySQL driver has the useCursors option, or ask the user if he has tried that.Simonasimonds
