How to avoid memory overflow using a high-throughput Java Stream from JDBC connectors?

Consider the task of fetching a limited amount of data from a (w.r.t. memory) practically infinite source. For example, I need to fetch counts from a huge table with billions of entries, labeled by timestamp and foreign_key, within a finite but possibly huge time window. For later consumption, though, I only need at most gridSize values.

Note: The backing DB is MariaDB, and we use Spring Data JPA to connect to the database.

One stream-based approach is:

int stepSize = (int) (numberOfCountsInWindow / gridSize) + 1;

StreamUtils
     .zipWithIndex(countStream)
     .filter(countIndexed -> countIndexed.getIndex() % stepSize == 0)
        ...
        <intermediate steps>
        ...
     .collect(Collectors.toList())

I've tried plenty of others, like:

  • a custom Collector that uses an AtomicLong to decide whether the next value is added to the final list or just ignored (sketched after this list),
  • Flux.defer( () -> ... ), to use the Flux back-pressure management
  • Java 11
  • plenty of MySQL options like "useCursorFetch=true" in combination with a finite fetch size set as a @QueryHint() (also sketched after this list)
  • ...
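
To make the first bullet concrete, here is a minimal sketch (a hypothetical helper, not our actual code) of a collector that keeps only every stepSize-th element via an AtomicLong counter; it is only correct for sequential streams, because the shared counter would make a parallel combiner meaningless.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collector;

// Keeps every stepSize-th element of a *sequential* stream; replaces zipWithIndex + filter.
static <T> Collector<T, ?, List<T>> everyNth(int stepSize) {
    AtomicLong seen = new AtomicLong();
    return Collector.of(
            ArrayList::new,
            (list, element) -> {
                if (seen.getAndIncrement() % stepSize == 0) {
                    list.add(element);
                }
            },
            (left, right) -> { left.addAll(right); return left; }); // combiner never used sequentially
}

And the fourth bullet, assuming Hibernate as the JPA provider (the hint name is Hibernate-specific) together with useCursorFetch=true on the JDBC URL, would attach to the hypothetical streamByWindow method from the sketch above roughly like this:

import javax.persistence.QueryHint;
import org.springframework.data.jpa.repository.QueryHints;

// Hibernate-specific fetch-size hint; combined with ...?useCursorFetch=true in the JDBC URL.
@QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "1000"))
@Query("select c from Count c where c.timestamp between :from and :to order by c.timestamp")
Stream<Count> streamByWindow(@Param("from") Instant from, @Param("to") Instant to);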

However, everything I've tried leads to GC overhead limit exceeded (with the heap size limited to 2048 MB, which is the default for our applications).

The reason is that the CPU is completely busy with garbage collection while memory consumption keeps filling up, until the application finally crashes.

What I actually expected (or at least hoped for) was that the stream would "realize" the filtering and just continue with the actual computation, at best using 100% CPU, and that the garbage collector would easily remove the unused objects, since they are filtered out anyway and don't even need to be parsed (I was hoping the laziness of the stream would help here). However, this doesn't seem to work the way I hoped.

Comments or suggestions are very welcome, since I'd like to at least understand (and at best solve) the situation rather than just accept it.

EDIT: Alternatives to MariaDB (maybe Cassandra?) are also welcome.

Equivalency answered 9/7, 2019 at 14:36 Comment(13)
JDBC drivers that leverage streams are still in the works. You're on your own until they arrive. Why do you think you need billions of records? Is there processing you could/should be doing on the database server that would not require you to bring all that data into the middle tier? Maybe this can help you: vladmihalcea.com/…Cartwright
Yeah, thanks for the comment. I actually knew this post. But I noticed that the memory consumption is very different depending on whether I access the data or just let the query run without doing anything with it, which doesn't seem to be in accordance with "streams are not working". Apparently they work to a certain extent, I just didn't entirely understand what's going on behind the scenes. There is also airpair.com/java/posts/spring-streams-memory-efficiency ... but it assumes too much reader knowledge, in my eyes.Equivalency
Did you actually do anything to verify your assumption that the memory would be sufficient for what you are trying to do, i.e. hold the result List? There is no magic solution to the problem of not having enough memory.Sile
@Sile Yes, of course. And yes, the result list easily fits into memory. I spent a whole day with VisualVM. Unfortunately I could only use the standard graphs, since the memory profiling crashed too. As I said, I also tried only running the query without further processing it. In that case memory was also sufficient, or at least the garbage collector managed to clean up fast enough.Equivalency
Activate the “heap dump on OutOfMemoryError” option and once you hit the limit, check where the memory is spent…Sile
Actually I was already watching the memory profiler and saw it's mostly filled with byte arrays and, secondly, Objects (which I think come from zipWithIndex, since it creates intermediate objects).Equivalency
Is there a reason for not performing the filtering in the database?Pindling
@Jens Schauder Yes. Specific: the filtering depends on a foreign key, and the solutions from #31602925 are (to the extent I managed to test them) not really applicable within a @Query annotation (stored procedures are no option for us, since they are uncontrolled w.r.t. testing and readability). General: data is growing, SQL is limited in function, and so are SQL databases in resources. We therefore tend to move logic into the Java services, which are easily horizontally scalable. Hence I'm searching for general solutions.Equivalency
That linked question talks about Java again. I was thinking about something like select * from (select ROW_NUMBER() OVER (ORDER BY something DESC) AS row_num, x.* from (<original select>) as x) where mod(row_num, stepSize) = 0Pindling
I am so sorry, it was the wrong link. I actually meant this one: #859246Equivalency
And yes, select ... over looks exactly like what I need. However, this is only available from MariaDB 10.2. Our production SLES a.t.m. is still at 10.0.35, but yes, we already considered asking for an update. Nevertheless, this is a specific solution to a specific issue. Here I was rather wondering about a more general (at best inside Java) approach, to keep the number of tools and thus the complexity lower. Comparison: we also use MQTT, but I'd rather stick to not using broker features, especially no broker-specific ones.Equivalency
The heap dump analysis is not only for checking which objects exist but also why they exist, i.e. which references are preventing them from being garbage collected.Sile
@Sile Thanks for the hint. I'll try to check the references. The problem, however, is that the heap dump also crashes a.t.m., but I think I can prevent this with a bit of tweaking.Equivalency
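
For completeness, a rough sketch of the window-function idea from the comments above, pushed into the database as a native query on the hypothetical CountRepository from the earlier sketches. It requires MariaDB 10.2+ for ROW_NUMBER(); the table and column names (counts, timestamp) are placeholders.

import java.time.Instant;
import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

interface CountRepository extends JpaRepository<Count, Long> {

    // Decimation done in the database: only every stepSize-th row of the window is returned,
    // so at most roughly gridSize rows ever reach the JVM.
    @Query(value =
            "select x.* from ("
          + "  select c.*, row_number() over (order by c.timestamp) as row_num"
          + "  from counts c"
          + "  where c.timestamp between :from and :to"
          + ") x where mod(x.row_num, :stepSize) = 0",
          nativeQuery = true)
    List<Count> sampleWindow(@Param("from") Instant from,
                             @Param("to") Instant to,
                             @Param("stepSize") int stepSize);
}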
