Consider the task of fetching a limited amount of data from a (w.r.t memory) infinite source, e.g. I need to fetch counts
from an huge table consisting of billions and more entries, labeled by timestamp
and foreign_key
within a finite but possibly huge time window. But for later consumption I only need maximally gridSize
values.
Note: The backing DB is a MariaDb and we use spring data jpa to connect to the database.
One stream based approach is:
int stepSize = (int) (numberOfCountsInWindow / gridSize) + 1;
StreamUtils
.zipWithIndex(countStream)
.filter(countIndexed -> countIndexed.getIndex() % stepSize == 0)
...
<intermediate steps>
...
.collect(Collectors.toList())
I've tried plenty of others, like:
- a custom
Collector
, that uses anAtromicLong
to decide if the next value is added to the final list or just ignored, - Flux.defere( () -> ), to use the Flux back-pressure management
- Java 11
- plenty of MySQL options like "useCursorFetch=true" in combination with a finite prefetch size as
@QueryHint()
- ...
However, all of what I've tried leads to GC overhead limit exceeded
(with the heap size limited to 2048, which is the default for our applications).
The reason is, the cpu is totally busy garbage collecting, while the memory consumption is just filling up until finally the application crashes.
What I actually expected (or at least hoped for), was the stream to "realize" the filtering and just continue with the actual computing, at best using 100% cpu, and the garbage collector to easily remove the unused objects, since they are filtered out anyway and don't even need to be parsed (was hoping for the laziness of the stream to help here). However, this doesn't seem to work as I was hoping for.
Comments or suggestions are very welcome, since I'd like to (at least) understand (at best solve) rather then just accept the situation.
EDIT: Alternatives to MariaDB (maybe Cassandra?) are also welcome.
List
? There is no magic solution to the problem of not having enough memory. – SileObject
(which I think comes from thezipWithIndex
since it creates intermediateObject
s. – Equivalency@Query
annotation (stored procedures are no option for us, since it is just uncontrolled w.r.t. testing and readability). General: Data is growing, SQL is limited in function, so are SQL databases in resources. We thus rather tend to move logic into the Java services, that are easily horizontally scale able. Hence, searching general solutions. – Equivalencyselect * from (select ROW_NUMBER() OVER (ORDER BY something DESC) AS row_num, x.* from (<original select>) as x) where mod(row_num, stepSize) = 0
– Pindlingselect ... over
looks exactly like what I need. However, this is only available from MariaDB 10.2 . Our production SLES a.t.m. is still at 10.0.35. But yes we already considered asking for an update. Nevertheless this is a specific solution, to a specific issue. Here, I was rather wondering about a more general (at best inside java) approach, to keep the number of tools and thus complexity lower. Comparison: We also us MQTT, but I rather stick to not using Broker features, especially no Broker specific ones. – Equivalency