Apache NiFi - OutOfMemory Error: GC overhead limit exceeded on SplitText processor

I am trying to use NiFi to process large CSV files (potentially billions of records each) using HDF 1.2. I've implemented my flow, and everything is working fine for small files.

The problem is that if I try to push the file size to 100MB (1M records) I get a java.lang.OutOfMemoryError: GC overhead limit exceeded from the SplitText processor responsible for splitting the file into single records. I've searched for that error, and it basically means that the garbage collector is running for too long without reclaiming much heap space. I expect this means that too many flow files are being generated too quickly.

How can I solve this? I've tried changing NiFi's configuration for the max heap space and other memory-related properties, but nothing seems to work.

Right now I've added an intermediate SplitText with a line count of 1K, which lets me avoid the error, but I don't see this as a solid solution: when the incoming file grows much larger than that, I'm afraid I will get the same behavior from the processor.

Any suggestion is welcome! Thank you

Impatient answered 29/7, 2016 at 8:11 Comment(0)

The reason for the error is that when splitting 1M records with a line count of 1, you are creating 1M flow files, which equates to 1M Java objects. Overall, the approach of using two SplitText processors is common and avoids creating all of the objects at the same time. You could probably use an even larger split size on the first split, maybe 10K. For a billion records I am wondering if a third level would make sense: split from 1B to maybe 10M, then 10M to 10K, then 10K to 1, but I would have to play with it.
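
As a rough illustration of the cascading approach (the Line Split Count values below are only a starting point and would need tuning for your data):

SplitText #1 - Line Split Count: 10000000 (1B lines -> ~100 flow files)
SplitText #2 - Line Split Count: 10000 (each 10M-line flow file -> 1,000 flow files)
SplitText #3 - Line Split Count: 1 (each 10K-line flow file -> 10,000 single-record flow files)

At any point only a bounded number of flow files exists at each level, instead of all the records becoming flow file objects at once.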

Some additional things to consider are increasing the default heap size from 512MB (which you may have already done) and figuring out whether you really need to split down to 1 line. It is hard to say without knowing anything else about the flow, but in a lot of cases, if you want to deliver each line somewhere, you could potentially have a processor that reads in a large delimited file and streams each line to the destination. For example, this is how PutKafka and PutSplunk work: they can take a file with 1M lines and stream each line to the destination.
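
For the heap, the JVM settings live in conf/bootstrap.conf; a minimal sketch of raising them (the java.arg indices can vary between versions, and 4g is only an example value):

java.arg.2=-Xms4g
java.arg.3=-Xmx4g

Keep in mind that a bigger heap only delays the problem if the flow still creates millions of flow file objects at the same time.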

Conversation answered 29/7, 2016 at 12:44 Comment(4)
If there's no "one-shot" way to do this I will definitely try multiple levels. Regarding PutKafka, I would end up setting up Kafka together with NiFi in the cluster. Ignoring the fact that this will take some cluster resources, are there advantages from a performance or other standpoint? Thank you as always for the useful information about NiFi's behavior.Impatient
Well, I wasn't necessarily saying you need Kafka as part of this; I was more asking what you want to do in your flow after you have split down to 1 line per flow file, to see if you really need to do that. A lot of times people just want to deliver these lines to an external system, and in those cases it might be possible to have a processor that streams in the large file and sends each line somewhere without creating millions of flow files. Kafka and Splunk were just two examples of that.Conversation
I actually do need to split the files line by line and then apply a different conversion/normalization to each of its fields. Then I merge the lines back together and export everything to Hive.Impatient
Just wondering why putting a back pressure threshold of 10K flow files on the success queue wouldn't help? This would block the SplitText processor from generating further flow files and reduce the number of objects in the JVM. You can even try a 1K threshold depending on how big your flow is.Olivette
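
A minimal sketch of that back pressure setting, configured on the success connection between SplitText and the next processor (10K is just the threshold suggested in the comment above):

Back Pressure Object Threshold: 10000
Back Pressure Data Size Threshold: 1 GB

Once the queue reaches either threshold, NiFi stops scheduling the upstream SplitText until the downstream processor catches up.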

I had a similar error while using the GetMongo processor in Apache NiFi. I changed my configuration to:

Limit: 100
Batch Size: 10

Then the error disappeared.

Trilbee answered 12/10, 2021 at 11:34 Comment(0)
