MergeContent with nifi - inconsistent length

I am attempting to write a file to disk with the MergeContent processor, but I'm getting widely varying file sizes, anywhere from one line to 806 lines. I've repeated the process many times while trying to work out the newline demarcator, as addressed in "Apache NiFi MergeContent processor - set demarcator as new line", and the resulting files are still seemingly randomly sized.

What parameters do I need to set to adhere to the following logic?

  1. Establish a single bin
  2. Route all flowfiles into bin
  3. If len(bin)>X or the age of the bin is greater than Max Bin Age, release the bin
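As a minimal sketch of the logic in the list above (plain Python, not NiFi code), the intended behavior might look like the following. The names `Bin`, `run`, `max_entries`, and `max_age_sec` are made up for illustration, the bin is released once it reaches X entries, and the age check only happens when a new flowfile arrives, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Bin:
    """Hypothetical single bin; not NiFi's actual Bin class."""
    max_entries: int                      # X in the list above
    max_age_sec: float                    # plays the role of "Max Bin Age"
    created_at: Optional[float] = None    # time the first entry arrived
    entries: List[str] = field(default_factory=list)

    def add(self, flowfile: str, now: float) -> None:
        if self.created_at is None:
            self.created_at = now
        self.entries.append(flowfile)

    def should_release(self, now: float) -> bool:
        if not self.entries:
            return False
        too_big = len(self.entries) >= self.max_entries
        too_old = now - self.created_at >= self.max_age_sec
        return too_big or too_old

    def release(self) -> List[str]:
        merged, self.entries, self.created_at = self.entries, [], None
        return merged


def run(arrivals: List[Tuple[float, str]], max_entries: int, max_age_sec: float) -> List[int]:
    """Feed (timestamp, flowfile) pairs, sorted by time, into one bin.

    Returns the size of every released bundle, including a final flush.
    """
    bin_ = Bin(max_entries, max_age_sec)
    sizes: List[int] = []
    for ts, flowfile in arrivals:
        # Release an aged-out bin before the new arrival joins it.
        if bin_.should_release(ts):
            sizes.append(len(bin_.release()))
        bin_.add(flowfile, ts)
        # Release immediately if this arrival filled the bin.
        if bin_.should_release(ts):
            sizes.append(len(bin_.release()))
    if bin_.entries:
        sizes.append(len(bin_.release()))
    return sizes
```

With fast input the size limit decides the bundle size; with slow input the age limit does, which is the behavior I am trying to get out of MergeContent.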

For completeness, these are the properties I currently have defined:

[Image: MergeContent processor settings]

As you can see, I've set "Max Bin Age" to "10 sec", following the syntax in https://github.com/apache/nifi/blob/31fba6b3332978ca2f6a1d693f6053d719fb9daa/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/java/org/apache/nifi/processors/standard/TestMergeContent.java#L219. That is the only place I've managed to find an example of this value; the documentation seems incomplete on this parameter.

I've set "Maximum Number of Entries" to 5000 and "Maximum Number of Bins" to 1.

What do I need to do to aggregate my records following the logic above? I also tried using the "Correlation Attribute Name" parameter with an attribute guaranteed to be identical on all documents reaching this point, and saw the same behavior.

Intelligible answered 23/1, 2016 at 0:50 Comment(0)

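The most important setting here is actually Minimum Number of Entries. The binning algorithm is lenient about the number of items: a bin can be merged as soon as it holds the minimum number of entries, so with a low minimum, bins are released long before they reach the maximum.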

For your specific logic, you would want to leave things as they stand and:

  • Set Minimum Number of Entries to 5000
  • Optionally, increase the maximum number of entries. Leaving it as configured will generate bins of exactly 5000 entries, except when Max Bin Age elapses before a bin has filled
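To illustrate why the minimum matters, here is a toy model in Python. It is a sketch under simplifying assumptions, not NiFi's implementation: each time the processor runs it adds whatever has queued up to a single bin, a bin that reaches the maximum is always released, and a partially filled bin is released only once it holds at least the minimum. The function name `simulate` and its parameters are made up for this example, and Max Bin Age is left out entirely.

```python
from typing import List, Tuple


def simulate(queued_per_trigger: List[int], min_entries: int, max_entries: int) -> Tuple[List[int], int]:
    """Toy single-bin model (assumes 1 <= min_entries <= max_entries).

    queued_per_trigger: number of flowfiles waiting each time the processor runs.
    Returns (sizes of merged bundles, entries still waiting in the bin).
    """
    bin_count = 0
    merged_sizes: List[int] = []
    for arrived in queued_per_trigger:
        bin_count += arrived
        # A bin that has reached the maximum is always released.
        while bin_count >= max_entries:
            merged_sizes.append(max_entries)
            bin_count -= max_entries
        # A partially filled bin is released only once it holds the minimum.
        if bin_count >= min_entries:
            merged_sizes.append(bin_count)
            bin_count = 0
    return merged_sizes, bin_count


# Illustrative input only: uneven batches arriving between processor runs.
batches = [541, 16, 493, 16, 9, 33]
low_min = simulate(batches, min_entries=1, max_entries=5000)
equal_min_max = simulate(batches, min_entries=1000, max_entries=1000)
```

In this model a minimum of 1 turns every uneven batch into its own bundle, while `min == max` yields only full bundles and holds the remainder back until more input arrives.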

Below is an image of the configuration above, where the min and max bin size are both 5000 and only 1 bin is handled at a time. In this case you'll see that exactly 20000 files have been merged into 4.

[Image: sample execution for a min and max bin size of 5000]

Ramentum answered 23/1, 2016 at 1:38 Comment(5)
Interesting, when I build a test flow with GenerateFlowFile I can get the behavior you're illustrating, but when I run it with my test data I'm still getting a really random distribution. These should have a min file size of 1000 and a timeout of 30 seconds:
  541 99583 3566100 1453404639289.output.json
  16 2920 107583 1453404678859.output.json
  493 97853 3122398 1453404758883.output.json
  16 3144 102679 1453404809634.output.json
  9 1916 66075 1453404859568.output.json
  33 6612 213507 1453404869690.output.json
- Intelligible
The variation stems from not having min and max equal and from specifying a max age. In my example, there was no max age, 1 bin, and min=max=5000. The role of Max Bin Age is to avoid starvation on the processor, so that input does not get stuck indefinitely waiting for other content to arrive. Because of this, there is always the chance of variance depending on the overall volume of input FlowFiles to this processor. To get a better feel for what is being done with this data, could you provide some additional context on your expectations for it beyond this point? Thanks! - Ramentum
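As a rough back-of-the-envelope check of how Max Bin Age interacts with input volume, the following Python sketch assumes a steady arrival rate, which real flows rarely have. The function names are hypothetical and this is arithmetic only, not a model of NiFi internals.

```python
def min_rate_to_fill(min_entries: int, max_bin_age_sec: float) -> float:
    """Arrival rate (flowfiles/sec) needed to reach min_entries before the bin ages out."""
    return min_entries / max_bin_age_sec


def entries_at_timeout(rate_per_sec: float, max_bin_age_sec: float, max_entries: int) -> int:
    """Entries a bin would hold when Max Bin Age elapses, capped at max_entries."""
    return min(max_entries, int(rate_per_sec * max_bin_age_sec))


# With a minimum of 1000 entries and a 30 second max age, input has to arrive
# at roughly 33 flowfiles per second for bins to fill before they time out.
needed = min_rate_to_fill(1000, 30)
```

Below that rate, under these assumptions, bundles are cut by age rather than by count, so their size follows the input volume.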
I had altered the processor to have 1000 minimum and 1000 maximum, a single bin, and a 30 second timeout. The result was, in about 15 seconds, 20+ files with, for example, 541, 16, 493, 16, 9, and 33 lines apiece. I'm wondering if it has to do with some correlation strategy or something: most of the files are named in the "1453404639289.output.json" scheme (which is the format of the original input file names), but there are usually one or two files with a <uuid>.json filename instead. - Intelligible
Having a hard time recreating. Would you mind opening up an issue with as much detail as possible concerning your case? issues.apache.org/jira/browse/NIFI/… Also include a template of the flow you are working with as a basis for additional exploration if it is possible. Thanks! - Ramentum
I've submitted bug issues.apache.org/jira/browse/NIFI-1438. Thanks Aldrin! Marking this as the answer for now. If it turns out I was doing something dumb I'll update the thread, and I will if it actually turns out to be a bug too! - Intelligible

In case anyone is having this exact issue, the cause may be not setting the run schedule on the MergeContent processor. After a lot of troubleshooting, I realized that this is one of those processors where "0 sec" is not an appropriate schedule. I had already set my Min Entries to some high number, and Max Entries as well. Max Bin Age was set to 5 min. It was the schedule that was causing the processor to keep grabbing flowfiles and bundling them up in random sizes.

Chainman answered 7/8, 2019 at 4:34 Comment(0)
