In Nifi, what is the difference between FirstInFirstOutPrioritizer and OldestFlowFileFirstPrioritizer

Asked 5/4, 2018 at 14:48 Answered 5/4, 2018 at 17:08

Solved hadoop apache-nifi hortonworks-dataflow

The user guide https://nifi.apache.org/docs/nifi-docs/html/user-guide.html has the details below on prioritizers. Could you please help me understand how these two differ, and provide a real-world example?

FirstInFirstOutPrioritizer: Given two FlowFiles, the one that reached the connection first will be processed first.

OldestFlowFileFirstPrioritizer: Given two FlowFiles, the one that is oldest in the dataflow will be processed first. 'This is the default scheme that is used if no prioritizers are selected.'

Inutility answered 5/4, 2018 at 14:48

Imagine two processors A and B that are both connected to a funnel, and then the funnel connects to processor C.

Scenario 1 - The connection between the funnel and processor C has first-in-first-out prioritizer.

In this case, the flow files in the queue between the funnel and processor C will be processed strictly in the order they reached the queue.

Scenario 2 - The connection between the funnel and processor C has oldest-flow-file-first prioritizer.

In this case, there could already be flow files in the queue between the funnel and processor C, but if one of the processors transfers a flow file to that queue that is older than all the flow files already in it, that flow file will jump to the front.

You could imagine that some flow files come from a different portion of the flow that takes much longer to process than the rest, but both kinds end up funneled into the same queue. The flow files from the slower portion entered the dataflow earlier, so they are considered older even though they reach the queue later.
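The two scenarios above can be sketched as a small simulation. This is not NiFi code: the `FlowFile` class and its `entered_flow` and `entered_queue` fields are hypothetical stand-ins for "when the flow file entered the dataflow" and "when it reached the queue in front of processor C", and the timestamps are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowFile:
    # Hypothetical model of a flow file, not the NiFi class.
    name: str
    entered_flow: float   # seconds: when it entered the dataflow
    entered_queue: float  # seconds: when it reached the queue before C


def first_in_first_out(queue):
    # Scenario 1: order by arrival time at this queue.
    return sorted(queue, key=lambda ff: ff.entered_queue)


def oldest_flow_file_first(queue):
    # Scenario 2: order by age in the dataflow as a whole.
    return sorted(queue, key=lambda ff: ff.entered_flow)


# Two flow files from processor A are already queued; then processor B,
# which sits behind a slower part of the flow, transfers an older one.
queue = [
    FlowFile("from_A_1", entered_flow=10.0, entered_queue=11.0),
    FlowFile("from_A_2", entered_flow=12.0, entered_queue=13.0),
    FlowFile("from_B_1", entered_flow=5.0, entered_queue=14.0),
]

fifo_order = [ff.name for ff in first_in_first_out(queue)]
oldest_order = [ff.name for ff in oldest_flow_file_first(queue)]

print(fifo_order)    # from_B_1 is last: it reached the queue last
print(oldest_order)  # from_B_1 is first: it is oldest in the dataflow
```

With first-in-first-out, `from_B_1` waits behind the two flow files from A. With oldest-first, it jumps to the front because it entered the dataflow before either of them.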

Sublunar answered 5/4, 2018 at 16:51

Apache NiFi handles data from many disparate sources and can route it through a number of different processors. Let's use the following example (ignore the processor types, just focus on the titles):

First, the relative rate of incoming data can be different depending on the source/ingestion point. In this case, the database poll is being done once per minute, while the HTTP poll is every 5 seconds, and the file tailing is every second. So even if a database record is 59 seconds "older" than another, if they are captured in the same execution of the processor, they will enter NiFi at the same time and the flowfile(s) (depending on splitting) will have the same origin time.

If some data coming into the system "is dirty", it gets routed to a processor which "cleans" it. This processor takes 3 seconds to execute.

If both the clean relationship and the success relationship from "Clean Data" went directly to "Process Data", you wouldn't be able to control the order in which those flowfiles were processed. However, because there is a funnel that merges those queues, you can choose a prioritizer on the queued queue and control that order. Do you want the first flowfile to enter that queue processed first, or do you want flowfiles that entered NiFi earlier to be processed first, even if they entered this specific queue after a newer flowfile?
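To make that choice concrete, here is a rough sketch of the merged queue, assuming a dirty record spends the 3 seconds mentioned above in "Clean Data" while clean records go straight to the funnel. The record names, ingest times, and helper functions are invented for illustration, and the sketch assumes all three flowfiles are waiting in the queue at the same moment (for example, because "Process Data" is busy).

```python
import heapq

CLEAN_DELAY = 3  # seconds spent in "Clean Data", as stated in the text


def arrivals(records):
    """Turn (name, ingest_time, is_dirty) into (name, ingest_time, queue_time)."""
    return [
        (name, ingest, ingest + (CLEAN_DELAY if dirty else 0))
        for name, ingest, dirty in records
    ]


def drain(items, key_index):
    """Pop names from a priority queue keyed on the chosen timestamp."""
    # The enumerate index breaks ties so equal timestamps keep input order.
    heap = [(item[key_index], i, item[0]) for i, item in enumerate(items)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical records: one dirty record ingested first, two clean ones after.
records = [
    ("dirty_t0", 0, True),
    ("clean_t1", 1, False),
    ("clean_t2", 2, False),
]
queued = arrivals(records)

by_queue_entry = drain(queued, key_index=2)  # first into this queue goes first
by_origin_time = drain(queued, key_index=1)  # earliest into NiFi goes first

print(by_queue_entry)
print(by_origin_time)
```

Ordering by queue entry lets the two clean records overtake the dirty one, because cleaning delayed its arrival. Ordering by origin time restores the dirty record to the front, since it entered NiFi first.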

This is a contrived example, but you can apply this to disaster recovery situations where some data was missed for a time window and is now being recovered, or a flow that processes time-sensitive data and the insights aren't valid after a certain period of time has passed. If using backpressure or getting data in large (slow) batches, you can see how in some cases, oldest first is less valuable and vice versa.

Cockfight answered 5/4, 2018 at 17:08
