What is the purpose of data provenance in Apache NiFi processors?

Every processor has a context menu with an option to configure it and another to view data provenance.

Is there a good explanation of what data provenance is?

Anesthesia answered 15/8, 2016 at 2:20 Comment(0)

Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues, you end up with a lot of logs, of course. If you want to follow the path a given piece of data took, or how long it took to travel that path, or what happened to an object that got split into different objects, piecing all of that together from logs is time consuming and tough. The provenance that NiFi supports is like logging on steroids: it is all about keeping and tracking the relationships between data and the events that shaped and impacted it. NiFi keeps track of where each piece of data comes from and what it learned about the data, maintains the trail across splits, joins, and transformations, records where it sends the data, and ultimately records when it drops the data. Think of it like a chain of custody for data.
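
To make the chain-of-custody idea concrete, here is a minimal Python sketch of an event trail that survives a split. It is only an illustration and not NiFi's implementation or API: the `Event` and `ProvenanceLog` classes and the item ids are hypothetical, and the event type strings are only loosely modelled on NiFi event types such as RECEIVE, FORK, SEND, and DROP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One hypothetical provenance record for a piece of data."""
    event_type: str      # e.g. "RECEIVE", "FORK", "SEND", "DROP"
    item_id: str         # the piece of data this event is about
    parents: tuple = ()  # ids of the items this one was derived from
    details: str = ""


class ProvenanceLog:
    """Toy append-only event log (an assumption, not NiFi's repository)."""

    def __init__(self):
        self.events = []

    def record(self, event_type, item_id, parents=(), details=""):
        self.events.append(Event(event_type, item_id, tuple(parents), details))

    def ancestors(self, item_id):
        """Walk backwards: every item this one was derived from."""
        found, stack = set(), [item_id]
        while stack:
            current = stack.pop()
            for event in self.events:
                if event.item_id == current:
                    for parent in event.parents:
                        if parent not in found:
                            found.add(parent)
                            stack.append(parent)
        return found

    def descendants(self, item_id):
        """Walk forwards: every item derived from this one."""
        found, stack = set(), [item_id]
        while stack:
            current = stack.pop()
            for event in self.events:
                if current in event.parents and event.item_id not in found:
                    found.add(event.item_id)
                    stack.append(event.item_id)
        return found

    def trail(self, item_id):
        """Events for an item and its ancestors, in recorded order."""
        ids = self.ancestors(item_id) | {item_id}
        return [event for event in self.events if event.item_id in ids]


log = ProvenanceLog()
log.record("RECEIVE", "A", details="fetched from a source system")
log.record("FORK", "A1", parents=["A"], details="first half of A")
log.record("FORK", "A2", parents=["A"], details="second half of A")
log.record("SEND", "A1", details="delivered to a destination")
log.record("DROP", "A2", details="filtered out")

for event in log.trail("A1"):
    print(event.event_type, event.item_id, event.details)
```

Asking for the trail of `A1` returns the receipt of its parent `A` as well as its own fork and send events, which is the kind of question that is painful to answer from plain logs.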

This is really valuable for a few reasons. First, understanding and debugging. Having this provenance captured means that from a given event you can go forwards or backwards in the flow to see where data came from and where it went. Given that NiFi also has an immutable, versioned content store under the covers, you can click directly through to the content at each stage of the flow. You can also replay the content and context of a given event against the latest flow. This in turn means much faster iteration toward the configuration and results you want. Second, this provenance model is valuable for compliance. You can prove whether or not you sent data to the correct systems. If you learn that you didn't, you then have the data with which to address the issue, and a powerful audit trail for follow-up.

The provenance model in Apache NiFi is really powerful, and it is being extended to Apache MiNiFi, which is a subproject of Apache NiFi. More systems producing more provenance means a far stronger ability to track data end to end. Of course this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores. Apache Atlas may be a great system to integrate with to get such a centralized view. NiFi is able not only to do what I described above but also to send these events to such a central store. So, exciting times ahead for this.

Hope that helps.

Dispensable answered 15/8, 2016 at 3:15 Comment(1)
This is very helpful/Anesthesia

On how the term data provenance is used more generally:

"A promising method of understanding suspicious events is causal analysis, in which system audit logs are transformed into a data provenance graph that encodes causal dependencies and historical relationships between subjects (processes) and objects (files, sockets, etc.)": https://dl.acm.org/doi/pdf/10.1145/3460120.3484551
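
As a rough illustration of that idea, the Python sketch below turns a few audit records into a graph whose edges follow the direction data flows, then asks what a given process could have influenced. The record format, the node names, and both helper functions are assumptions made for this example and are not taken from the paper. It also ignores event timestamps, which a real causal analysis would need to respect.

```python
from collections import defaultdict

# Hypothetical audit records: (subject, action, object).
audit_log = [
    ("proc:downloader", "write", "file:/tmp/payload"),
    ("proc:shell", "read", "file:/tmp/payload"),
    ("proc:shell", "write", "socket:10.0.0.5:443"),
    ("proc:editor", "read", "file:/home/user/notes.txt"),
]


def build_provenance_graph(records):
    """Map each node to the nodes that data flows into from it."""
    graph = defaultdict(set)
    for subject, action, obj in records:
        if action == "write":
            graph[subject].add(obj)   # process -> object it wrote
        elif action == "read":
            graph[obj].add(subject)   # object -> process that read it
    return graph


def influenced_by(graph, start):
    """All nodes reachable from start, i.e. possibly affected by it."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


graph = build_provenance_graph(audit_log)
print(sorted(influenced_by(graph, "proc:downloader")))
```

Here the downloader process is linked to the file it wrote, the shell that read the file, and the socket the shell wrote to, while the unrelated editor process stays out of the result.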

Luedtke answered 6/9 at 11:15 Comment(0)