These metrics are notoriously tricky. An in-depth dive into how they work can be seen in this talk by a member of the Beam / Dataflow team.
Pipelines are split in series of computations that occur in memory, and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:
with Pipeline() as p:
p | beam.ReadFromPubSub(...) \
| beam.Map(parse_data)
| beam.Map(into_key_value_pairs) \
| beam.WindowInto(....) \
| beam.GroupByKey() \
| beam.Map(format_data) \
| beam.WriteToBigquery(...)
This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.
The first stage goes from ReadFromPubSub
to the GroupByKey
operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey
, the data needs to be written to persistent state (and therefore into a new source).
The second stage goes from GroupByKey
to WriteToBigQuery
. In this case, the data is read from a 'source'.
Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
--
Answering your questions:
- What's awaiting processing?
Answer
It is how long an element waits in PubSub. Specifically, how long an element waits inside any source in the pipeline.
Consider a simpler pipeline:
ReadFromPubSub -> Map -> WriteToBigQuery
.
This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> **Confirm to PubSub that the item has been consumed**
.
Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.
This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
- Does maximum duration get adjusted after processing?
Answer
That's right. For instance, consider the previous pipeline again: BQ is dead for 5 minutes. When BQ comes back, a large batch of items may be written to it, and confirmed as read from PubSub. This will drastically reduce the system latency (and data freshness) back to a few seconds.
- What's time since event timestamp?
Answer
An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:
For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark could be held back if there is some input data that has not yet been processed.
This metric is, of course, heuristic. If some data point comes in very late, then the Data Freshness will be held back.
--
I'd advice you to check out the talk by Slava. It goes over all these concepts.