Cloud Dataflow what is the exact definition of freshness and latency?
Problem:

When using Cloud Dataflow, we are presented with two metrics (see this page):

  1. system latency
  2. data freshness

These are also available in Stackdriver under the following names (extract from here):

system_lag: The current maximum duration that an item of data has been awaiting processing, in seconds.

data_watermark_age: The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.

But these descriptions are still very vague:

  1. What does "awaiting processing" mean? Is it how long a message waits in PubSub, or the total time it has to wait inside the pipeline?
  2. The "maximum duration": after that maximum item is processed, will the metric be adjusted downward?
  3. "Time since event timestamp": does that mean that if my event was put into PubSub at timestamp t1 and it flows out of one end of the pipeline at timestamp t2, the pipeline watermark is at t1? I think I can assume that if the metric is at t1, everything before t1 can be assumed processed.

Question:

As these metrics tie into the semantics of Apache Beam, I would love to see some examples, or at least clearer definitions of these metrics, to make them usable.

Felly answered 7/3, 2019 at 17:36

These metrics are notoriously tricky. An in-depth explanation of how they work can be found in this talk by a member of the Beam / Dataflow team.

Pipelines are split into a series of computations that occur in memory and computations that require serializing your data to some sort of data store. For example, consider the following pipeline:

import apache_beam as beam

with beam.Pipeline() as p:
  (p
   | beam.io.ReadFromPubSub(...)
   | beam.Map(parse_data)
   | beam.Map(into_key_value_pairs)
   | beam.WindowInto(...)
   | beam.GroupByKey()
   | beam.Map(format_data)
   | beam.io.WriteToBigQuery(...))

This pipeline would get broken up into two stages. A stage is a series of computations that can be applied in memory.

The first stage goes from ReadFromPubSub to the GroupByKey operation. Everything in between those two PTransforms can be done in-memory. To perform the GroupByKey, the data needs to be written to persistent state (and therefore into a new source).

The second stage goes from GroupByKey to WriteToBigQuery. In this case, the data is read from a 'source'.

Each source has its own set of watermarks. The watermarks that you see in the Dataflow UI are the maximum watermarks coming from any source in the pipeline.
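To make that concrete, here is a minimal sketch (my own names, not Dataflow's actual implementation) of how the UI-level data freshness could be derived from per-source watermarks, following the description above:

import time

def data_freshness_seconds(per_source_watermarks):
    # Hypothetical illustration: take the maximum watermark reported
    # by any source (as described above) and report its age, i.e. how
    # far behind real time that watermark is.
    ui_watermark = max(per_source_watermarks.values())
    return time.time() - ui_watermark

# Example: two sources, one 30 s behind real time and one 5 s behind.
now = time.time()
print(data_freshness_seconds({'stage1': now - 30, 'stage2': now - 5}))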

--

Answering your questions:

  1. What's awaiting processing?

Answer

It is how long an element waits in PubSub. More generally, it is how long an element waits inside any source in the pipeline.

Consider a simpler pipeline:

ReadFromPubSub -> Map -> WriteToBigQuery.

This pipeline does the following operations for each item: Read an item from PubSub -> Operate on it -> Insert to BigQuery -> **Confirm to PubSub that the item has been consumed**.

Now, imagine that the BigQuery service goes down for 5 minutes. This means that PubSub will not receive confirmations for any of the elements for 5 minutes. Therefore, these elements will be stuck in PubSub for a while.

This means that the system latency (and the data freshness metric as well) will balloon up to 5 minutes while BQ writes are blocked.
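As a rough illustration (a toy model, not how Dataflow actually tracks this), system lag can be thought of as the age of the oldest element that has been read but not yet acknowledged:

import time

def system_lag_seconds(pending_arrival_times):
    # Toy model: system lag is the age of the oldest element pulled
    # from PubSub that has not yet been acknowledged back to it.
    return time.time() - min(pending_arrival_times)

# BigQuery has been down for ~5 minutes, so nothing has been acked:
now = time.time()
pending = [now - 300, now - 250, now - 10]  # arrival times (epoch seconds)
print(system_lag_seconds(pending))          # ~300 s: the metric balloons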

  2. Does the maximum duration get adjusted after processing?

Answer

That's right. For instance, consider the previous pipeline again: BQ is down for 5 minutes. When BQ comes back, a large batch of items may be written to it and acknowledged back to PubSub. This will drastically reduce the system latency (and data freshness) back to a few seconds.
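Continuing the same toy model from above: once the backlog is written and acknowledged, only fresh elements remain pending, and the computed lag collapses:

import time

now = time.time()
# The 5-minute backlog has just been written to BQ and acked back to
# PubSub, so only a recently arrived element is still awaiting processing.
pending = [now - 2]
print(time.time() - min(pending))  # lag drops back to ~2 seconds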

  3. What's "time since event timestamp"?

Answer

An event timestamp can be provided as an attribute of the message to PubSub. It's a bit of a tricky concept, but essentially:

For each stage there is an output data watermark. An output data watermark of T indicates that the computation has processed all elements with an event time before T. The latest an output data watermark can be is the earliest input watermark of all its upstream computations. However, the output watermark can be held back if there is some input data that has not yet been processed.
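As a minimal sketch of that rule (the function and its arguments are mine, not Beam's API):

def stage_output_watermark(upstream_output_watermarks, pending_event_times):
    # A stage's output watermark can advance to the earliest watermark
    # of its upstream computations, but is held back by the earliest
    # event time still awaiting processing in this stage.
    input_watermark = min(upstream_output_watermarks)
    if not pending_event_times:
        return input_watermark
    return min(input_watermark, min(pending_event_times))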

This metric is, of course, a heuristic. If some data point comes in very late, then the data freshness will be held back.
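As a concrete example of supplying the event timestamp (a sketch; the project, topic, subscription, and attribute names are all hypothetical), it can be attached as a message attribute when publishing, and the Beam source can be told to read it via timestamp_attribute:

import json
import time

import apache_beam as beam
from google.cloud import pubsub_v1

# Publish a message carrying its event time in a 'ts' attribute.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'my-topic')
publisher.publish(
    topic_path,
    json.dumps({'value': 42}).encode('utf-8'),
    ts=str(int(time.time() * 1000)),  # event time as epoch milliseconds
)

# On the pipeline side, point the source at that attribute; Beam then
# uses it as the element's event timestamp for windowing and watermarks.
read = beam.io.ReadFromPubSub(
    subscription='projects/my-project/subscriptions/my-sub',
    timestamp_attribute='ts',
)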

--

I'd advise you to check out the talk by Slava. It goes over all of these concepts.

Scapolite answered 7/3, 2019 at 22:16
