Stackdriver Trace with Google Cloud Run
I have been diving into a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but I am bothered by a few questions.

Given that

  • The Stackdriver agent aggregates traces in a small buffer and sends them periodically.
  • CPU access is restricted when a Cloud Run service is not handling a request.
  • There is no shutdown hook for Cloud Run services; you can't clear the buffer before shutdown: the container just gets a SIGKILL. This is a signal you can't catch from your application.
  • Running a background process that sends information outside of the request-response cycle seems to violate the Knative Container Runtime contract.
  • The collection of logging data is documented and does not require running an agent, but there is no equivalent solution for telemetry.
  • I found one report of someone experiencing lost traces on Cloud Run using the agent-based approach.
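The buffering behavior described in the first bullet is the crux of the problem. A minimal sketch of that agent pattern (illustrative names only, not the real agent's API) makes it clear what gets lost when the container is killed:

```python
import threading

class BufferedExporter:
    """Sketch of the agent pattern: spans collect in a small in-memory
    buffer and are uploaded on a periodic timer."""

    def __init__(self, send, interval=1.0):
        self._send = send          # callable that uploads a batch of spans
        self._buffer = []
        self._lock = threading.Lock()
        self._interval = interval

    def add_span(self, span):
        with self._lock:
            self._buffer.append(span)

    def flush(self):
        # Swap the buffer out under the lock, then upload outside it.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._send(batch)

    def start(self):
        # Periodic flush. On SIGKILL this timer simply never fires again,
        # so whatever is sitting in self._buffer is lost -- and with CPU
        # throttled between requests, the timer may not even run then.
        t = threading.Timer(self._interval, self._tick)
        t.daemon = True
        t.start()

    def _tick(self):
        self.flush()
        self.start()
```

Everything between the last timer tick and the SIGKILL disappears; that window is exactly what the question is about.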

How Google does it

I looked into the source code of the Cloud Endpoints ESP (its Cloud Run integration is in beta) to see whether Google solves this differently, but the same pattern is used there: traces accumulate in a buffer that is flushed periodically (every 1s).

Question

While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.

  • Is this a hypothetical problem or a real issue?

  • It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?

Biostatics answered 3/12, 2019 at 13:9 Comment(4)
What a well written question!!! Nice! Thank you for this.Rent
Related: #58261580Combustor
There seems to be a feature request for Cloud Run to send a SIGTERM before SIGKILL: issuetracker.google.com/issues/131849051Kinchinjunga
That feature is actually rolling out right now.Biostatics
B
0

Cloud Run now supports sending SIGTERM. If your application handles SIGTERM it'll get 10 seconds grace time before shutdown.

You can use the 10 seconds to:

  • Flush buffers that have unsent data
  • Close connections to other systems

Docs: Container runtime contract
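A minimal sketch of wiring this up, in Python for illustration (`flush` here is a placeholder for whatever flush call your exporter exposes, plus any connection cleanup):

```python
import signal

def install_sigterm_flush(flush):
    """Cloud Run delivers SIGTERM roughly 10 seconds before SIGKILL.
    Register a handler that drains buffered telemetry during that
    grace period. `flush` stands in for your exporter's flush call
    and any connection teardown."""
    def handler(signum, frame):
        flush()  # e.g. exporter.flush(); then close connections
    signal.signal(signal.SIGTERM, handler)
```

Keep the handler fast: if it takes longer than the grace period, the SIGKILL still lands.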

Biostatics answered 31/8, 2021 at 12:36 Comment(0)
V
4

Is this a hypothetical problem or a real issue?

If you consider a Cloud Run service receiving a single request, then it is definitely a problem, as the library will not have time to flush the data before the CPU of the container instance gets throttled.

However, in real life use cases:

  • A Cloud Run service often receives requests continuously or frequently, which means that its container instances are going to either continuously have CPU or have CPU available from time to time.
  • It is OK to drop traces: if some traces are not collected because the instance is shut down, you have likely collected a diverse enough set of samples before that happens. Also, you might only be interested in the aggregated reports, in which case collecting every individual trace does not matter.

Note that trace libraries usually sample the requests to trace themselves; they rarely trace 100% of requests.
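That sampling behavior is the reason dropped buffers usually hurt less than you'd expect. A sketch of the head-based probabilistic sampling most libraries apply:

```python
import random

def should_sample(rate, rng=random.random):
    """Head-based probabilistic sampling: decide up front whether a
    request produces spans at all. `rate` is the fraction of requests
    traced; most libraries default to well below 1.0."""
    return rng() < rate
```

With, say, `rate=0.1`, only one request in ten is traced to begin with, so a lost buffer costs you some samples rather than complete coverage. The batch/cron case in the comments below, where a single daily request is traced at 100%, is exactly where this reasoning breaks down.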

It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?

No, Stackdriver Trace takes its data from the spans sent to its API. Note that to send data to Stackdriver Trace, you can use libraries like OpenCensus and OpenTelemetry; the proprietary Stackdriver Trace libraries are no longer the recommended way.

Voltz answered 5/12, 2019 at 5:16 Comment(2)
I think this assumption falls over in a use case like using Cloud Run for batch/cron jobs (say, once a day, or once every 2 hours). You get one request, you set sampling rate to 100%, but after the request is completed, there's a high chance you'll miss that once-a-day trace data.Lifelike
Cloud Run now supports SIGTERM (see answer)Biostatics
L
1

You're right. This is a fair concern since most tracing libraries tend to sample/upload trace spans in the background.

Since (1) your CPU is scaled nearly to zero when the container isn't handling any requests and (2) the container instance can be killed at any time due to inactivity, you cannot reliably upload the trace spans collected in your app. As you said, it may sometimes work since we don't fully stop the CPU, but it won't always work.

It appears that some of the Stackdriver (and/or OpenTelemetry, f.k.a. OpenCensus) libraries let you control the lifecycle of pushing trace spans.

For example, this Go package for OpenCensus Stackdriver exporter has a Flush() method that you can call before completing your request rather than relying on the runtime to periodically upload the trace spans: https://godoc.org/contrib.go.opencensus.io/exporter/stackdriver#Exporter.Flush

I assume tracing libraries in other languages expose similar Flush() methods; if not, please let me know in the comments, as that would be a valid feature request for those libraries.
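The shape of that approach, sketched in Python with a stand-in exporter (the Go exporter's Flush() linked above is the concrete instance; names here are hypothetical):

```python
class RecordingExporter:
    """Stand-in for a trace exporter with a flush-style method."""
    def __init__(self):
        self.buffer = []    # spans waiting to be uploaded
        self.uploaded = []  # spans that reached the backend

    def flush(self):
        self.uploaded.extend(self.buffer)
        self.buffer.clear()

def handle_request(process, exporter):
    """Flush inside the request handler, before the response returns,
    instead of relying on a background upload the throttled container
    may never get CPU to run."""
    try:
        return process()
    finally:
        exporter.flush()
```

The trade-off is added request latency (the upload now sits on the critical path), which is why the libraries default to background batching in the first place.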

Lifelike answered 3/12, 2019 at 23:5 Comment(2)
The current node.js tracing agent library doesn't have a flush method :(Gory
I think this would be a valid issue request to their GitHub repository. Also a valid use case for those of us at Google to do a survey of what's supporting this. Thanks for bringing up.Lifelike
