I have been diving into a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but I am bothered by a few questions.
Here is what I have found so far:
- The Stackdriver agent aggregates traces in a small buffer and sends them periodically.
- CPU access is restricted when a Cloud Run service is not handling a request.
- There is no shutdown hook for Cloud Run services, so you can't flush the buffer before shutdown: the container just gets a SIGKILL, a signal your application can't catch.
- Running a background process that sends information outside of the request-response cycle seems to violate the Knative Container Runtime contract.
- The collection of logging data is documented and does not require me to run an agent, but there is no such solution for telemetry.
- I found one report of someone experiencing lost traces on Cloud Run using the agent-based approach.
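
To illustrate the concern in the list above, here is a minimal sketch of a periodically flushed buffer next to one that is flushed before the response is returned. `SpanBuffer`, `handle_request`, and the 60-second interval are hypothetical stand-ins chosen for illustration; this is not the Stackdriver agent's actual implementation.

```python
import time


class SpanBuffer:
    """Hypothetical stand-in for an agent's in-memory trace buffer."""

    def __init__(self, flush_interval_s, export):
        self.flush_interval_s = flush_interval_s
        self.export = export  # callable that ships a list of spans
        self.pending = []
        self.last_flush = time.monotonic()

    def add(self, span):
        self.pending.append(span)

    def flush(self):
        """Export everything that is buffered and reset the timer."""
        if self.pending:
            self.export(list(self.pending))
            self.pending.clear()
        self.last_flush = time.monotonic()

    def maybe_flush(self):
        """Periodic behaviour: export only once the interval has elapsed."""
        if time.monotonic() - self.last_flush >= self.flush_interval_s:
            self.flush()


def handle_request(buffer, name, flush_before_response):
    start = time.time()
    # ... the real request handling would happen here ...
    buffer.add({"name": name, "start": start, "end": time.time()})
    if flush_before_response:
        buffer.flush()        # span leaves the process inside the request
    else:
        buffer.maybe_flush()  # span may stay in memory after the response
    return "ok"


# Periodic flushing: the span is still in memory after the response.
exported = []
periodic = SpanBuffer(flush_interval_s=60.0, export=exported.extend)
handle_request(periodic, "GET /a", flush_before_response=False)
lost_if_killed = len(periodic.pending)  # what a SIGKILL right now would drop

# Flushing before the response: nothing is left behind.
sync_exported = []
sync = SpanBuffer(flush_interval_s=60.0, export=sync_exported.extend)
handle_request(sync, "GET /a", flush_before_response=True)
```

Flushing inside the request keeps the export within the request-response cycle, at the cost of added latency on every request.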
How Google does it
I went into the source code for the Cloud Endpoints ESP (the Cloud Run integration is in beta) to see whether they solve it in a different way, but the same pattern is used there: traces are kept in a buffer (1s) that is flushed periodically.
Question
While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.
Is this a hypothetical problem or a real issue?
It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?
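
To make the log-based idea concrete, here is a minimal sketch, assuming the service writes one JSON line per unit of work to stdout and tags it with the trace ID from the incoming `X-Cloud-Trace-Context` header. The `logging.googleapis.com/trace` field is what Cloud Logging uses to correlate a log entry with a trace; `duration_ms`, `my-project`, and the helper names are hypothetical, and whether Stackdriver Trace can build spans from such entries is exactly the open question.

```python
import json
import sys


def parse_trace_id(header):
    """Extract TRACE_ID from an 'X-Cloud-Trace-Context: TRACE_ID/SPAN_ID;o=1' value."""
    if not header:
        return None
    trace_id = header.split("/", 1)[0].strip()
    return trace_id or None


def span_log_entry(project_id, header, name, duration_ms):
    """Build a structured log entry that carries the trace correlation field."""
    entry = {
        "severity": "INFO",
        "message": name,
        "duration_ms": duration_ms,  # hypothetical custom field
    }
    trace_id = parse_trace_id(header)
    if trace_id:
        entry["logging.googleapis.com/trace"] = (
            f"projects/{project_id}/traces/{trace_id}"
        )
    return entry


def emit(entry, stream=sys.stdout):
    """Write the entry as a single JSON line; no agent or background thread."""
    stream.write(json.dumps(entry) + "\n")
    stream.flush()


# Example values are made up for illustration.
entry = span_log_entry(
    "my-project",
    "105445aa7843bc8bf206b12000100000/1;o=1",
    "GET /a",
    12.5,
)
emit(entry)
```

Because the line is written synchronously while the request is being handled, nothing is left in an in-process buffer when the container is stopped.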