How to use Google Cloud PubSub and Run to handle resource-intensive long-running tasks?
H

2

14

I've got a Google Cloud PubSub topic which at times has thousands of messages and at times zero messages coming in. These messages represent tasks which can take upwards of an hour each. Preferably I'm able to use Cloud Run for this, as it scales really well to the demand, if a thousand messages gets published, I want 100s of Cloud Run instances to spin up. These Run instances get started by a push subscription. The problem is that PubSub has a 600 second timeout for the acknowledgement. This means in order to have Cloud Run process these messages they have to finish within 600 seconds. If they do not, PubSub times it out, and sends it again, causing the task to be restarted until the first task finally does acknowledge it (this causes the same task to be ran many times). Cloud Run acknowledges the messages by returning a 2** HTTP status code. The documentation states

When an application running on Cloud Run finishes handling a request, the container instance's access to CPU will be disabled or severely limited. Therefore, you should not start background threads or routines that run outside the scope of the request handlers.

So is it maybe possible to acknowledge a PubSub request through code and continue the processing, without having Google Cloud Run hand over the resources? Or is there a better solution I'm unaware of?

Because these processes are so code/resource-intensive, I feel Cloud Functions will not suffice. I've looked at https://cloud.google.com/solutions/using-cloud-pub-sub-long-running-tasks and https://cloud.google.com/blog/products/gcp/how-google-cloud-pubsub-supports-long-running-workloads. But these didn't answer my question. I've looked at Google Cloud Tasks, which might be something? But the rest of the project has been built around PubSub/Run/Functions, so preferably I stick with that.

This project is written in Python. So preferably I would like to write my Google Cloud Run tasks like this:

@app.route('/', methods=['POST'])
def index():
    """Endpoint for Google Cloud PubSub messages"""
    pubsub_message = request.get_json()
    logger.info(f'Received PubSub pubsub_message {pubsub_message}')
    if message_incorrect(pubsub_message):
        return "Invalid request", 400 #use normal NACK handling
    # acknowledge message here without returning

    # ...
    # Do actual processing of the task here
    # ...

So how can or should I solve this, so that the the resource-intensive tasks get properly scaled on demand ( so a push PubSub subscription ). And the tasks only get executed once.

Answers: In short what has been answered. Cloud Run and Functions are just not suited for this problem. There is no way to have them do tasks that take longer than 9 or 15 minutes respectively. The only solution is to switch over to another Google Service and use a pull style subscription and lose out on auto-scaling of GC Run/Functions

Hushhush answered 5/8, 2019 at 19:35 Comment(0)
G
5

Cloud Run on GKE can handle long process, more CPU and memory than available on managed platform. However, you have a GKE cluster always running and you loose the "pay-as-you-use" benefit.

If you want to use this solution, don't link directly PubSub push subscription to your Cloud Run on GKE. Use Cloud Task with HTTP job for this. The timeout is longer than PubSub (up to 24h instead of 10 min) and the retry policies are customizables.

Greg answered 5/8, 2019 at 20:14 Comment(0)
I
4

Neither Cloud Functions nor Cloud Run is sufficient for arbitrarily long running operations. Cloud Functions has a hard cap of 9 minutes per invocation, and Cloud Run caps at 60. If you need more time, you're going to have to delegate the work to another product, such as Google Compute Engine. It should be possible to kick off some Compute Engine work from one of the serverless products.

Give the limits of pubsub acks, you'll probably have to find a way for a client to be able to poll or listen to some resource to find out when the work is actually done. You could use a database for that, and Cloud Firestore lets you listen to documents to find out when they change. So you could use that to track the status of your long-running work.

Inimitable answered 5/8, 2019 at 19:46 Comment(2)
It is also worth noting that when using Google Cloud Pub/Sub outside of Cloud Run or Cloud Functions, one can call modAckDeadline to increase the deadline for acking a message if process is still ongoing. The Cloud Pub/Sub client libraries handle this automatically. The max amount of time to call modAckDeadline can be controlled via the max_lease_duration property.Dream
As of June 3, 2021, Request timeouts up to 60 minutes are now at general availability (GA). See release note.Nealy

© 2022 - 2024 — McMap. All rights reserved.