Why do processes running on Google App Engine hang?

We have a cron job that runs every hour on a backend module and creates tasks. The cron job runs queries on the Cloud SQL database, and the tasks make HTTP calls to other servers and also update the database. Normally they run fine, even when thousands of tasks are created, but sometimes the process gets "stuck" and there is nothing in the logs that sheds any light on the situation. For example, yesterday we monitored the cron job while it created a few tens of tasks, and then it stopped, along with 8 of the tasks that also got stuck in the queue. When it was obvious that nothing was happening, we ran the process a few more times and each time it completed successfully.

After a day the original task was killed with a DeadlineExceededException, and then the 8 other tasks, which were apparently running in the same instance, were killed with the following message: "A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)"

Until the processes were killed we saw absolutely no record of them in the logs, and now that we do see them there are no log records before the time of the DeadlineExceededException, so we have no idea at what point they got stuck. We suspected some kind of lock in the database, but the following link shows that there is a 10-minute limit for queries, so a stuck query should cause the process to fail much sooner than after one day: https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
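One thing we may try, to rule a stuck query in or out, is setting an explicit JDBC statement timeout so that a query blocked by a lock fails quickly instead of hanging silently. A rough sketch of the idea, not our actual code (the connection URL, table name, and timeout value are placeholders, and it assumes the Cloud SQL driver honors Statement.setQueryTimeout):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryWithTimeout {

    public static long countRows() throws ClassNotFoundException, SQLException {
        // Legacy App Engine Cloud SQL driver; project, instance and database names are placeholders.
        Class.forName("com.mysql.jdbc.GoogleDriver");
        String url = "jdbc:google:mysql://my-project:my-instance/my-database?user=root";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Give up after 60 seconds instead of blocking indefinitely on a lock
            // (assumes the driver supports query timeouts).
            stmt.setQueryTimeout(60);
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM some_table")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}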

Our module's class and scaling configuration is:

<instance-class>B4</instance-class>
<basic-scaling>
    <max-instances>11</max-instances>
    <idle-timeout>10m</idle-timeout>
</basic-scaling>

The configuration of the queue is:

<rate>5/s</rate>
<max-concurrent-requests>100</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
    <task-retry-limit>5</task-retry-limit>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
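For reference, the cron handler creates the tasks against this queue with the standard Task Queue API, roughly along the following lines (the queue name, handler URL, and parameter are simplified placeholders rather than our exact code):

import java.util.List;

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class TaskEnqueuer {

    // Called by the cron servlet with the IDs returned by the Cloud SQL query.
    public static void enqueueAll(List<Long> itemIds) {
        Queue queue = QueueFactory.getQueue("my-task-queue"); // placeholder queue name
        for (Long itemId : itemIds) {
            queue.add(TaskOptions.Builder
                    .withUrl("/tasks/process")                // placeholder handler URL
                    .param("id", Long.toString(itemId))
                    .method(TaskOptions.Method.POST));
        }
    }
}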

I uploaded some images of the trace data for the cron job: https://i.sstatic.net/1VZar.jpg. This includes the trace summary, and the beginning/ending of the timeline. There is no trace data for the 8 terminated tasks.

What could be the cause of this and how can we investigate it further?

Headpin answered 22/1, 2016 at 17:10 Comment(8)
Could you try enabling Cloud Trace and then including a trace of one of your slow requests? – Skimpy
@Skimpy I edited my question to include trace data for the cron job: imgur.com/a/H5wGG. This includes the trace summary and the beginning/ending of the timeline. There is no trace data for the 8 terminated tasks. – Headpin
A standard practice is to isolate the issue first. Can you run your query against the database on its own and see how long it takes? That may not be straightforward from within Google App Engine, so you may want to run the query from your local machine after authorizing your machine's IP on the Cloud SQL instance, etc. – Gusgusba
The 24h timeout is because you are using basic scaling. The trace you have included rules out some API issues, but not Cloud SQL (which does not show up in the trace because it is not a simple request-response protocol). Next you might want to try "SHOW PROCESSLIST" on the Cloud SQL host while the process is hung, to see whether it is connected and what it is doing. That will let us determine whether it is hanging due to a slow Cloud SQL query or a deadlock. – Skimpy
@Skimpy Right now there are some tasks that have been stuck for a few hours. "SHOW PROCESSLIST" shows 17 rows where the command is "Sleep", the user is "root" and the host is "localhost". 8 of them have a "time" of 10199 (2+ hours), 8 have a "time" of 8656 (2+ hours), and 1 has a "time" of 243 (4 minutes; this last one keeps renewing itself). These are the only results of "SHOW PROCESSLIST", besides the row showing me running this query. There are no other requests being handled on this server. – Headpin
So it's not spending time waiting for an SQL query to execute, and it's not spending time on other API calls. At this point it might be worth instrumenting your requests by having them log messages during the work they do. Then we can see how far they get. – Skimpy
@Skimpy I do indeed have many log messages during the process, including a log line before anything else happens. Unfortunately I can't see any of these lines in the output; I only see the output of the DeadlineExceededException. Yesterday this happened again, but this time everything was stuck for ~4 hours and then died with the error "The process handling this request unexpectedly died. This is likely to cause a new process to be used for the next request to your application. (Error code 203)". These requests also have no logs other than the error message. – Headpin
In the trace we see urlfetch, memcache, images and task queue calls. Do you know where in your code those calls are made? Are they past the log messages that aren't getting printed? – Skimpy

We eventually managed to solve the problem with the following steps:

  1. We split the module into two - one module to run the cron job and one module to handle the generated tasks. This allowed us to see that the problem was with handling the tasks, as that was the only module that kept getting stuck.
  2. We limited the number of concurrent tasks to 2, which seems to be the maximum number that can be processed simultaneously without the system getting stuck (a sketch of the adjusted queue configuration is shown below).
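For reference, the concurrency limit from step 2 corresponds to a queue configuration along these lines (the <target> value is a placeholder for the name of the new task-handling module):

<rate>5/s</rate>
<max-concurrent-requests>2</max-concurrent-requests>
<target>task-handler-module</target>
<mode>push</mode>
<retry-parameters>
    <task-retry-limit>5</task-retry-limit>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>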
Headpin answered 5/3, 2016 at 15:58 Comment(3)
Did you see any slowness while handling tasks after you set concurrent tasks to 2? Are you facing any other issues with this approach? – Emlynn
We didn't have any problems with this approach, and we found that the throughput of the tasks was good enough for us. – Headpin
I've run across this issue, and I also don't have a foolproof solution, but I found that it may be related to write-contention issues on the database. Eliminating writes to the same objects across threads can help. – Andvari
