How to profile inconsistent H12 timeouts on Heroku

My users are seeing occasional request timeouts on Heroku. Unfortunately I cannot reproduce them consistently, which makes them really hard to debug. There's plenty of opportunity to improve performance - e.g. by reducing the huge number of database queries per request and by adding more caching - but without profiling that's a shot in the dark.

According to our New Relic analytics, many requests take between 1 and 5 seconds on the server. I know that's too slow, but it's nowhere near the 30 seconds needed to trigger the timeout.

The error tab in New Relic shows me several different database queries at which the timeout occurs, but these aren't particularly slow queries and the query differs from crash to crash. Also, for the same URL it sometimes shows a database query and sometimes doesn't.

How do I find out what's going on in these particular cases? E.g. how do I see how much time a request spent in the database when the timeout occurred, as opposed to the time it spends in the database when there's no error?

One hypothesis I have is that the database gets locked in some cases, perhaps by a combination of reads and writes.
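
For reference, this is roughly how I'd check for blocked queries from a Rails console while the problem is happening (just a sketch; it assumes PostgreSQL 9.2+, where pg_stat_activity exposes pid/query/state columns):

    # Lists locks that were requested but not granted, together with the
    # waiting query. Run via `heroku run rails console` while the site is slow.
    sql = <<-SQL
      SELECT l.pid, l.mode, l.relation::regclass AS locked_relation,
             a.state, a.query
      FROM pg_locks l
      JOIN pg_stat_activity a ON a.pid = l.pid
      WHERE NOT l.granted
    SQL
    ActiveRecord::Base.connection.select_all(sql).each { |row| puts row.inspect }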

Spanos answered 17/1, 2013 at 10:34 Comment(4)
Have you viewed the logs? Next time the issue happens, immediately go to your command prompt and enter 'heroku logs'. Posting the logs from the error will help us solve the issue.Megrims
@BrianPetro I do have the logs, but they are different for each case - even on the same page - because it terminates in a different place every time. That's why I'm looking for a more generic way to debug this.Spanos
Update your post with some logs or the most relevant code. Otherwise I'm afraid I can't be of much help.Megrims
New Relic only shows average times in its graphs. I've found that unless you have very little traffic during an error such as this, they are of little help. The way to see individual traces is to look at the "app server traces"... that is, assuming New Relic chose one of your slow transactions to profile.Vuong

You may have already seen it, but Heroku has a doc with some good background about request timeouts.

If your requests are taking a long time, and the processes servicing them are not being killed before the requests complete, then they should be generating transaction traces that will provide details about individual transactions that took too long.
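
If traces are being recorded but don't show where the time goes, one option is to add a custom method tracer around the suspect code so it appears as its own segment in the transaction trace. A sketch, assuming the newrelic_rpm gem; ReportsController and build_report are hypothetical stand-ins for your own slow code path:

    class ReportsController < ApplicationController
      include ::NewRelic::Agent::MethodTracer

      def show
        @report = build_report
      end

      private

      def build_report
        # ...expensive queries and rendering work...
      end

      # Record build_report as its own segment in New Relic transaction traces.
      add_method_tracer :build_report, 'Custom/ReportsController/build_report'
    end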

If you're using Unicorn, it's possible that this is not happening because the requests are taking long enough that they're hitting up against Unicorn's timeout (after which the workers servicing those requests will be forcibly killed, not giving the New Relic agent enough time to report back in).

I'd recommend a two-step approach:

  1. Configure the rack-timeout middleware with a timeout below Heroku's 30s limit. If this works, it will terminate requests that exceed the timeout by raising a Timeout::Error, and such requests should then generate transaction traces in New Relic (see the first sketch after this list).
  2. If that yields nothing (which it might, because rack-timeout relies on Ruby's stdlib Timeout class, which has some limitations), you can try bumping the Unicorn request-handling timeout up from its default of 60s (assuming you're using Unicorn; see the second sketch below). Be aware that long-running requests will then tie up a Unicorn worker for longer, which may further slow down your site, so use this as a last resort.
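
A minimal rack-timeout sketch for step 1. The configuration API differs between versions - older releases used Rack::Timeout.timeout = ..., newer ones read the RACK_TIMEOUT_SERVICE_TIMEOUT env var or accept a service_timeout: option when you insert the middleware yourself:

    # config/initializers/rack_timeout.rb
    # Keep the timeout below Heroku's 30s H12 limit; 25s here is just an example.
    Rails.application.config.middleware.insert_before(
      Rack::Runtime, Rack::Timeout, service_timeout: 25
    )

And a Unicorn excerpt for step 2; keep the rest of your existing config as it is:

    # config/unicorn.rb (excerpt)
    # Default is 60s. Raising it means workers handling long requests are not
    # force-killed before the New Relic agent can record the trace.
    timeout 90
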
Gravitation answered 9/4, 2013 at 5:43 Comment(0)

Two years late here. I have minimal experience with Ruby, but for Django the issue with Gunicorn is that it doesn't handle slow clients on Heroku properly: requests aren't pre-buffered, so a server connection can be left waiting (blocking). This might be a helpful article for you, although it applies primarily to Gunicorn and Python.

Underclay answered 10/3, 2015 at 22:29 Comment(0)

You're pretty clearly hitting the issue with long-running requests. Check out http://artsy.github.com/blog/2013/02/17/impact-of-heroku-routing-mesh-and-random-routing/ and upgrade to New Relic RPM 3.5.7.59 - from that version on, queue wait time is reported accurately.
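
If you go that route, pinning the agent in your Gemfile and re-bundling is all it takes (a sketch; the version constraint just reflects the release mentioned above):

    # Gemfile
    gem 'newrelic_rpm', '>= 3.5.7.59'

    # then run: bundle update newrelic_rpm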

Petit answered 21/2, 2013 at 6:21 Comment(0)
