Heroku Sporadic High Response Time

This is very specific, but I will try to be brief:

We are running a Django app on Heroku. Three servers:

  1. test (1 web, 1 celery dyno)
  2. training (1 web, 1 celery dyno)
  3. prod (2 web, 1 celery dyno).

We are using Gunicorn with gevent and 4 workers on each dyno.
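
For reference, here is a minimal sketch of that Gunicorn setup (the file name, project module, and exact values are illustrative, not our production config):

# gunicorn.conf.py -- illustrative sketch of the setup described above
workers = 4               # 4 workers per dyno
worker_class = "gevent"   # async workers via gevent
timeout = 30              # Gunicorn's default worker timeout, in seconds

# Procfile entry (hypothetical project name):
#   web: gunicorn myproject.wsgi --config gunicorn.conf.py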

We are experiencing sporadic high service times. Here is an example from Logentries:

High Response Time:
heroku router - - at=info 
method=GET 
path="/accounts/login/" 
dyno=web.1 
connect=1ms 
service=6880ms 
status=200 
bytes=3562

I have been Googling this for weeks now. We are unable to reproduce it at will but see these alerts 0 to 5 times a day. Notable points:

  • Occurs on all three apps (all running similar code)
  • Occurs on different pages, including simple pages such as 404 and /admin
  • Occurs at random times
  • Occurs with varying throughput. One of our instances only drives 3 users/day. It is not related to sleeping dynos because we ping with New Relic and the issue can occur mid-session
  • Unable to reproduce at will. I have experienced this issue personally once. Clicking a page that normally executes in 500ms resulted in a 30 second delay and eventually an app error screen from Heroku's 30s timeout
  • High response times vary from 5000ms to 30000ms.
  • New Relic does not point to a specific issue. Here are the past few transactions and times:
    • RegexURLResolver.resolve 4,270ms
    • SessionMiddleware.process_request 2,750ms
    • Render login.html 1,230ms
    • WSGIHandler 1,390ms
    • The above are simple calls and do not normally take anywhere near that long (see the timing sketch after this list)
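
To double-check numbers like the above outside of New Relic, a simple timing middleware can log any request that crosses a threshold. Below is a minimal sketch using Django's newer middleware style; the module path, logger name, and threshold are illustrative:

# slow_requests.py -- minimal sketch; add the class to MIDDLEWARE in settings.py
import logging
import time

logger = logging.getLogger("slow_requests")

SLOW_THRESHOLD_SECONDS = 2.0  # log anything slower than this (illustrative)


class SlowRequestLoggingMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD_SECONDS:
            logger.warning(
                "Slow request: %s %s took %.0fms (status %s)",
                request.method, request.path,
                elapsed * 1000, response.status_code,
            )
        return response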

What I have narrowed it down to:

  1. This article on Gunicorn and slow clients
    • I have seen this issue happen with slow clients but also at our office where we have a fiber connection.
  2. Gevent and async workers not playing nicely
    • We've switched to Gunicorn sync workers and the problem still persists.
  3. Gunicorn worker timeout
    • It's possible that workers are somehow being kept alive in a null state (see the worker_abort sketch after this list).
  4. Insufficient workers / dynos
    • No indication of CPU/memory/DB overutilization, and New Relic doesn't show any DB latency
  5. Noisy Neighbors
    • Across my multiple emails with Heroku, the support rep mentioned that at least one of my long requests was due to a noisy neighbor, but was not convinced that was the issue.
  6. Subdomain 301
    • The requests are coming through fine, but getting stuck randomly in the application.
  7. Dynos restarting
    • If this were the case, many users would be affected. Also, I can see that our dynos have not restarted recently.
  8. Heroku routing / service issue
    • It is possible that the Heroku service is less than advertised and this is simply a downside of using their service.
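
Regarding point 3: to see where a timed-out worker was actually stuck, Gunicorn's worker_abort server hook (which fires when a worker is killed after hitting the timeout) can dump a stack trace. A minimal sketch using the sync workers from point 2; the timeout value is illustrative, and faulthandler is part of the Python 3 standard library:

# gunicorn.conf.py -- diagnostic sketch for the worker-timeout theory
import faulthandler

workers = 4
worker_class = "sync"
timeout = 28  # just under Heroku's 30s router timeout (illustrative value)


def worker_abort(worker):
    # Called by Gunicorn when a worker is aborted after a timeout;
    # dump every thread's stack so the log shows where it was hanging.
    worker.log.warning("Worker %s timed out; dumping stack traces", worker.pid)
    faulthandler.dump_traceback()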

We have been having this issue for the past few months, but now that we are scaling it needs to be fixed. Any ideas would be much appreciated as I have exhausted nearly every SO or Google link.

Russell answered 16/3, 2015 at 22:28 Comment(3)
This seems like a good question, but may get better responses at serverfault – Avifauna
@Avifauna thanks but a user over there commented I should move it to SO :) – Russell
oh man -- i don't think it's unreasonable to have it on both. Sounds like it could be a programming or deployment issue -- one site specializes in each. – Avifauna

I have been in contact with the Heroku support team over the past 6 months. It has been a long process of narrowing things down through trial and error, but we have identified the problem.

I eventually noticed these high response times corresponded with sudden memory swaps, and even though I was paying for a Standard dyno (which was not idling), the swaps were taking place when my app had not received traffic recently. It was also clear from the metrics charts that this was not a memory leak, because memory usage would plateau. For example:

[Image: Heroku metrics chart showing a sudden memory swap]
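
For anyone trying to confirm the same correlation in their own logs: Heroku's log-runtime-metrics labs feature (heroku labs:enable log-runtime-metrics) periodically logs per-dyno memory figures, including swap, and a quick script can line those samples up against slow service= entries from the router. A rough sketch, assuming a plain-text log dump and those field names (adjust the regexes if your log format differs):

# correlate_swap.py -- rough sketch: flag slow router requests and nonzero
# swap samples in a saved Heroku log file so they can be eyeballed together.
import re
import sys

SERVICE_RE = re.compile(r"service=(\d+)ms")
SWAP_RE = re.compile(r"sample#memory_swap=([\d.]+)MB")
SLOW_MS = 5000  # treat anything above this as a "slow" request (illustrative)

slow, swapping = 0, 0
with open(sys.argv[1]) as log:
    for line in log:
        service = SERVICE_RE.search(line)
        if service and int(service.group(1)) >= SLOW_MS:
            slow += 1
            print("SLOW:", line.rstrip())
        swap = SWAP_RE.search(line)
        if swap and float(swap.group(1)) > 0:
            swapping += 1
            print("SWAP:", line.rstrip())

print(f"{slow} slow requests, {swapping} nonzero swap samples")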

After many discussions with their support team, I was provided this explanation:

Essentially, what happens is some backend runtimes end up with a combination of applications that end up using enough memory that the runtime has to swap. When that happens, a random set of dyno containers on the runtime are forced to swap arbitrarily by small amounts (note that "random" here is likely containers with memory that hasn't been accessed recently but is still resident in memory). At the same time, the apps that are using large amounts of memory also end up swapping heavily, which causes more iowait on the runtime than normal.

We haven't changed how tightly we pack runtimes at all since this issue started becoming more apparent, so our current hypothesis is that the issue may be coming from customers moving from versions of Ruby prior to 2.1 to 2.1+. Ruby makes up a huge percentage of the applications that run on our platform and Ruby 2.1 made changes to its GC that trades memory usage for speed (essentially, it GCs less frequently to get speed gains). This results in a notable increase in memory usage for any application moving from older versions of Ruby. As such, the same number of Ruby apps that maintained a certain memory usage level before would now start requiring more memory usage.

That phenomenon combined with misbehaving applications that have resource abuse on the platform hit a tipping point that got us to the situation we see now where dynos that shouldn't be swapping are. We have a few avenues of attack we're looking into, but for now a lot of the above is still a little bit speculative. We do know for sure that some of this is being caused by resource abusive applications though and that's why moving to Performance-M or Performance-L dynos (which have dedicated backend runtimes) shouldn't exhibit the problem. The only memory usage on those dynos will be your application's. So, if there's swap it'll be because your application is causing it.

I am confident this is the issue I and others have been experiencing, as it is related to the architecture itself and not to any combination of language/framework/configs.

There doesn't seem to be a good solution other than A) tough it out and wait, or B) switch to one of their dedicated Performance dynos.

I am aware of the crowd that says "This is why you should use AWS", but I find the benefits that Heroku offers to outweigh some occasional high response times and their pricing has gotten better over the years. If you are suffering from the same issue, the "best solution" will be your choice. I will update this answer when I hear anything more.

Good luck!

Russell answered 8/9, 2015 at 19:23 Comment(5)
Great find, and way to stick with it! This definitely fits the description of what I was seeing with my app. I'm still super glad I'm on Linode now though. ;) – Demilitarize
We have been having the same issues as everyone else on this question, and I too have seen these strange swap issues. I never even put two and two together; going to keep an eye out and see if I can spot the overlap. This is very helpful, and hopefully now that they are admitting this exists the problem can be addressed. – Vulgarian
Just checked: we have a similar pattern with memory and lag also. – Vulgarian
Thanks for posting this. We experience the issue as well with a Rails stack. Switching to a Performance-M instance is not the best scenario financially. We have peaks a few hours a day where we might need a performant dyno, but the rest of the day two 2x dynos are enough, scaling with the Adept plugin. Hopefully the issue is going to be resolved soon. – Inconsolable
yeah ... a few years on I don't think this issue has been resolved. A URL of mine on a standard 1x that Logentries says normally responds in 30ms spikes to 30 seconds a few times a day and triggers the timeout. Seen similar results in a few other Heroku apps in different projects too - really simple API endpoints backed by Redis caching that normally return in a few ms come back with timeouts or really close to that 30-second mark. ... considering a move to Docker Cloud deployed to DigitalOcean. I don't enjoy AWS's pricing :( – Reamonn

Not sure if this will help at all, but I'm going through the same thing with a Rails app right now on Heroku -- seemingly nondeterministic, sporadically high request times. For example, HEAD New Relic uptime pings to my site index that normally take 2-5ms taking 5 seconds, or rendering my site login, which is normally sub-second, taking 12 seconds. I also occasionally get random 30s timeouts. Here's what Heroku's support had to say in my case (for some of the instances at least):

The one earlier today looks like a big chunk of Request Queueing following a restart. If you want to avoid those, you might want to take a look at our Preboot feature. This will allow you to boot up a matched set of dynos after a deployment, then turn requests over to them instead of kicking over the existing dynos and forcing the request queueing.

I should note, this was one of Heroku's standard dyno restarts, not a deploy of mine or anything. Despite the caveats on the preboot page, I enabled it a few minutes ago, so we'll see if it makes any difference in my case. Hope that might help, as I've been pulling my hair out over this too!

Demilitarize answered 18/3, 2015 at 1:16 Comment(5)
Thanks Cade! It's good to hear I'm not alone. Our team has experienced high response times during deployments, but my issues above do not occur during those times. Heroku support has yet to respond to me. Let's keep each other posted! – Russell
Currently also experiencing the same on our app on Heroku (High Response Time and Exit Timeout issues revealed by Logentries alerts). Made us deep dive into many aspects to try to figure it out (with positive side effects in terms of optimization...) but without being able to delineate what is deterministic and what is not (at our level). – Finial
By way of follow-up, after having switched on pre-booting, I still got two slow request notifications over the weekend. Pages taking multiple seconds to load that, when re-visited, load almost immediately. So, still no answers here. More and more I'm contemplating just biting the bullet and moving to AWS, as I'm about at the end of what I can do to rectify this with Heroku. :( – Demilitarize
@Demilitarize - checking in: any new info around the situation? If you moved to AWS, are you still experiencing the issue? – Russell
I actually got so fed up I moved to Linode. Since moving to Linode in April, the sporadic slow response times have completely gone away. Additional upside: my response times and app performance are several times faster and I'm paying 78% less. Downside: devops is now my problem. – Demilitarize
