Strange TTFB (time to first byte) issue on Heroku

We're in the process of improving the performance of our rails app hosted on Heroku (rails 3.2.8 and ruby 1.9.3). Along the way we've come across one alarming problem whose source has proven extremely difficult to track down. Let me quickly explain how we experience the problem and how we've tried to isolate it.

--

Since around June we've experienced weird lag behavior in Time To First Byte all over the site. The problem is obvious from using the site (sometimes the application doesn't respond for 10-20 seconds), and it's also visible in waterfall analysis via webpagetest.org. We're based in Denmark, but we get this result from any host.

To confirm the problem we performed a benchmark test where we sent 300 identical requests to a simple page and measured the response time. If we send 300 requests to the front page the median response time is below 1 second, which is fairly good. What scares us is that 60 requests take more than double that time and 40 of those take more than 4 seconds. Some requests take as much as 16 seconds.
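
As an illustration, here is a minimal sketch of the kind of benchmark script we mean (not our exact code, and the URL is a placeholder): it sends the requests sequentially and summarizes the response-time distribution.

# Sketch of the benchmark: send N sequential GET requests and
# summarize the response-time distribution.
require 'net/http'

uri   = URI('http://www.example.com/')   # placeholder - the page under test
times = []

300.times do
  started = Time.now
  Net::HTTP.get_response(uri)
  times << (Time.now - started)
end

sorted = times.sort
median = sorted[sorted.size / 2]
puts "median:      #{(median * 1000).round} ms"
puts "> 2x median: #{times.count { |t| t > median * 2 }} requests"
puts "> 4 seconds: #{times.count { |t| t > 4 }} requests"
puts "slowest:     #{sorted.max.round(2)} s"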

None of these slow requests show up in New Relic, which we use for performance monitoring. No request queuing shows up, and the results are the same no matter how high we scale our web processes. Still, we couldn't rule out that the problem was caused by application code, so we tried another experiment where we responded to the request via rack middleware.

By placing this middleware (TestMiddleware) at the beginning of the rack stack, we returned a response before the request even hit the application, ensuring that none of the following middleware or the rails app could cause the delay.
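
(A simplified sketch of what TestMiddleware looked like - not the exact code: it short-circuits every request and returns a tiny static response without calling the rest of the stack. The insert_after line assumes the middleware is registered in config/application.rb.)

# Simplified sketch of TestMiddleware: respond immediately, never calling
# the remaining middleware or the Rails app.
class TestMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    [200, { 'Content-Type' => 'text/plain' }, ['test']]
  end
end

# In config/application.rb:
# config.middleware.insert_after ActionDispatch::Static, 'TestMiddleware'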

Middleware setup:
$ heroku run rake middleware
use Rack::Cache
use ActionDispatch::Static
use TestMiddleware
use Rack::Rewrite
use Rack::Lock
use Rack::Runtime
use Rack::MethodOverride
use ActionDispatch::RequestId
use Rails::Rack::Logger
use ActionDispatch::ShowExceptions
use ActionDispatch::DebugExceptions
use ActionDispatch::RemoteIp
use Rack::Sendfile
use ActionDispatch::Callbacks
use ActiveRecord::ConnectionAdapters::ConnectionManagement
use ActiveRecord::QueryCache
use ActionDispatch::Cookies
use ActionDispatch::Session::DalliStore
use ActionDispatch::Flash
use ActionDispatch::ParamsParser
use ActionDispatch::Head
use Rack::ConditionalGet
use Rack::ETag
use ActionDispatch::BestStandardsSupport
use NewRelic::Rack::BrowserMonitoring
use Rack::RailsExceptional
use OmniAuth::Builder
run AU::Application.routes

We then ran the same script to document response time and got pretty much the same result. The median response time was around 130ms (obviously faster because it doesn't hit the app), but 60 requests still took more than 400ms and 25 requests took more than 1 second. Again, some requests were as slow as 16 seconds.

One explanation could be slow hops on the network or the DNS setup, but the results of traceroute look perfectly OK.

This was further confirmed by running the same script against another rails 3.2 and ruby 1.9.3 application hosted on Heroku - that app showed no weird behavior at all.

The DNS setup follows Heroku's recommendations.

--

We're confused to say the least. Could there be something fishy with Heroku's routing network? Why the heck are we seeing this weird behavior? How do we get rid of it? And why can't we see it in New Relic?

Dishonor answered 29/8, 2012 at 15:11 Comment(4)
Is heroku terminating your instances because of inactivity (if you're on the free plan)? - Undoubted
No, unfortunately not. We are running 3 dynos. - Dishonor
Are the requests that you are sending to test your app sequential or in parallel? I.e. do you ever send more than 3 requests at one time (which would cause queueing)? - Bordie
Try removing your dns setup from the chain - hit the app at xxxx.herokuapp.com (or whatever it is on bamboo) instead of your own domain. If that fails, time to talk to heroku support I think. - Barbet

It turned out that it was a kind of request queuing. Sometimes a web server was busy, and since heroku just routes incoming requests randomly to any dyno, a request could end up queued behind a dyno that was totally stuck due to e.g. database problems. The strange thing is that this was hardly noticeable in New Relic (it's a good idea to uncheck all other resources when viewing their charts - then the queuing suddenly appears).
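
(To illustrate why random routing behaves this way, here is a toy simulation - my own sketch, not Heroku's actual router: with one stuck dyno, a random router keeps sending a share of requests into its queue even though the other dynos are idle.)

# Toy illustration: random routing can trap requests behind one stuck dyno
# even when the other dynos are idle. Numbers are made up.
DYNOS        = 3
SERVICE_TIME = [0.1, 0.1, 10.0]   # the third dyno is "stuck", e.g. waiting on the database

free_at = Array.new(DYNOS, 0.0)   # when each dyno finishes its current queue
times   = []

300.times do |i|
  arrival = i * 0.2                        # a new request every 200 ms
  dyno    = rand(DYNOS)                    # the router picks a dyno at random
  start   = [arrival, free_at[dyno]].max   # wait until that dyno is free
  free_at[dyno] = start + SERVICE_TIME[dyno]
  times << free_at[dyno] - arrival         # total response time for this request
end

sorted = times.sort
puts "median response: #{sorted[sorted.size / 2].round(2)} s"
puts "worst response:  #{sorted.max.round(2)} s"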

EDIT 21/2 2013: It has turned out that the reason it was hardly noticeable in New Relic was that it wasn't measured! http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
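
(If you want to see this queue time yourself instead of relying on the agent, Heroku's router adds an X-Request-Start header with a timestamp set at the router; a small piece of middleware can log the difference between that timestamp and the time the request reaches the dyno. Sketch only - the header's unit has varied over time (milliseconds vs microseconds), so adjust the divisor for your stack.)

# Sketch: log how long a request sat between Heroku's router and the dyno.
class QueueTimeLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    if (stamp = env['HTTP_X_REQUEST_START'])
      router_time = stamp.gsub(/[^0-9]/, '').to_f / 1000.0   # assumes milliseconds
      queue_time  = Time.now.to_f - router_time
      Rails.logger.info("queue_time=#{(queue_time * 1000).round}ms")
    end
    @app.call(env)
  end
end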

We find this very frustrating, and we ended up leaving Heroku in favor of dedicated servers. This gave us 20 times better performance at 1/10 of the cost. Additionally, I must say that we are disappointed with Heroku, who at the time this happened denied that the slowness was due to their infrastructure, even though we suspected it and highlighted it several times. We even got answers like this back:

Heroku 28/8 2012: "If you're not seeing request queueing or other slowness reported in New Relic, then this is likely not a server-side issue. Heroku's internal routing should take <1ms. None of our monitoring systems are indicating any routing problems currently."

Additionally, we spoke to New Relic, who also seemed unaware of the issue, even though they, by their own account, have a very close working relationship with Heroku.

Newrelic 29/8 2012: "It looks like whatever is causing this is happening before the Ruby agent's visibility starts. The queue time that the agent records is from the time the request enters a dyno, so the slow down is occurring before then."

The bottom line was that we ended up spending hours and hours optimizing code that wasn't really the bottleneck, and running at far too high a dyno scale in a desperate attempt to boost our performance. The only thing we really got from this was bigger bills from both Heroku and New Relic - NOT COOL. I'm glad that we changed.

PS. At that time there was even a bug that caused New Relic Pro to be charged on ALL dynos, even though we had, following New Relic's own advice, disabled monitoring on our background worker processes. It took a lot of time and many emails before the mistake was admitted by both parties.

PPS. If you are not aware of the ongoing discussion, here is the link: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics

EDIT 26/2 2013: Heroku has just announced in their newsletter that New Relic has released an update that should apparently shed some light on the situation at Heroku.

EDIT 8/4 2013: Heroku has just released an FAQ on the topic.

Dishonor answered 31/10, 2012 at 9:9 Comment(3)
"We find this very frustrating, and we ended up leaving Heroku in favor of dedicated servers. This gave us 20 times better performance at a 1/10 of the cost." - is it "rhetorical exaggeration" or some how accurate approximations? If it's not exaggeration could you provide some more info tests/logs output, how many dynos/servers did/do you use?Review
It's close to accurate. Since then our server capacity has grown even more, so it's hard to compare now to then. However I remember that our avg. server load time in new relic went down from on avg. 1 sec + all the stange delays caused by bad routing (which was often 30 sec) to a stable 150ms on avg. On top of this we now have a huge overcapacity on background processing where I can easily run many times more sidekiq workers than I could with 8-10 dynos before. My setup is running on Hetzner see for yourself. We use these among others hetzner.de/hosting/produkte_rootserver/ex6sDishonor
I should mention that things come at a cost - there is not much help to get at Hetzner if you have problems - software wise, you are completely on your own, hardware wise it's a little better but response time is not very good (often 1 hour). So it's up to you to have more server and a good failover setup. However so far things has been way more stable and way better uptime for us, that it ever was on Heroku back then. I must say though, that I miss Herokus support and all their good documentation and their nifty tools. After all they really have a cool product for a lot of things.Dishonor

traceroute is not a good measure of problems in the network; it's a tool that can find failures along the path, but it won't give you the full picture.

Try just putting up a static webpage and hitting it with the IP address from your webpage tester. If it is still slow, blame the network.

If for some reason it is fast, then you have a different issue.
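
(For example, a rough way to time such a request straight to an IP address, bypassing DNS - the IP and hostname below are placeholders, not the real ones:)

# Sketch: time a request sent directly to an IP, with the Host header set explicitly.
require 'net/http'

ip   = '203.0.113.10'       # placeholder IP
host = 'www.example.com'    # placeholder hostname

http    = Net::HTTP.new(ip, 80)
request = Net::HTTP::Get.new('/static.html')
request['Host'] = host

started  = Time.now
response = http.request(request)
puts "status #{response.code}, #{((Time.now - started) * 1000).round} ms"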

Workbag answered 28/10, 2012 at 23:15 Comment(0)
