We're in the process of improving performance of the our rails app hosted at Heroku (rails 3.2.8 and ruby 1.9.3). During this we've come across one alarming problem for which the source seems to be extremely difficult to track. Let me quickly explain how we experience the problem and how we've tried to isolate it.
--
Since around June we've experienced weird lag behavior in Time to First Byte all over the site. The problems is obvious from using the site (sometimes the application doesn't respond for 10-20 seconds), and it's also present in waterfall analysis via webpagetest.org. We're based in Denmark but get this result from any host.
To confirm the problem we've performed a benchmark test where we send 300 identical requests to a simple page and measured the response time. If we send 300 requests to the front page the median response time is below 1 second, which is fairly good. What scares us is that 60 requests takes more that double that time and 40 of those takes more than 4 seconds. Some requests take as much as 16 seconds.
None of these slow requests show up in New Relic, which we use for performance monitoring. No request queuing shows up and the results are the same no matter how high we scale our web processes. Still, we couldn't reject that the problem was caused by application code, so we tried another experiment where we responded to the request via rack middleware.
By placing this middleware (TestMiddleware) at the beginning of the rack stack, we returned a request before it even hit the application, ensuring that none of the following middleware or the rails app could cause the delay.
Middleware setup:
$ heroku run rake middleware
use Rack::Cache
use ActionDispatch::Static
use TestMiddleware
use Rack::Rewrite
use Rack::Lock
use Rack::Runtime
use Rack::MethodOverride
use ActionDispatch::RequestId
use Rails::Rack::Logger
use ActionDispatch::ShowExceptions
use ActionDispatch::DebugExceptions
use ActionDispatch::RemoteIp
use Rack::Sendfile
use ActionDispatch::Callbacks
use ActiveRecord::ConnectionAdapters::ConnectionManagement
use ActiveRecord::QueryCache
use ActionDispatch::Cookies
use ActionDispatch::Session::DalliStore
use ActionDispatch::Flash
use ActionDispatch::ParamsParser
use ActionDispatch::Head
use Rack::ConditionalGet
use Rack::ETag
use ActionDispatch::BestStandardsSupport
use NewRelic::Rack::BrowserMonitoring
use Rack::RailsExceptional
use OmniAuth::Builder
run AU::Application.routes
We then ran the same script to document response time and got pretty much the same result. The median response time was around 130ms (obviously faster because it doesn't hit the app. But still 60 requests took more than 400ms and 25 requests took more than 1 second. Again, with some requests as slow as 16 seconds.
One explanation could be related to slow hops on the network or DNS setup, but the results of traceroute looks perfectly OK.
This result was confirmed from running the response script on another rails 3.2 and ruby 1.9.3 application hosted on Heroku - no weird behavior at all.
The DNS setup follows Heroku's recommendations.
--
We're confused to say the least. Could there be something fishy with Heroku's routing network? Why the heck are we seeing this weird behavior? How do we get rid of it? And why can't we see it in New Relic?