App engine consistent latency spikes under low load
Asked Answered
N

1

7

I've noticed periodic, but consistent latency spikes from my app running on app engine. At first I thought the network may be slow but app stats confirmed that not to be the case.

I've been able to reproduce the latency spikes using older and newer versions of the SDKs, currently I'm using the following:

  • App engine SDK: 1.9.42
  • Google cloud endpoints: 1.9.42
  • Objectify: 5.1.13
  • Appstats (to debug network latency)

So usage on the app is pretty low, over the last 30 days I'm generally under 0.04 requests a second:

requests per second

Most work is done with one instance as well: enter image description here

Most operations' latency is well under a second, but an alarming number of requests take 10-30x longer.

Latency density distribution

5% of requests take 23 seconds or longer...

So I figured it must just be network latency, but every appstat of a slower operation disproved this. Datastore and the network have always been incredibly reliable. Here is the anatomy of a slow request taking over 30 seconds:

app stats of operation taking 31 seconds

Here is the anatomy of a normal request: enter image description here

At a high level my code is pretty uninteresting: it's a simple api that makes a few network calls and saves / reads data from the cloud datastore. The entire source can be found on github here. The app runs on a single auto scaling app engine instance and is warmed up.

CPU usage over the last month doesnt seem to show anything exciting either: enter image description here

It's really strange to see that even for quick operations, a huge percentage of time is spent on the CPU, even though the code simply creates a few objects, persists them, and returns JSON. I'm wondering if the CPU is getting pegged on my app engine instance by another app which could be causing performance to periodically degrade.

My appengine.xml config looks like this:

<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
    <application>sauce-sync</application>
    <version>1</version>
    <threadsafe>true</threadsafe>
    <automatic-scaling>
        <!-- always keep an instance up in order to keep startup time low-->
        <min-idle-instances>1</min-idle-instances>
    </automatic-scaling>
</appengine-web-app>

And my web.xml looks like this:

<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="2.5">
    <servlet>
        <servlet-name>SystemServiceServlet</servlet-name>
        <servlet-class>com.google.api.server.spi.SystemServiceServlet</servlet-class>
        <init-param>
            <param-name>services</param-name>
            <param-value>com.sauce.sync.SauceSyncEndpoint</param-value>
        </init-param>
    </servlet>
    <servlet-mapping>
        <servlet-name>SystemServiceServlet</servlet-name>
        <url-pattern>/_ah/spi/*</url-pattern>
    </servlet-mapping>

    <!--reaper-->
    <servlet>
        <servlet-name>reapercron</servlet-name>
        <servlet-class>com.sauce.sync.reaper.ReaperCronServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>reapercron</servlet-name>
        <url-pattern>/reapercron</url-pattern>
    </servlet-mapping>

    <servlet>
        <servlet-name>reaper</servlet-name>
        <servlet-class>com.sauce.sync.reaper.ReaperServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>reaper</servlet-name>
        <url-pattern>/reaper</url-pattern>
    </servlet-mapping>


    <welcome-file-list>
        <welcome-file>index.html</welcome-file>
    </welcome-file-list>

    <filter>
        <filter-name>ObjectifyFilter</filter-name>
        <filter-class>com.googlecode.objectify.ObjectifyFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>ObjectifyFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>
</web-app>

TLDR I'm completely stuck and I'm not sure how to debug or fix this issue and I'm starting to think this is business as usual for smaller apps on app engine.

I'm thinking about turning off the resident instance for a while hoping that my app has just been running some bunk hardware or along side an app that is consuming alot of resources. Has anyone run into similar performance issues or knows of additional ways to profile your app?

EDIT:

I've tried running on 1 resident instance, I've also tried setting concurrent requests at 2-4 per this question with no results. Logs and appstats both confirm that an inordinate amount of time is spent waiting for my code to initially run. Here is a request that takes 25 seconds before my first line of code is run, not sure what endpoints / app engine is doing in this time.

25 seconds before my code is run

Again load is still low and this is request is on a warmed up instance.

EDIT 2:

Seems like for whatever reason app engine + endpoints doesnt play well with min-idle-instances set. Reverting to the default app engine config fixed my problem.

enter image description here

Ninefold answered 21/10, 2016 at 14:33 Comment(3)
potentially related (but that one is seen under load): #37307961Karen
How many instances are typically active? Even though you have minimum idle instances set to 1, that doesn't mean there isn't a delay for new instances coming up.Officialese
Generally 1 instance, attached a graph with the number of instances to the original. I'm not sure thats indicative of much though, even cold instances booting up take far shorter than 30 seconds to start and complete a request. It seems more likely the high latency is causing additional nodes to spawn. Also all slow requests occur on the first instance as loading_request is not set for these slow requestsNinefold
W
3

I don't have an answer, but I can offer you some debugging tips.

Appstats may or may not be reporting correctly. However, log messages get timestamped. Log before & after each of the RPC operations. That should give you some insight.

30s delays sound a lot like warmup requests, which should be marked clearly in the logs. One thing that I've found in the past is that setting any resident instances for low-traffic apps (unintuitively) tended to route a lot of requests to cold instances. Use the default setting and set up a cron task to ping and endpoint once per minute.

Western answered 21/10, 2016 at 17:44 Comment(3)
App stats, logging, and google's trace tools on the cloud console have all confirmed RPCs are generally very quick. None of the slow requests are happening when 'loading_request=1' either. I'll look into your other suggestions though, I havent tried tinkering too much with resident instances. PS I love Objectify :)Ninefold
@Western if it'd be a warmup request wouldn't the RPCs be clustered towards the end of the trace (as they would only be called after the instance starts up)?Karen
Unintuitive indeed, for whatever reason the default config works best even though you'll have to incur loading requests once in a while ¯\_(ツ)_/¯Ninefold

© 2022 - 2024 — McMap. All rights reserved.