What response times can be expected from GAE/NDB?

We are currently building a small and simple central HTTP service that maps "external identities" (like a Facebook ID) to an "internal (uu)id" that is unique across all our services, to help with analytics.

The first prototype in "our stack" (flask+postgresql) was done within a day. But since we want the service to (almost) never fail and scale automagically, we decided to use Google App Engine.

After a week of reading, trying, and benchmarking, this question has emerged:

What response times are considered "normal" on App Engine (with NDB)?

We are getting response times that are consistently above 500ms on average and well above 1s at the 90th percentile.

I've attached a stripped-down version of our code below, hoping somebody can point out the obvious flaw. We really like the autoscaling and the distributed storage, but we cannot imagine that 500ms really is the expected performance in our case. The SQL-based prototype responded much faster (and consistently so), hosted on a single Heroku dyno using the free, cache-less PostgreSQL (even with an ORM).

We tried both synchronous and asynchronous variants of the code below and looked at the appstats profile. It's always the RPC calls (both memcache and datastore) that take very long (50ms-100ms each), made worse by the fact that there are always multiple calls (e.g. mc.get() + ds.get() + ds.set() on a write). We also tried deferring as much as possible to the task queue, without noticeable gains.

import json
import uuid

from google.appengine.ext import ndb

import webapp2
from webapp2_extras.routes import RedirectRoute


def _parse_request(request):
    if request.content_type == 'application/json':
        try:
            body_json = json.loads(request.body)
            provider_name = body_json.get('provider_name', None)
            provider_user_id = body_json.get('provider_user_id', None)
        except ValueError:
            return webapp2.abort(400, detail='invalid json')
    else:
        provider_name = request.params.get('provider_name', None)
        provider_user_id = request.params.get('provider_user_id', None)

    return provider_name, provider_user_id


class Provider(ndb.Model):
    name = ndb.StringProperty(required=True)


class Identity(ndb.Model):
    user = ndb.KeyProperty(kind='GlobalUser')


class GlobalUser(ndb.Model):
    uuid = ndb.StringProperty(required=True)

    @property
    def identities(self):
        return Identity.query(Identity.user==self.key).fetch()


class ResolveHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def post(self):
        provider_name, provider_user_id = _parse_request(self.request)

        if not provider_name or not provider_user_id:
            return self.abort(400, detail='missing provider_name and/or provider_user_id')

        identity = ndb.Key(Provider, provider_name, Identity, provider_user_id).get()

        if identity:
            user_uuid = identity.user.id()
        else:
            user_uuid = uuid.uuid4().hex

            GlobalUser(
                id=user_uuid,
                uuid=user_uuid
            ).put_async()

            Identity(
                parent=ndb.Key(Provider, provider_name),
                id=provider_user_id,
                user=ndb.Key(GlobalUser, user_uuid)
            ).put_async()

        return webapp2.Response(
            status='200 OK',
            content_type='application/json',
            body = json.dumps({
                'provider_name' : provider_name,
                'provider_user_id' : provider_user_id,
                'uuid' : user_uuid
            })
        )

app = webapp2.WSGIApplication([
      RedirectRoute('/v1/resolve', ResolveHandler, 'resolve', strict_slash=True)
], debug=False)

For completeness' sake, here is the (almost default) app.yaml:

application: GAE_APP_IDENTIFIER
version: 1
runtime: python27
api_version: 1
threadsafe: yes

handlers:
- url: .*
  script: main.app

libraries:
- name: webapp2
  version: 2.5.2
- name: webob
  version: 1.2.3

inbound_services:
- warmup
Descendent answered 14/2, 2013 at 15:25 Comment(0)

In my experience, RPC performance fluctuates by orders of magnitude, anywhere between 5ms and 100ms for a datastore get. I suspect it's related to the load on the GAE datacenter. Sometimes it gets better, sometimes it gets worse.

Your operation looks very simple. I expect that with 3 requests, it should take about 20ms, but it could be up to 300ms. A sustained average of 500ms sounds very high though.
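To spell out the arithmetic behind that estimate, here is a rough sketch. It assumes the three RPCs on a write run strictly one after another and that each one lands somewhere in the 5ms-100ms range I mentioned; the function name and constants are just for illustration, not anything from the App Engine SDK.

```python
# Rough latency budget for a handler that issues several RPCs in sequence.
# The 5-100 ms per-RPC range is the anecdotal figure from this answer,
# not a measured or guaranteed number.
RPC_MIN_MS = 5
RPC_MAX_MS = 100


def sequential_budget_ms(num_rpcs, rpc_min=RPC_MIN_MS, rpc_max=RPC_MAX_MS):
    """Return (best, worst) total latency when the RPCs do not overlap."""
    return num_rpcs * rpc_min, num_rpcs * rpc_max


# A write in the question does mc.get() + ds.get() + ds.set(), i.e. 3 RPCs.
best, worst = sequential_budget_ms(3)
print("3 sequential RPCs: %d-%d ms" % (best, worst))
```

Even the worst case under these assumptions stays below the 500ms average reported in the question, which is why that number looks too high to be explained by RPC latency alone.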

ndb does local caching when fetching objects by ID. That should kick in if you're accessing the same users, and those requests should be much faster.

I assume you're doing perf testing on production and not on dev_appserver. dev_appserver performance is not representative.

Not sure how many iterations you've tested, but you might want to try a larger number to see if 500ms is really your average.
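If it helps, this is a minimal sketch of how one might compute the mean and the 90th percentile from a list of recorded response times. It uses the nearest-rank definition of a percentile (other tools may interpolate and give slightly different values), and the sample latencies are made up purely for illustration.

```python
import math


def mean(samples):
    """Arithmetic mean of the samples."""
    return sum(samples) / float(len(samples))


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = int(math.ceil(pct * len(ordered) / 100.0))
    return ordered[max(rank, 1) - 1]


# Made-up response times in ms, for illustration only.
latencies_ms = [120, 95, 480, 110, 1300, 105, 98, 530, 101, 115]
print("mean: %.1f ms" % mean(latencies_ms))
print("p90:  %d ms" % percentile(latencies_ms, 90))
```

With only a handful of samples, a single slow request dominates both numbers, which is the reason to run more iterations before trusting the average.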

When you're blocked on simple RPC calls, there's not much optimizing you can do.

Griffiths answered 14/2, 2013 at 16:15 Comment(5)
Yep, you are right about dev_appserver's performance (sqlite on SSD...), so we test on production (on a paid account, even). Concerning the iterations, we usually keep the tests running for around 5 minutes. We also try to make sure each run has comparable amounts of hits/misses (by emptying the datastore/memcache between runs or by playing around with the range 'provider_user_id' is in). - Descendent
One note: if you're running a big benchmark, you have to spin up your traffic gradually (say, over 5-10 minutes) and then sustain it for a while (another 5-10 minutes) to measure realistic effects. App Engine won't spin up the necessary instances immediately when your load goes from 0 to 100; there's a "governor" on this process to avoid instabilities. - Sapwood
I just read about HRD's "one write per second per entity group" behavior. In the code above, wouldn't this explain our issues? There are only a handful of providers (mostly Facebook), and Identity has Provider as a parent, which puts all identities of one provider into the same entity group? - Descendent
I haven't profiled entity group performance. Yes, that could be a culprit, but I'm not certain. With the code you've shown, the entity group is not necessary, since you're not actually making any ancestor queries. Getting entities by id is already strongly consistent. Since you're getting entities by id, I wonder if there's any significant perf impact from the entity group in your testing. Since your users are not well distributed among entity groups, you will see perf hits and datastore errors as you scale up your data set. - Griffiths
Getting rid of the entity group sadly did not improve the results. Thanks for your input; we'll probably stick with GAE + Cloud SQL for now. - Descendent

The first obvious thing I see: do you really need a transaction on every request?

I believe that, unless most of your requests create new entities, it's better to do the .get_by_id() outside of a transaction. Only if the entity is not found should you start a transaction or, even better, defer the creation of the entity.

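from google.appengine.ext import ndb
from google.appengine.ext.deferred import defer


def request_handler(key, data):
    # Plain get by key, outside of any transaction.
    entity = key.get()
    if entity:
        return 'ok'
    else:
        # Not found: create it in the background via the task queue.
        defer(_deferred_create, key, data)
        return 'ok'


def _deferred_create(key, data):
    @ndb.transactional
    def _tx():
        # Re-check inside the transaction so concurrent tasks
        # do not create the entity twice.
        entity = key.get()
        if not entity:
            # CreateEntity is a placeholder for your own constructor.
            entity = CreateEntity(data)
            entity.put()
    _tx()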

That should give much better response times for user-facing requests.

The second, and only other, optimization I see is to use ndb.put_multi() to reduce the number of RPC calls.

P.S. I'm not 100% sure, but you can try disabling multithreading (threadsafe: no) to get more stable response times.

Jemimah answered 15/2, 2013 at 22:14 Comment(0)
