Using yield with multiple ndb.get_multi_async

3

8

I am trying to make my current App Engine Datastore lookups more efficient. Currently I fetch everything synchronously. My models are:

from google.appengine.ext import ndb

class Hospital(ndb.Model):
    name = ndb.StringProperty()
    # Kinds are given as strings because Building, Room and Bed are defined below.
    buildings = ndb.KeyProperty(kind='Building', repeated=True)

class Building(ndb.Model):
    name = ndb.StringProperty()
    rooms = ndb.KeyProperty(kind='Room', repeated=True)

class Room(ndb.Model):
    name = ndb.StringProperty()
    beds = ndb.KeyProperty(kind='Bed', repeated=True)

class Bed(ndb.Model):
    name = ndb.StringProperty()
    # .....

Currently I walk the hierarchy naively, one level at a time:

currhosp = ndb.Key(urlsafe=valid_hosp_key).get()
buildings = ndb.get_multi(currhosp.buildings)
for building in buildings:
    rooms = ndb.get_multi(building.rooms)
    for room in rooms:
        beds = ndb.get_multi(room.beds)
        for bed in beds:
            pass  # do something with the bed entity

I would like to make this much faster using get_multi_async.

My difficulty is working out how to do this. Any ideas?

Best, Jon

Ticktacktoe answered 11/1, 2013 at 0:45 Comment(0)
T
11

Using the structures given above, this is possible: I confirmed that you can solve it with a set of tasklets. It is a significant speed-up over the iterative method.

@ndb.tasklet
def get_bed_info(bed_key):
    bed_info = {}
    bed = yield bed_key.get_async()
    # format and store bed information into bed_info
    raise ndb.Return(bed_info)

@ndb.tasklet
def get_room_info(room_key):
    room_info = {}
    room = yield room_key.get_async()
    beds = yield map(get_bed_info,room.beds)
    # store room info in room_info
    room_info["beds"] = beds
    raise ndb.Return(room_info)

@ndb.tasklet
def get_building_info(build_key):
    build_info = {}
    building = yield build_key.get_async()
    rooms = yield map(get_room_info,building.rooms)
    # store building info in build_info
    build_info["rooms"] = rooms
    raise ndb.Return(build_info)

@ndb.toplevel
def get_hospital_buildings(hospital_object):
    buildings = yield map(get_building_info,hospital_object.buildings)
    raise ndb.Return(buildings)

Now comes the main call, made from the function where you have the hospital object (hospital_obj):

hosp_info = {}
buildings = get_hospital_buildings(hospital_obj)
# store hospital info in hosp_info
hosp_info["buildings"] = buildings
return hosp_info

There you go! It is very efficient: it lets the scheduler fetch all the information as quickly as it can within the GAE backbone.
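To see why the nested fan-out helps, here is a minimal sketch of the same shape written with Python 3's asyncio instead of ndb, so it runs without App Engine. FAKE_STORE and fake_get are hypothetical stand-ins for the datastore and for key.get_async(); await plays the role of yield, and asyncio.gather plays the role of yielding a list of futures. It is an analogy under those assumptions, not the ndb implementation.

```python
import asyncio

# Hypothetical in-memory stand-in for the datastore: key -> entity.
FAKE_STORE = {
    "hosp1": {"name": "General", "buildings": ["bld1", "bld2"]},
    "bld1": {"name": "North", "rooms": ["rm1"]},
    "bld2": {"name": "South", "rooms": ["rm2", "rm3"]},
    "rm1": {"name": "101", "beds": ["bed1", "bed2"]},
    "rm2": {"name": "201", "beds": ["bed3"]},
    "rm3": {"name": "202", "beds": []},
    "bed1": {"name": "A"},
    "bed2": {"name": "B"},
    "bed3": {"name": "C"},
}


async def fake_get(key):
    """Stand-in for key.get_async(): wait briefly, then return the entity."""
    await asyncio.sleep(0.01)  # simulated RPC latency
    return FAKE_STORE[key]


async def get_bed_info(bed_key):
    bed = await fake_get(bed_key)
    return {"name": bed["name"]}


async def get_room_info(room_key):
    room = await fake_get(room_key)
    # Fan out: every bed in this room is fetched concurrently.
    beds = await asyncio.gather(*(get_bed_info(k) for k in room["beds"]))
    return {"name": room["name"], "beds": list(beds)}


async def get_building_info(building_key):
    building = await fake_get(building_key)
    rooms = await asyncio.gather(*(get_room_info(k) for k in building["rooms"]))
    return {"name": building["name"], "rooms": list(rooms)}


async def get_hospital_info(hospital_key):
    hospital = await fake_get(hospital_key)
    buildings = await asyncio.gather(
        *(get_building_info(k) for k in hospital["buildings"])
    )
    return {"name": hospital["name"], "buildings": list(buildings)}


hosp_info = asyncio.run(get_hospital_info("hosp1"))
```

Each level still has to wait for its parent, but siblings no longer wait for each other, so the total wait grows with the depth of the hierarchy rather than with the number of entities.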

Ticktacktoe answered 14/1, 2013 at 6:13 Comment(6)
You can (and should!) accept your own answer. It's good SO karma. - Externalization
First, @GuidovanRossum, please go back to App Engine : ). Sorry, on to the question. Why do you have the outer function decorated with toplevel? Is that not just for request handlers? I'm curious because it still does a raise ndb.Return. - Coffeehouse
Never mind, figured it out. This code helped a lot with my situation. Really appreciate it! - Coffeehouse
In case anyone cares, I just found this out from testing: doing an ndb.get_multi() on the initial hospital_object.buildings keys is faster than using the async tasklets in the outer tasklet. I was curious whether getting everything up front or using the auto-batcher would be faster. It seems to be the former. - Coffeehouse
If you are getting it all, get_multi, which is like the auto-batcher, is faster. However, the tasklet approach allows you to spread the tasklets over multiple cores. get_multi, AFAIK, is not like that and will saturate your cores. In many cases, though, you provided a great tip! I also think the tasklets carry a very small penalty for much easier-to-read code, and they allow segmentation, so the approach scales with the number of hospitals you have. get_multi is not recommended if you are going to be running many async tasklets. What's nice is that you don't have the async pauses that you get with get_multi. - Ticktacktoe
Hey Jon, thanks for the reply, especially a couple of years later! I had no idea you could use the builtin map function like this; it was the key to stopping tasklet errors for me. Thank you for that. Could you explain a couple of things? "The multi_get AFAIK is not like that and will saturate your cores" and "async_pauses that you get with the multi_get". I just want to make sure I understand, no rush. Thanks man! - Coffeehouse
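One way to read the "get everything up front" idea from the comments, extended to every level, is one batched fetch per level of the hierarchy. The sketch below only counts round trips against an in-memory dictionary; FAKE_STORE and fake_get_multi are hypothetical stand-ins for the datastore and for ndb.get_multi, so it says nothing about real latency.

```python
# Hypothetical in-memory stand-in for the datastore: key -> entity.
FAKE_STORE = {
    "hosp1": {"name": "General", "buildings": ["bld1", "bld2"]},
    "bld1": {"name": "North", "rooms": ["rm1"]},
    "bld2": {"name": "South", "rooms": ["rm2", "rm3"]},
    "rm1": {"name": "101", "beds": ["bed1", "bed2"]},
    "rm2": {"name": "201", "beds": ["bed3"]},
    "rm3": {"name": "202", "beds": []},
    "bed1": {"name": "A"},
    "bed2": {"name": "B"},
    "bed3": {"name": "C"},
}

round_trips = 0  # number of batched fetches issued


def fake_get_multi(keys):
    """Stand-in for ndb.get_multi: one call counts as one round trip."""
    global round_trips
    round_trips += 1
    return [FAKE_STORE[k] for k in keys]


def fetch_beds_level_by_level(hospital):
    """Fetch all beds with one batched call per level of the hierarchy."""
    buildings = fake_get_multi(hospital["buildings"])
    # Flatten the child keys of every parent before the next batched call.
    room_keys = [k for b in buildings for k in b["rooms"]]
    rooms = fake_get_multi(room_keys)
    bed_keys = [k for r in rooms for k in r["beds"]]
    return fake_get_multi(bed_keys)


beds = fetch_beds_level_by_level(FAKE_STORE["hosp1"])
```

With this shape the number of batched calls equals the number of levels (three here), whereas the nested loops in the question issue one call per building and one per room.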
E
3

You can do something with query.map(). See https://developers.google.com/appengine/docs/python/ndb/async#tasklets and https://developers.google.com/appengine/docs/python/ndb/queryclass#Query_map

Externalization answered 11/1, 2013 at 16:1 Comment(1)
This allowed me to explore the possibilities, including tasklets, and thus I was able to construct the answer that I posted. - Ticktacktoe
G
-1

It's impossible. Your second query (ndb.get_multi(b.rooms)) depends on the result of your first query, so pulling it asynchronously doesn't work: at that point the first result of the first query has to be available anyway. NDB does something like that in the background (it already buffers the next items of ndb.get_multi(currhosp.buildings) while you process the first result). However, you could use denormalization, i.e. keep one big table with one entry per Building-Room-Bed combination, and pull your results from that table. If you have more reads than writes on this table, this will get you a massive speed improvement (1 DB read instead of 3).
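A rough sketch of what such a denormalized table could look like, using plain Python data in place of a datastore kind. The row layout, ROWS, index_by_hospital and regroup are all hypothetical names chosen for illustration; in NDB this would presumably be a separate model queried by hospital, which is not shown here.

```python
from collections import defaultdict

# Hypothetical denormalized "table": one row per building-room-bed combination.
ROWS = [
    {"hospital": "hosp1", "building": "North", "room": "101", "bed": "A"},
    {"hospital": "hosp1", "building": "North", "room": "101", "bed": "B"},
    {"hospital": "hosp1", "building": "South", "room": "201", "bed": "C"},
    {"hospital": "hosp2", "building": "East", "room": "1", "bed": "D"},
]


def index_by_hospital(rows):
    """Group rows by hospital so a single lookup returns every bed."""
    index = defaultdict(list)
    for row in rows:
        index[row["hospital"]].append(row)
    return index


def regroup(rows):
    """Rebuild the building -> room -> [beds] nesting from flat rows."""
    nested = {}
    for row in rows:
        rooms = nested.setdefault(row["building"], {})
        rooms.setdefault(row["room"], []).append(row["bed"])
    return nested


index = index_by_hospital(ROWS)
nested = regroup(index["hosp1"])
```

The trade-off is the usual one for denormalization: reads become a single lookup, but every change to a building, room or bed has to be written to each row that mentions it.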

Gymnast answered 11/1, 2013 at 12:1 Comment(3)
It is possible - see my comment above. - Ticktacktoe
It seems I was wrong there, sorry for that. After figuring out what's happening in the tasklets, I see you can actually speed this up. But this affects only the overall time. The time until the first result is available and the number of DB reads (= costs) still stay high. So depending on your scenario, denormalization might still be the better solution. - Longford
Yeah, I agree denormalization would work - that's my next step! Any recommendations on what I should read to help create a denormalization scheme? - Ticktacktoe
