Most efficient way to get, modify and put a batch of entities with ndb
In my app I have a few batch operations I perform. Unfortunately this sometimes takes forever to update 400-500 entities. What I have is all the entity keys; I need to get them, update a property, and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.

I'll simplify my model to explain what I do (which is pretty simple anyway):

from google.appengine.ext import ndb

class Entity(ndb.Model):
    title = ndb.StringProperty()

keys = [key1, key2, key3, key4, ..., key500]

# one batch RPC to fetch everything
entities = ndb.get_multi(keys)

for e in entities:
    e.title = 'the new title'

# one batch RPC to write everything back
ndb.put_multi(entities)

Getting and modifying does not take too long. I tried get_async, getting inside a tasklet, and whatever else is possible, which only changes whether the get or the for loop takes longer.

But what really bothers me is that a put can take up to 50 seconds...

What is the most efficient way to do this operation (or operations) in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time it takes to put is really over the acceptable limit for me.
I already tried async operations and tasklets...

Hamford answered 16/4, 2012 at 20:14 Comment(2)
What do I need to import for running this script?Haemoglobin
from google.appengine.ext import ndbHamford
I wonder if doing smaller batches of e.g. 50 or 100 entities would be faster. If you make that into a tasklet, you can try running those tasklets concurrently.

I also recommend looking at this with Appstats to see if that shows something surprising.

Finally, assuming this uses the HRD (High Replication Datastore), you may find that there is a limit on the number of entity groups per batch. This limit defaults to a very low value; try raising it.
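A minimal sketch of the batching idea above. The chunking helper is plain Python; the commented-out part shows one hypothetical way to wire it to ndb tasklets (the names `update_batch` and the batch size of 50 are illustrative assumptions, and that part requires the App Engine runtime):

```python
def chunks(seq, size):
    """Split a sequence into lists of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Hypothetical use with ndb (App Engine runtime only):
#
# @ndb.tasklet
# def update_batch(keys):
#     entities = yield ndb.get_multi_async(keys)   # yield a list of futures
#     for e in entities:
#         e.title = 'the new title'
#     yield ndb.put_multi_async(entities)
#
# futures = [update_batch(batch) for batch in chunks(keys, 50)]
# ndb.Future.wait_all(futures)  # let the event loop interleave the RPCs
```

Running the batches as separate tasklets lets the event loop overlap the get and put RPCs instead of serializing one giant batch.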

Welborn answered 17/4, 2012 at 4:54 Comment(5)
All entities are in the same entity group, and after a few tests I have to say that using a tasklet to put the entities within a deferred task, batching by 50, makes this operation a lot quicker. I'm speaking about 5-10 seconds for the full update, which is still not what I had in mind but a lot better than 50 seconds.Hamford
I find this surprising: if they are all in the same group, wouldn't having different batches make you reach the limit of roughly 1 write/second/group?Chrisse
I meant I'm surprised that splitting in more batches helps even when all entities are in the same group.Chrisse
He'd have to show the Appstats output to help me understand that. Maybe his entities are large.Welborn
@Hamford I know it's an old thread, but care to share the general code you used for making the operation quicker using tasklets and batches?Delegate
This sounds like what MapReduce was designed for. You can do this quickly by getting and modifying all the entities at the same time, scaled across multiple server instances. Your cost goes up by using more instances, though.
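A sketch of what the mapper side could look like. In the real (now-legacy) appengine-mapreduce library the yielded operation would be `mapreduce.operation.db.Put`; the `Put` class below is a local stand-in, used only so the shape of the mapper is visible outside the App Engine runtime:

```python
class Put(object):
    """Stand-in for the library's Put operation (assumption: it wraps an entity
    so the framework can collect yielded ops into batched datastore writes)."""
    def __init__(self, entity):
        self.entity = entity

def set_title(entity):
    """Called once per entity by the framework; mutate, then yield a write op."""
    entity.title = 'the new title'
    yield Put(entity)
```

The framework shards the input across instances and batches the yielded writes, which is exactly the get-modify-put loop from the question, parallelized.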

Verditer answered 16/4, 2012 at 20:32 Comment(4)
OK, but this is an on-demand operation. Not sure if MapReduce is an answer to my question, but I might be wrong.Hamford
When you say "on demand", I assume you mean "HTTP-request initiated?" That doesn't prevent you from using MapReduce. However, checking the results when everything completes is a bit more of a hassle, but not impossible.Verditer
Yes, they are HTTP-request initiated, and I need to check whether the action/operation succeeded or not.Hamford
Have you already tried batch get/put operations? googleappengine.blogspot.ca/2009/06/… Otherwise, I don't know of any faster way than to parallelize it. Anything else just defers the operation to run asynchronously. If you are operating on large data that takes time to write, I'd seriously consider writing asynchronously and using a separate HTTP request to poll for success in case your original connection times out.Verditer
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =

What if you used the Task Queue? You could create multiple tasks, and each task could take as URL params the keys it is responsible for updating, along with the property and value that should be set. That way the work is broken up into manageable chunks, and the user's request can return immediately while the work happens in the background. Would that work?
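A sketch of the payload side of that idea. Packing the batch into the task body avoids URL-length limits; the payload shape, the `/worker/update` URL, and the 100-key batch size are all illustrative assumptions, and `taskqueue.add` (the real GAE enqueue call) is shown commented out because it needs the App Engine runtime:

```python
import json

def make_task_payload(urlsafe_keys, prop, value):
    """Pack a batch of key strings plus the property update into a JSON payload."""
    return json.dumps({'keys': urlsafe_keys, 'prop': prop, 'value': value})

# Hypothetical enqueue, one task per 100-key batch (App Engine runtime only):
#
# for i in range(0, len(keys), 100):
#     batch = [k.urlsafe() for k in keys[i:i + 100]]
#     taskqueue.add(url='/worker/update',
#                   payload=make_task_payload(batch, 'title', 'the new title'))
```

The worker handler would then `json.loads` the payload, rebuild the keys, and run the get/modify/put loop on its own batch.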

Ollieollis answered 16/4, 2012 at 23:11 Comment(3)
I think the MapReduce API basically does this for you. It takes your work, batches it into multiple tasks, then issues the tasks in parallel, so that it gets completed quicker.Verditer
The entity design is out of the discussion, and yes, I thought about setting up tasks, but then I would have other issues, like not knowing whether the operation went through or not; all entities are in the same entity group, and doing the put(s) in batches could cause contention issues... I still can't understand how a batch put of a bunch of entities (and I'm not speaking about a million entities) can take almost a minute to go through. These are not background tasks but actions done by a user, so I need to know right away whether actions fail or succeed.Hamford
GvR's tip on using Appstats is probably of benefit to you, then. How many indexes are being written to when you put one of your entities? How many DB writes is it? It might be a little out of date, but check out this article for why it might take a while to put one of your entities: developers.google.com/appengine/articles/life_of_writeOllieollis

© 2022 - 2024 — McMap. All rights reserved.