App Engine High Replication Datastore

3

14

I'm a total App Engine newbie, and I want to confirm my understanding of the high replication datastore.

The documentation says that entity groups are a "unit of consistency", and that all data is eventually consistent. Along the same lines, it also says "queries across entity groups can be stale".

Can someone provide some examples where queries can be "stale"? Is it saying I could potentially save an entity without any parent (i.e. its own group), then query for it very soon after and not find it? Does it also imply that if I want data to be always 100% up-to-date I need to save it all in the same entity group?

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers? What's the ballpark latency for that?

Thanks

Zhao answered 30/5, 2011 at 7:44 Comment(0)
18

Is it saying I could potentially save an entity without any parent (i.e. its own group), then query for it very soon after and not find it?

Correct. Technically, this is the case for the regular Master-Slave datastore, too, as indexes are updated asynchronously, but in practice the window of time in which that could happen is so incredibly small you never see it.

If by "query" you mean "do a get by key", though, that will always return strongly consistent results in either implementation.

Does it also imply that if I want data to be always 100% up-to-date I need to save them all in the same entity group?

You'll need to define what you mean by "100% up-to-date" before it's possible to answer that.

Is the common workaround for this to use memcache to cache entities for a period of time longer than the average time it takes for data to become consistent across all data centers?

No. Memcache is strictly for improving access times; you shouldn't use it in any situation where cache eviction will cause trouble.

Strongly consistent gets are always available to you if you need to guarantee that you're seeing the latest version. Without a concrete example of what you're trying to do, though, it's difficult to provide a recommendation.
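To make the difference between a get by key and a query concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the App Engine API: the class `ToyDatastore` and its methods are hypothetical names, and the explicit `apply_pending_index_updates()` call stands in for the asynchronous index update described above.

```python
class ToyDatastore:
    """Toy model only: entities are visible to get() at once, indexes lag."""

    def __init__(self):
        self._entities = {}  # key -> properties, updated as soon as put() returns
        self._index = {}     # key -> properties as seen by queries
        self._pending = []   # keys whose index rows have not been applied yet

    def put(self, key, **props):
        self._entities[key] = dict(props)
        self._pending.append(key)

    def get(self, key):
        # Get by key reads the entity itself, so it always sees the latest write.
        return self._entities.get(key)

    def query(self, **filters):
        # Queries read the index, which may not include recent writes yet.
        return sorted(
            key for key, props in self._index.items()
            if all(props.get(name) == value for name, value in filters.items())
        )

    def apply_pending_index_updates(self):
        # Stand-in for the asynchronous index update catching up.
        for key in self._pending:
            self._index[key] = dict(self._entities[key])
        self._pending = []


store = ToyDatastore()
store.put("post1", author="bob", article="first article")

fresh = store.get("post1")                  # sees the new entity
stale_results = store.query(author="bob")   # index has not caught up
store.apply_pending_index_updates()
later_results = store.query(author="bob")   # now includes post1
```

In the real datastore you do not trigger the catch-up yourself; the sketch only shows why the two read paths can disagree for a short time.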

Cauterize answered 30/5, 2011 at 10:2 Comment(4)
I'm sorry, I don't really have a concrete example. I'm trying to learn the system so I can start work on my project. I just want to be able to store data in the datastore, and retrieve the latest version when I need it. I'm just trying to figure out when this isn't the case, and how I can guarantee that when I query for a result I'll get the freshest one. By "query" I meant doing a query by property like one would in SQL, not by key. I just want to understand what they mean by saying that entity groups are a "unit of consistency" and what can be "inconsistent".Zhao
Nick, the Usage Notes section of this doc: code.google.com/intl/en/appengine/docs/python/datastore/hr/… says that "you can put recent posts in memcache with an expiration, and then display a mix of recent posts from memcache and posts retrieved from the datastore.".Tangent
@user439383 (Have you considered setting a more useful username?) Personally I would stop worrying about this until/unless you have a specific case it's of concern. Eventually consistent semantics are fine for most situations, and you'll know when you need strong consistency.Cauterize
Great answer, Nick. I just want to confirm something you said. 'If by "query" you mean "do a get by key", though, that will always return strongly consistent results in either implementation.'. So, if I do: MyNDBModal.get_by_id(theID), I will always find it, even if it was written recently?Coated
11

Obligatory blog example setup: Authors have Posts.

from google.appengine.ext import db

class Author(db.Model):
    name = db.StringProperty()

class Post(db.Model):
    author = db.ReferenceProperty()
    article = db.TextProperty()

bob = Author(name='bob')
bob.put()

The first thing to remember is that a regular get/put/delete on a single entity group (including a single entity) will work as expected:

post1 = Post(article='first article', author=bob)
post1.put()

fetched_post = Post.get(post1.key())
# fetched_post is latest post1

You will only be able to notice inconsistency if you start querying across multiple entity groups. Unless you have specified a parent attribute, all your entities are in separate entity groups. So if it is important that bob can see his own post straight after creating it, we should be careful with the following:

fetched_posts = Post.all().filter('author =', bob).fetch(x)
# fetched_posts _might_ contain latest post1

fetched_posts might contain the latest post1 from bob, but it might not. This is because the Posts are not all in the same entity group. When querying like this in HR you should think "fetch me probably the latest posts for bob".

Since it is important in our application that the author can see his post in the list straight after creating it, we will use the parent attribute to tie them together, and use an ancestor query to fetch the posts only from within that group:

post2 = Post(parent=bob, article='second article', author=bob)  # parent is the Author instance bob
post2.put()

bobs_posts = Post.all().ancestor(bob.key()).filter('author =', bob).fetch(x)

Now we know that post2 will be in our bobs_posts results.
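The effect of the parent attribute can be illustrated with a small plain-Python model. This is a sketch under simplifying assumptions, not the App Engine API: `ToyEntityGroupStore`, `ancestor_query`, `global_query` and `catch_up` are hypothetical names, a parent is always the root of its group, and the lag of cross-group queries is modelled by an explicit `catch_up()` call.

```python
class ToyEntityGroupStore:
    """Toy model: ancestor queries see committed data, global queries lag."""

    def __init__(self):
        self._committed = []     # (root, key, props), visible right after put()
        self._global_index = []  # snapshot used by cross-group queries

    def put(self, key, parent=None, **props):
        # An entity without a parent is the root of its own entity group.
        root = parent if parent is not None else key
        self._committed.append((root, key, dict(props)))

    @staticmethod
    def _matches(props, filters):
        return all(props.get(name) == value for name, value in filters.items())

    def ancestor_query(self, root, **filters):
        # Restricted to one entity group, so it reads the committed data.
        return [key for r, key, props in self._committed
                if r == root and self._matches(props, filters)]

    def global_query(self, **filters):
        # Spans entity groups, so it reads the possibly stale snapshot.
        return [key for _, key, props in self._global_index
                if self._matches(props, filters)]

    def catch_up(self):
        # Stand-in for replication and index updates completing.
        self._global_index = list(self._committed)


store = ToyEntityGroupStore()
store.put("bob", name="bob")
store.put("post1", author="bob")                # its own entity group
store.put("post2", parent="bob", author="bob")  # in bob's entity group

in_group = store.ancestor_query("bob", author="bob")
before = store.global_query(author="bob")
store.catch_up()
after = store.global_query(author="bob")
```

Here `in_group` contains post2 immediately, while the cross-group query only returns both posts once the model has caught up.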

If the aim of our query was to fetch "probably all the latest posts + definitely latest posts by bob" we would need to do another query.

other_posts = Post.all().fetch(x)

Then merge the results other_posts and bobs_posts together to get the desired result.
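One way to do that merge is sketched below. It assumes each post can be reduced to a unique key; the helper name `merge_posts` and the dict-shaped sample posts are illustrative only. With real `db.Model` entities you would presumably pass something like `key=lambda post: post.key()` instead.

```python
def merge_posts(definite, probable, key=lambda post: post["key"]):
    """Return definite posts first, then probable ones, skipping duplicate keys."""
    seen = set()
    merged = []
    for post in list(definite) + list(probable):
        k = key(post)
        if k not in seen:
            seen.add(k)
            merged.append(post)
    return merged


# Sample data standing in for the two query results.
bobs_posts = [{"key": "post2", "author": "bob"}]
other_posts = [
    {"key": "post1", "author": "bob"},
    {"key": "post2", "author": "bob"},   # also returned by the ancestor query
    {"key": "post9", "author": "alice"},
]

combined = merge_posts(bobs_posts, other_posts)
combined_keys = [post["key"] for post in combined]
```

Putting the strongly consistent results first means that when a post appears in both lists, the copy from the ancestor query is the one that is kept.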

Furniture answered 1/6, 2011 at 11:8 Comment(1)
This is a great explanation actually. Only thing I don't understand is what is person in your example? A class or instance?Hist
5

Having just migrated my app over from the Master/Slave to the High Replication datastore, I have to say that in practice, eventual consistency isn't a problem for most applications.

Consider the classic guestbook example, where you put() a new guestbook post entity and then immediately query all the posts in the guestbook. With the High Replication datastore, the new post may not appear in the query results until a few seconds later (at Google I/O, the Google engineers said that the lag was on the order of 2-5 seconds).

Now, in practice, your guestbook app is probably doing an AJAX post of the new guestbook post entry. There is no need to refetch all the posts after submitting the new post. The webapp can simply insert the new entry into the UI once the AJAX request has succeeded. By the time the user leaves the webpage and returns to it, or even hits the browser refresh button, several seconds will have elapsed, and it is very likely that the new post will be returned by the query that pulls in all the guestbook posts.

Finally, note that eventual consistency applies only to queries. If you put() an entity and immediately call db.get() to fetch it back, the result is strongly consistent, i.e. you will get the latest snapshot of the entity.

Truelove answered 1/6, 2011 at 22:1 Comment(0)
