Does the NDB membership query ("IN" operation) performance degrade with lots of possible values?

The documentation for the IN query operation states that those queries are implemented as a big OR'ed equality query:

qry = Article.query(Article.tags.IN(['python', 'ruby', 'php']))

is equivalent to:

qry = Article.query(ndb.OR(Article.tags == 'python',
                           Article.tags == 'ruby',
                           Article.tags == 'php'))

I am currently modelling some entities for a GAE project and plan on using these membership queries with a lot of possible values:

qry = Player.query(Player.facebook_id.IN(list_of_facebook_ids))

where list_of_facebook_ids could have thousands of items.

Will this type of query perform well with thousands of possible values in the list? If not, what would be the recommended approach for modelling this?

Baggott answered 13/8, 2012 at 14:8 Comment(0)

This won't work with thousands of values (in fact I bet it starts degrading with more than 10 values). The only alternative I can think of is some form of precomputation. You'll have to change your schema.

Buhl answered 13/8, 2012 at 19:30 Comment(4)
Thanks. I am thinking about using facebook_id as the key for my entities, and using ndb.get_multi() to fetch all the entities that are in my list of possible values. Would that work? – Baggott
get_multi() definitely works fine with hundreds of keys; I'd be hesitant with thousands, although if the entities are small or you expect many of them to be non-existent, you may be able to do it. – Buhl
Thanks! Is there a way to do a "keys only" get_multi()? I'm really not that interested in fetching the whole entities; I just want to answer the question "among this list of keys, which are present in the datastore?". Is there something tailored to this sort of query? – Baggott
No, we don't have that feature. If your entities are big, consider breaking them in two: a lightweight summary and an optional part containing the rarely-used, larger properties. You could make the latter a child of the former, so you can read and write them atomically. – Buhl
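The split suggested in the last comment could be sketched roughly like this (PlayerSummary, PlayerDetails, and their properties are hypothetical names chosen for illustration, not part of the original answer):

```python
from google.appengine.ext import ndb


class PlayerSummary(ndb.Model):
    # Keyed by facebook_id; holds only the small, frequently-read fields.
    name = ndb.StringProperty()


class PlayerDetails(ndb.Model):
    # Made a child of PlayerSummary so the pair shares an entity group
    # and can be read and written atomically in one transaction.
    bio = ndb.TextProperty()


def present_facebook_ids(list_of_facebook_ids):
    # Approximates a "keys only" get_multi(): fetch the cheap summary
    # entities and report which of the given ids exist in the datastore.
    keys = [ndb.Key(PlayerSummary, fid) for fid in list_of_facebook_ids]
    return [k.id() for k, e in zip(keys, ndb.get_multi(keys))
            if e is not None]
```

Fetching only the summaries keeps the get_multi() cheap even when most of the ids turn out not to exist.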

One way to do it is to create a new model called FacebookPlayer that acts as an index, keyed by facebook_id. You would update it whenever you add a new player. It looks something like this:

class FacebookPlayer(ndb.Model):
    player = ndb.KeyProperty(kind='Player', required=True)

Now you can avoid queries altogether. You can do this:

# Build keys from facebook ids.
facebook_id_keys = []
for facebook_id in list_of_facebook_ids:
    facebook_id_keys.append(ndb.Key('FacebookPlayer', facebook_id))

# Look up the index entities, then fetch the matched players.
player_keys = []
for facebook_player in ndb.get_multi(facebook_id_keys):
    if facebook_player:
        player_keys.append(facebook_player.player)
players = ndb.get_multi(player_keys)

If list_of_facebook_ids has thousands of items, you should do this in batches.
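A minimal, framework-independent batching helper might look like this (the name `chunks` and the batch size of 500 are arbitrary choices for illustration, not NDB requirements):

```python
def chunks(items, size=500):
    """Yield successive slices of at most `size` items each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```

Each batch of keys can then be passed to ndb.get_multi() in turn, or to ndb.get_multi_async() to overlap the RPCs.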

Webworm answered 3/3, 2014 at 1:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.