Sorting entities and filtering on a ListProperty without incurring exploding indexes

I'm developing a simple blogging/bookmarking platform and I'm trying to add a tag-explorer/drill-down feature à la Delicious to allow users to filter the posts by specifying a list of tags.

Something like this: [screenshot of a Delicious-style tag drill-down sidebar omitted]

Posts are represented in the datastore with this simplified model:

class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)

Post's tags are stored in a ListProperty and, in order to retrieve the list of posts tagged with a specific list of tags, the Post model exposes the following static method:

@staticmethod
def get_posts(limit, offset, tags_filter=[]):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    return posts.fetch(limit, offset=offset)

This works well, although I haven't stress-tested it much.
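
For reference, a typical call with that signature would look like this (the tag values are just examples):

# First page of 10 posts tagged with both 'python' and 'blog'
posts = Post.get_posts(10, 0, ['python', 'blog'])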

The problem arises when I try to add a sort order to the get_posts method to keep the results ordered by "-created" date:

@staticmethod
def get_posts(limit, offset, tags_filter=[]):
    posts = Post.all()
    for tag in tags_filter:
        if tag:
            posts.filter('tags', tag)
    posts.order("-created")
    return posts.fetch(limit, offset=offset)

Adding the sort order requires a composite index that lists the tags property once per filter plus created, and because tags is a list property this leads to the dreaded exploding-indexes problem.
One more thing that complicates matters: the get_posts method should also provide some pagination mechanism.

Do you know any Strategy/Idea/Workaround/Hack to solve this problem?

Creamy answered 24/5, 2011 at 9:1 Comment(0)

Queries involving keys use indexes just like queries involving properties. Queries on keys require custom indexes in the same cases as queries on properties, with a couple of exceptions: inequality filters or an ascending sort order on the key do not require a custom index, but a descending sort order on the key (__key__, i.e. Entity.KEY_RESERVED_PROPERTY) does.

So use a sortable date string as the key name of the entity:

class Post(db.Model):
    title = db.StringProperty(required = True)
    link = db.LinkProperty(required = True)
    description = db.StringProperty(required = True)
    tags = db.ListProperty(str)
    created = db.DateTimeProperty(required = True, auto_now_add = True)

    @classmethod
    def create(cls, *args, **kw):
        # key_name sorts in reverse-chronological order
        kw.update(dict(key_name=inverse_microsecond_str() + disambig_chars()))
        return cls(*args, **kw)

...

import datetime
import time

def inverse_microsecond_str():
    # Gives a string of 8 characters, from chr(23) up to 'z', that sorts
    # in reverse temporal order.
    t = datetime.datetime.now()
    inv_us = int(1e16 - (time.mktime(t.timetuple()) * 1e6 + t.microsecond))  # no y2k for >100 yrs
    base_100_chars = []
    while inv_us:
        digit, inv_us = inv_us % 100, inv_us // 100
        base_100_chars = [chr(23 + digit)] + base_100_chars
    return "".join(base_100_chars)

Now, you don't even have to include a sort order in your queries, although it won't hurt to explicitly sort by key.
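
For example, the original get_posts could then drop the "-created" sort entirely; ascending key order is already newest-first, so no composite indexes are needed. A minimal sketch under that assumption:

@staticmethod
def get_posts(limit, offset, tags_filter=None):
    posts = Post.all()
    for tag in tags_filter or []:
        if tag:
            posts.filter('tags', tag)
    # Ascending __key__ order needs no custom index, and the key names
    # already encode reverse-chronological order.
    posts.order('__key__')
    return posts.fetch(limit, offset=offset)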

Things to remember:

  • This won't work unless you use the "create" here for all your Posts.
  • You'll have to migrate old data
  • No ancestors allowed.
  • The key is stored once per index, so it is worthwhile to keep it short; that's why I'm doing the base-100 encoding above.
  • This is not 100% reliable because of the possibility of key collisions. The above code, without disambig_chars, nominally gives collision odds of about one in the number of microseconds between posts, so if you had 10 posts per second at peak times it would fail about 1 in 100,000. However, I'd shave off a couple of orders of magnitude for possible App Engine clock-tick issues, so I'd actually only trust it to about 1 in 1,000. If that's not good enough, add disambig_chars (one possible implementation is sketched just below); and if you need 100% reliability, then you probably shouldn't be on App Engine, but I guess you could include logic to handle key collisions on save().
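
disambig_chars is left undefined above; a minimal sketch of one possible implementation, appending a couple of random characters drawn from the same base-100 alphabet (purely illustrative):

import random

def disambig_chars(n=2):
    # Two random characters from the chr(23)..chr(122) alphabet add
    # 10,000 key-name variations to absorb same-microsecond collisions.
    return "".join(chr(23 + random.randint(0, 99)) for _ in range(n))
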
Geophysics answered 9/6, 2011 at 18:39 Comment(7)
@Jameson I need a descending order that, per the documentation, requires a custom index.Creamy
Sorry, I thought my answer was clear on that. I'll revise my answer to show how to make a key name that naturally sorts in reverse-chronological order.Geophysics
@Jameson I'm testing it and it seems to work. I thought I had already tested this kind of solution, using a key_name with a timestamp like now().strftime("%Y%m%d%H%M%S"), but it did not work. Why is your solution different?Creamy
Note the minus sign in int(1e16 - (time.mktime(t.timetuple()) * 1e6 + t.microsecond)). That means that inverse_microsecond_str is not the time (in base 100), but rather a large number minus the time. In other words, it goes down, not up, as time goes on; and so it sorts in the reverse order.Geophysics
You could change it to base64 with something like B64_CHARLIST = "-0123456789=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" and base_64_str = B64_CHARLIST[digit] + base_64_str and changing the 100s to 64s. If you do that, you should use some power of 64 instead of the 1e16, to ensure you don't y2k until 2100-something. (Or use a "range" instead of a "while" to ensure all your strings are the same length).Geophysics
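
A sketch of that base-64 variant, padded to a fixed width so key names always sort consistently (the width of 9 digits and the helper name are assumptions):

import datetime
import time

B64_CHARLIST = "-0123456789=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def inverse_microsecond_str_b64():
    t = datetime.datetime.now()
    now_us = int(time.mktime(t.timetuple())) * 1000000 + t.microsecond
    # 64**9 microseconds after the Unix epoch is centuries away, so the inverse
    # value stays positive; B64_CHARLIST is in ASCII order, so lexicographic
    # key-name order matches numeric order.
    inv_us = 64 ** 9 - now_us
    chars = []
    for _ in range(9):  # fixed width: every key name is exactly 9 characters
        inv_us, digit = divmod(inv_us, 64)
        chars.insert(0, B64_CHARLIST[digit])
    return "".join(chars)
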
OK, good. Any chance I could get the original bounty? I wouldn't ask, but you seem to have rep to spare.Geophysics
Nope, technically it's not possible; but there are some workarounds :).Creamy

What if you inverted the relationship? Instead of a post with a list of tags you would have a tag entity with a list of posts.

class Tag(db.Model):
  tag = db.StringProperty()
  posts = db.ListProperty(db.Key, indexed=False)

To search for tags you would do tags = Tag.all().filter('tag IN', ['python','blog','async'])

This would hopefully give you three Tag entities, each with a list of the posts that use that tag. You could then do post_union = set(tags[0].posts).intersection(tags[1].posts, tags[2].posts) to find the set of posts that have all the tags.

Then you could fetch those posts and order them by created (I think): Post.all().filter('__key__ IN', list(post_union)).order("-created")

Note: This code is off the top of my head, I can't remember if you can manipulate sets like that.

Edit: @Yasser pointed out that you can only do IN queries for < 30 items.

Instead you could have the key name for each post start with the creation time. Then you could sort the keys you retrieved via the first query and just do Post.get(sorted_posts).

Don't know how this would scale to a system with millions of posts and/or tags.

Edit2: I meant set intersection, not union.
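
Putting those fragments together, a rough sketch of the whole lookup (fetching the matching posts and sorting by created in memory; the IN filter accepts at most 30 values):

def get_posts_by_tags(tag_names, limit=20):
    if not tag_names:
        return []
    # One Tag entity per requested tag name.
    tags = Tag.all().filter('tag IN', tag_names).fetch(len(tag_names))
    if len(tags) < len(tag_names):
        return []  # at least one requested tag has no Tag entity
    # Keep only the posts that appear in every tag's post list.
    post_keys = set(tags[0].posts)
    for t in tags[1:]:
        post_keys &= set(t.posts)
    # Fetch the posts and sort newest-first in memory.
    posts = [p for p in db.get(list(post_keys)) if p is not None]
    posts.sort(key=lambda p: p.created, reverse=True)
    return posts[:limit]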

Greaten answered 25/5, 2011 at 4:45 Comment(7)
IN queries can only work for less than 30 items in the membership set. So if a tag is in more than 30 posts, this will probably not work.Fatma
@Cal Thanks for your help. I don't use a key_name when creating Posts, but maybe I can sort by id using sorted_posts = sorted(posts_keys, key=lambda x: x.id(), reverse=True). I was thinking of mixing the strategies: sort with order("-created") when the number of tags to filter is two or fewer, and otherwise sort in memory after retrieving the keys as you have suggested. What do you think?Creamy
No, ordering by id is not going to work.Creamy
If you don't already have millions of posts, I might go back and run a migration that swaps the posts for ones with [created date]_[uuid] as the key name. Doing this might return your posts in created order without applying the "-created" sort. Not sure. As long as posts are searched much more often than they are created, I think it would be a win. Check out the __make_sid method in gae-sessions for an example of how to do this.Greaten
@Greaten no, it does not work, even creating posts with a key name based on the creation date. I've tested it with key names created like this: now().strftime("%Y%m%d%H%M%S"). Posts are retrieved from oldest to newest, and unfortunately that is not what I want; I would like the most recently inserted posts on top of the list.Creamy
@Greaten & @systempuntoout, You should use the (case normalized) 'tag' as the Tag entity key name; then you can do an efficient Tag.get_by_key_name(list_of_tags). This 'inverted index' method is a pretty proven technique. Incorporating the sort criteria into the post key will let you sort in code prior to fetching the posts by key: compute your intersection, then sort / reverse your list of keys, then use db.get(list_of_sorted_post_keys). This basic method will work OK until your lists of posts get large (a couple of thousand posts for a given tag); then you'll need to pre-sort & shard the index lists.Purehearted
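
A rough sketch of that refinement, assuming the lower-cased tag name is the Tag key_name and the Post key names embed an inverse timestamp (so ascending key-name order is newest-first); the function name is illustrative:

def posts_for_tags(tag_names, limit=20):
    # Batch lookup by key_name: no query and no index needed for this step.
    tags = Tag.get_by_key_name([t.lower() for t in tag_names])
    if not tags or any(t is None for t in tags):
        return []
    post_keys = set(tags[0].posts).intersection(*[t.posts for t in tags[1:]])
    # Sort the key names before fetching so only `limit` posts are read.
    sorted_keys = sorted(post_keys, key=lambda k: k.name())[:limit]
    return [p for p in db.get(sorted_keys) if p is not None]
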
@Greaten I've assigned the bounty for the precious help; this answer is probably the candidate for the accepted mark, though I still have to understand why creating a key_name from the created date did not work.Creamy

This question sounds similar to:

As pointed out by Robert Kluin in the last one, you could also consider using a pattern similar to the "Relation Index" described in this Google I/O presentation.

# Model definitions
class Article(db.Model):
  title = db.StringProperty()
  content = db.StringProperty()

class TagIndex(db.Model):
  tags = db.StringListProperty()

# TagIndex entities are child entities of Articles
article1 = Article(title="foo", content="foo content")
article1.put()
TagIndex(parent=article1, tags=["hop"]).put()

# Get all articles for a given tag (keys-only query, then fetch the parents)
tag_keys = db.GqlQuery("SELECT __key__ FROM TagIndex WHERE tags = :1", "hop")
keys = [k.parent() for k in tag_keys]
articles = db.get(keys)

Depending on how many results you expect back from the tags query, sorting could either be done in memory or by making a date string representation part of the Article key_name.
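
Extending that to the multi-tag AND case asked about in the comments below, a rough sketch along the same lines (keys-only query with one equality filter per tag, assuming the TagIndex model above):

def articles_with_all_tags(tag_names):
    # Keys-only query; multiple equality filters on the same list property
    # mean the TagIndex must contain every requested tag.
    q = db.Query(TagIndex, keys_only=True)
    for name in tag_names:
        q.filter('tags =', name)
    article_keys = [k.parent() for k in q]
    # Fetch the parent Articles; sort in memory, or bake a date string
    # into the Article key_name as suggested above.
    return [a for a in db.get(article_keys) if a is not None]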

Updated with StringListProperty and sorting notes after Robert Kluin's and Wooble's comments on the #appengine IRC channel.

Eucharist answered 7/6, 2011 at 12:27 Comment(3)
I need to get the articles not just for a given tag but for a given list of tags.Creamy
@systempuntoout: articles with all the tags, or any of the tags?Eucharist
So I meant: do you need to get (1) articles with all the tags, or (2) articles with any of the tags?Eucharist

One workaround could be this:

Sort and concatenate a post's tags with a delimiter like |, and store the result in a StringProperty when saving the post. When you receive the tags_filter, sort and concatenate it the same way to create a single StringProperty filter for the posts. Obviously this would be an AND query and not an OR query, but that's what your current code seems to be doing as well.

EDIT: as rightly pointed out, this would only match the exact tag list, not a partial tag list, which is obviously not very useful.

EDIT: what if you model your Post with boolean placeholders for tags, e.g. b1, b2, b3, etc.? When a new tag is defined, you map it to the next available placeholder, e.g. blog=b1, python=b2, async=b3, and keep the mapping in a separate entity. When a tag is assigned to a post, you just switch its equivalent placeholder value to True.

This way, when you receive a tags_filter set, you can construct your query from the mapping, e.g.

Post.all().filter("b1",True).filter("b2",True).order('-created')

can give you all the posts which have tags python and blog.
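
A rough sketch of how that mapping might be wired up, assuming a hypothetical TagMap entity keyed by tag name and a Post model extended with boolean placeholder properties b1, b2, ... (all names here are illustrative):

class TagMap(db.Model):
    # key_name is the tag itself, e.g. 'python'; placeholder is e.g. 'b2'
    placeholder = db.StringProperty(required=True)

def get_posts_with_tags(tags_filter, limit=20, offset=0):
    mappings = TagMap.get_by_key_name(tags_filter)
    if not mappings or any(m is None for m in mappings):
        return []  # an unknown tag can't match any post
    q = Post.all()
    for m in mappings:
        q.filter(m.placeholder, True)
    return q.order('-created').fetch(limit, offset=offset)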

Fatma answered 24/5, 2011 at 16:17 Comment(3)
Searching for blog|python should return all the posts that have at least those two tags; the search should also match, for example, posts tagged with async|blog|python, and your method does not seem to handle that case. Thanks for your help.Creamy
sorry but I don't get it. How many boolean placeholders should a Post model have with this method?Creamy
You can have a max of 200 indexes, right? So you can have as many boolean placeholders as your app can afford after its other indexes are accounted for; that will be the upper limit on the number of tags a post can have.Fatma
