efficient function to retrieve a queryset of ancestors of an mptt queryset

Does anybody have an efficient algorithm to retrieve all ancestors of an mptt queryset? The best I could think of so far is something like this:

from django.db.models import Max, Min
from django.db.models.query import EmptyQuerySet

def qs_ancestors(queryset):
    if isinstance(queryset, EmptyQuerySet):
        return queryset
    # One row per (tree_id, level), with the extreme lft/rght bounds for that level
    queryset_aggs = queryset.values_list('tree_id', 'level').annotate(max_lft=Max('lft'), min_rght=Min('rght'))
    new_queryset = queryset.none()
    for tree_id, level, max_lft, min_rght in queryset_aggs:
        ancestors = MyModel.objects.filter(
            tree_id=tree_id,
            level__lt=level,
            lft__lte=max_lft,
            rght__gte=min_rght,
        )
        new_queryset = ancestors | new_queryset
    return new_queryset

There are two problems with this approach:

  1. It fails when the queryset contains branches that aren't adjacent: the per-(tree_id, level) max_lft/min_rght bounds span the gap between those branches, so nodes in between get matched even though they aren't ancestors of anything in the queryset (i.e. it doesn't really work)
  2. It is highly inefficient, because it ends up with number_of_trees * number_of_levels clauses in the final query, which can get very large very fast

I am open to caching the ancestors somewhere else, but I cannot think of a way to do it efficiently. I considered adding a field with a comma-separated list of ancestors' ids and then doing a GROUP_CONCAT (I am on MySQL) inside an extra(), but I think that could get huge/slow.
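
For illustration, here is a rough sketch of that kind of cache (the ancestor_ids field and the save() override are assumptions made for the example, not something the question already has): each node stores its ancestors' pks in a text field as it is saved.

from django.db import models
from mptt.models import MPTTModel, TreeForeignKey

class MyModel(MPTTModel):
    parent = TreeForeignKey('self', null=True, blank=True,
                            related_name='children', on_delete=models.CASCADE)
    # Hypothetical denormalized cache: comma-separated pks of all ancestors
    ancestor_ids = models.TextField(blank=True, default='')

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # After mptt has assigned lft/rght, rebuild the cached ancestor list
        ids = ','.join(str(pk) for pk in
                       self.get_ancestors().values_list('pk', flat=True))
        if ids != self.ancestor_ids:
            self.ancestor_ids = ids
            # .update() avoids recursing into save() a second time
            type(self).objects.filter(pk=self.pk).update(ancestor_ids=ids)

The obvious caveat with any cache like this is that moving a node stales the cached ids of its whole subtree, so those would have to be rebuilt after a move.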

Imhoff answered 24/6, 2011 at 17:15 Comment(0)

How about:

from django.db.models.query import EmptyQuerySet

def qs_ancestors(queryset):
    if isinstance(queryset, EmptyQuerySet):
        return queryset
    new_queryset = queryset.none()
    for obj in queryset:
        # get_ancestors() returns a lazy QuerySet, so ORing here only builds up
        # one combined query; nothing hits the database yet
        new_queryset = new_queryset | obj.get_ancestors()
    return new_queryset

It's still len(queryset) clauses. You could potentially reduce the number of clauses a bit by doing a preprocessing pass that removes objects that are ancestors of other objects in the queryset, something like:

min_obj_set = []
# Deepest nodes first, so descendants are seen before their ancestors
for obj in queryset.order_by('tree_id', '-level'):
    for obj2 in min_obj_set:
        if obj.is_ancestor_of(obj2):
            # obj's ancestors are a subset of obj2's, so obj can be dropped
            break
    else:
        min_obj_set.append(obj)

Although the above snippet is only an example, you'll probably want to use a BST if your queryset contains a significant number of objects.
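
As a rough illustration of that idea (an assumption about what such a structure could look like, not code from the answer), a sorted list of kept lft values per tree plus the standard bisect module gives the same effect as a BST: in nested sets, obj is an ancestor of an already-kept node exactly when some kept lft falls strictly between obj.lft and obj.rght.

import bisect
from collections import defaultdict

def minimal_nodes(queryset):
    kept = []                          # nodes whose ancestors still need fetching
    lfts_by_tree = defaultdict(list)   # sorted lft values of kept nodes, per tree
    # Deepest nodes first, so descendants are kept before their ancestors show up
    for obj in queryset.order_by('tree_id', '-level'):
        lfts = lfts_by_tree[obj.tree_id]
        i = bisect.bisect_right(lfts, obj.lft)
        if i < len(lfts) and lfts[i] < obj.rght:
            # A kept node sits inside obj's lft/rght span, so obj is its ancestor
            # and obj's ancestors are already covered; skip it
            continue
        bisect.insort(lfts, obj.lft)
        kept.append(obj)
    return kept

Each containment check is then O(log n) instead of a scan over min_obj_set, which only starts to matter once the queryset is fairly large.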

You'll have to test whether this yields an increase in speed vs. the larger DB query, though.

Totalizer answered 30/6, 2011 at 16:2 Comment(5)
The issue with this approach is that it needs to make a query for each object. obj.get_ancestors() executes a query, does it not? So you end up with len(qs) + 1 queries without the optimization.Imhoff
@bacon no, obj.get_ancestors() returns a QuerySet object which is lazily evaluated. The QuerySets are ORed together in the loop - in this case, the where clauses will be ORed - and then only executed later when you attempt to access the values. There will only be two queries executed, one for the original QuerySet, and one to get all the ancestors.Gonzales
Ahh -- sorry, good call. So yours would be roughly the same as @Cesar's?Imhoff
You could eliminate that extra query with a .select_related, though.Imhoff
I've never looked into how django handles select_related on a model linked to itself with a ForeignKey field. It may well cause an infinite loop unless a depth constraint is used. In this case it isn't needed though. node.parent_id is equivalent to node.parent.pk, without an extra query as described in the django docs. A query with select_related will likely return more columns so it will be less efficient.Gonzales

I had to write a similar algorithm once. I had a view displaying an MPTT tree; it was a very large tree, so I couldn't load all of its data in the HTML template. So I displayed only the root nodes in the initial load and used Ajax to load the other nodes.

It was working well until my boss asked me to implement a 'search' option. The search had to look in all nodes and expand the tree where it found a match. It took me a while to figure this out, but I finally got it. Here is the solution I came up with:

from django.db.models import Q

def get_parents(self, qs):
    tree_list = {}
    query = Q()
    for node in qs:
        if node.tree_id not in tree_list:
            tree_list[node.tree_id] = []

        # node.parent_id gives the parent's pk (or None) without an extra query
        parent = node.parent_id

        if parent not in tree_list[node.tree_id]:
            tree_list[node.tree_id].append(parent)

            # Every ancestor of this node has a lft/rght span that encloses it
            query |= Q(lft__lt=node.lft, rght__gt=node.rght, tree_id=node.tree_id)

    return YourModel.objects.filter(query)

It needs only two queries to run: the initial qs passed as an argument and the final queryset returned by the function. tree_list is a dictionary that tracks, per tree, which parents have already been covered by a clause; it's an optimization and isn't necessary for the algorithm to work, but since I was working with a relatively big tree I had to include it.

I guess you could turn this method into a manager and make it more generic, i.e. make it work for any MPTT model and not only YourModel.
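
A rough sketch of what that manager could look like (AncestorManager and ancestors_of are illustrative names, not from the answer); because a manager knows its own model, the hard-coded YourModel reference disappears:

from django.db.models import Q
from mptt.managers import TreeManager

class AncestorManager(TreeManager):
    # Subclassing TreeManager keeps mptt's own manager behaviour intact
    def ancestors_of(self, qs):
        seen = {}  # per tree_id: parent pks already covered by a clause
        query = Q()
        for node in qs:
            parent = node.parent_id
            if parent in seen.setdefault(node.tree_id, set()):
                continue
            seen[node.tree_id].add(parent)
            query |= Q(lft__lt=node.lft, rght__gt=node.rght, tree_id=node.tree_id)
        # An empty Q() would match everything, so guard against an empty input qs
        return self.filter(query) if query else self.none()

Any MPTT model could then declare objects = AncestorManager() and call SomeModel.objects.ancestors_of(qs).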

Tenterhook answered 3/7, 2011 at 21:9 Comment(1)
Very smart -- I ended up adding a text field onto my mptt model called ancestor_ids, which holds a comma-separated list of ids -- then it is just a matter of doing this: queryset_ancestor_ids = set([]); for a in queryset.values_list('ancestor_ids', flat=True): if a: queryset_ancestor_ids.update(a.split(",")); return SegmentNode.objects.filter(pk__in=queryset_ancestor_ids) (expanded below for readability). I will do some benchmarking to see which is faster.Imhoff
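
The snippet from that comment, expanded for readability (SegmentNode and ancestor_ids are the commenter's own model and denormalized field; the function wrapper is added here, the body is from the comment):

def qs_ancestors(queryset):
    queryset_ancestor_ids = set()
    # Each ancestor_ids value is a comma-separated string of ancestor pks
    for a in queryset.values_list('ancestor_ids', flat=True):
        if a:
            queryset_ancestor_ids.update(a.split(","))
    return SegmentNode.objects.filter(pk__in=queryset_ancestor_ids)

This trades the lft/rght arithmetic for a single pk__in lookup, at the cost of keeping the cached field up to date.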