Are there any radix/patricia/critbit trees for Python?

I have about 10,000 words used as a set of inverted indices to about 500,000 documents. Both are normalized so the index is a mapping of integers (word id) to a set of integers (ids of documents which contain the word).

My prototype uses Python's set as the obvious data type.

When I do a search for a document I find the list of N search words and their corresponding N sets. I want to return the set of documents in the intersection of those N sets.

Python's set.intersection method is implemented as a pairwise reduction. I think I can do better with a parallel search of sorted sets, as long as the library offers a fast way to get the next entry after i.
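
For reference, a toy sketch of the built-in pairwise reduction (the set contents here are made-up example data; one easy improvement is to probe from the smallest set so intermediates stay small):

```python
# Three inverted-index posting sets (word id -> set of document ids).
doc_sets = [{1, 2, 3, 5, 8}, {2, 3, 5, 7}, {1, 2, 5, 9}]

# Start from the smallest set so every intermediate result is small.
smallest = min(doc_sets, key=len)
result = smallest.intersection(*(s for s in doc_sets if s is not smallest))
print(sorted(result))  # [2, 5]
```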

I've been looking for something like that for some time. Years ago I wrote PyJudy but I no longer maintain it and I know how much work it would take to get it to a stage where I'm comfortable with it again. I would rather use someone else's well-tested code, and I would like one which supports fast serialization/deserialization.

I can't find any, or at least not any with Python bindings. There is avltree, which does what I want, but since even the pairwise set merge takes longer than I want, I suspect I need all my operations done in C/C++.

Do you know of any radix/patricia/critbit tree libraries written as C/C++ extensions for Python?

Failing that, what is the most appropriate library which I should wrap? The Judy Array site hasn't been updated in 6 years, with 1.0.5 released in May 2007. (Although it does build cleanly so perhaps It Just Works.)

(Edit: to clarify what I'm looking for from an API, I want something like:

def merge(document_sets):
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = GET_FIRST(probe_set)

    while IS_VALID(document_id):
        # See if the document is present in all sets
        for i in range(1, len(document_sets)):
            # dynamically adapt to favor the least matching set
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            if document_id not in target_set:
                probe_i = target_i
                probe_set = document_sets[probe_i]
                document_id = GET_NEXT(probe_set, document_id)
                break
        else:
            yield document_id
            document_id = GET_NEXT(probe_set, document_id)

I'm looking for something which implements GET_NEXT() to return the next entry which occurs after the given entry. This corresponds to Judy1N and the similar entries for other Judy arrays.

This algorithm dynamically adapts to the data, preferentially favoring sets with few matches. For the type of data I work with this has given a 5-10% increase in performance.)
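
As a concrete illustration, here is a runnable pure-Python sketch of the same algorithm, using sorted lists and the bisect module to stand in for the GET_FIRST/GET_NEXT/IS_VALID hooks I'd want from a C-level ordered-set library (the names and the toy data are mine, not any library's API):

```python
import bisect

def get_next(sorted_ids, document_id):
    # Smallest entry strictly greater than document_id, or None.
    # Stand-in for Judy1N-style "next after" lookup.
    i = bisect.bisect_right(sorted_ids, document_id)
    return sorted_ids[i] if i < len(sorted_ids) else None

def merge(document_sets):
    # document_sets: non-empty list of sorted lists of document ids.
    probe_i = 0
    probe_set = document_sets[probe_i]
    document_id = probe_set[0] if probe_set else None  # GET_FIRST

    while document_id is not None:  # IS_VALID
        # See if the document is present in all sets.
        for i in range(1, len(document_sets)):
            # Dynamically adapt to favor the least matching set.
            target_i = (i + probe_i) % len(document_sets)
            target_set = document_sets[target_i]
            j = bisect.bisect_left(target_set, document_id)
            if j == len(target_set) or target_set[j] != document_id:
                # Miss: make this set the probe set and advance to its
                # first entry past document_id (GET_NEXT semantics).
                probe_i = target_i
                probe_set = target_set
                document_id = target_set[j] if j < len(target_set) else None
                break
        else:
            yield document_id
            document_id = get_next(probe_set, document_id)

print(list(merge([[1, 2, 3, 5, 8], [2, 3, 5, 7], [1, 2, 5, 9]])))  # [2, 5]
```

Here the membership probe and the "next entry" lookup are two bisect calls per set; the point of a C radix/critbit structure is that one successor query does both.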

Middelburg answered 16/1, 2011 at 18:44 Comment(6)
Just in case: take a look at nltk.org if you have not yet. – Takeover
I have not. A quick scan doesn't find anything relevant. Can you be more specific? – Middelburg
Do you really need a generator, or are you ultimately interested in just getting a set/list of the document numbers? – Schedule
@Justin: All I need is the final set. The example here was to show how I think I can take advantage of an ordered set to make a faster intersection function. Without an iterator some rare cases might return a 15-million-element intersection, but that's not a serious concern for my problem. – Middelburg
Sets really are the way to go here. The problem is that multi-set intersection is implemented such that (N-1) intermediate sets are made. That is the source of the huge slowdown. I would personally just make my own (C) version of set based on the one in the Python distribution. I also think this should be a requested change to the Python distribution. What version of Python are you using? – Schedule
@Justin: I do plan to propose/implement this, plus a new function. There should also be a (has, has_not) = set.split(iterable), where "has" gets all the elements of "set" which are in the iterable and the other elements go into "has_not". I'm using 2.6, and I looked at the 2.7 source to see its implementation. – Middelburg
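(For clarity, the split proposed in that last comment could be prototyped in pure Python like this; this is not an existing set method, just a sketch of the suggested semantics:

```python
def split(s, iterable):
    # Hypothetical set.split: partition s by membership in iterable.
    has = s & set(iterable)
    return has, s - has

has, has_not = split({1, 2, 3, 4}, [2, 4, 6])
print(sorted(has), sorted(has_not))  # [2, 4] [1, 3]
```

The C version would do this in one pass over s instead of building a temporary set.)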

Yes, there are a few, though I'm not sure any of them are suitable for your use case; it seems none are exactly what you asked for.

BioPython has a Trie implementation in C.

Ah, here's a nice discussion including benchmarks: http://bugs.python.org/issue9520

Other (some very stale) implementations:

http://pypi.python.org/pypi/radix

py-radix is an implementation of a radix tree data structure for the storage and retrieval of IPv4 and IPv6 network prefixes.

https://bitbucket.org/markon/patricia-tree/src

A Python implementation of patricia-tree

http://pypi.python.org/pypi/trie

A prefix tree (trie) implementation.

http://pypi.python.org/pypi/logilab-common/0.50.3

patricia.py : A Python implementation of PATRICIA trie (Practical Algorithm to Retrieve Information Coded in Alphanumeric).

Advisee answered 16/1, 2011 at 19:30 Comment(1)
Biopython's trie doesn't support a "get next entry after i" in the API. py-radix is specifically for IPv4/IPv6-formatted addresses and not a general-purpose library. trie is (as you say) pure Python and does not have a "get next" interface. "patricia.py" is noticeably missing from the logilab repository. – Middelburg

I've recently added iteration support to datrie, you may give it a try.

Fluff answered 27/7, 2012 at 0:45 Comment(4)
Sadly, this crashes for me: Python 3.3, Win7 32-bit, datrie 0.4.2, running the example code given on the GitHub page. :-/ – Makebelieve
The application crashes when loading datrie: c0000374, which looks like heap corruption. – Makebelieve
Can you please file an issue in the bug tracker? Is the alphabet set properly? – Fluff
As far as I can remember I set 'abcdef' as the alphabet and only added t['a'] = 'a', t['b'] = 'b', etc., with t being a Trie. I will try to elaborate on this soon; I have removed datrie for now and will install it again tonight. I use MinGW 4.6.2 win32 to compile. – Makebelieve
