strcmp for Python, or how to sort substrings efficiently (without copying) when building a suffix array
Here's a very simple way to build a suffix array from a string in Python:

def sort_offsets(a, b):
    return cmp(content[a:], content[b:])

content = "foobar baz foo"
suffix_array = range(len(content))  # one offset per suffix
suffix_array.sort(cmp=sort_offsets)
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

However, "content[a:]" makes a copy of the tail of content, which becomes very inefficient when content gets large. So I wonder if there's a way to compare the two substrings without having to copy them. I've tried to use the buffer builtin, but it didn't work.

Recalcitrate asked 17/2, 2010 at 16:47
What does your 'content' typically look like? English text? A random sequence? Something in between? What are the chances of long (say, over 100 characters) repeated substrings in 'content'? – Phonoscope
I wrote this Python code that can sort all substrings of a long string (1,000,000 characters) and find the longest repeated substring in 5 seconds. – Sniggle
The buffer function does not copy the whole string, but creates an object that only references the source string. Combined with the key-based suggestion from the other answer, that would be:

suffix_array.sort(key=lambda a: buffer(content, a))
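
A complete version of the question's small example using this approach might look like the following (a sketch, assuming Python 2, where buffer exists):

content = "foobar baz foo"
suffix_array = range(len(content))
# buffer(content, a) is a lightweight view of content[a:]; buffers compare
# lexicographically byte by byte, so no suffix is ever copied
suffix_array.sort(key=lambda a: buffer(content, a))
print suffix_array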
Roseline answered 17/2, 2010 at 16:59
I've played around with buffer too, but didn't get the correct suffix array. This line looks very promising, but it doesn't work either (my short example above works, but it breaks on larger ones). As soon as I know why it breaks, I'll comment again. – Recalcitrate
Ha! It's because str != unicode. My larger strings are unicode, and therefore I should have written: sizeof_pointer = len(str(buffer(u'a'))); suffix_array.sort(key=lambda a: buffer(content, a * sizeof_pointer)). To avoid this ugliness, I'll use normalized UTF-8-encoded strings instead of unicode. Weird. – Recalcitrate
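
A minimal sketch of the workaround described in that comment (assuming Python 2 and unicode input): buffer offsets count bytes, so encoding to a byte string before building the suffix array makes the offsets line up again:

content = content.encode("utf-8")  # buffer offsets are byte offsets, so sort bytes
suffix_array = range(len(content))
suffix_array.sort(key=lambda a: buffer(content, a))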
I don't know if there's a fast way to compare substrings, but you can make your code much faster (and simpler) by using key instead of cmp:

suffix_array.sort(key=lambda a: content[a:])

This will create the substring just once for each value of a.

Edit: A possible downside is that it will require O(n^2) memory for the substrings: the key function materializes all n suffixes at once, roughly n^2/2 characters in total.

Vicennial answered 17/2, 2010 at 16:52
And the sort's cmp argument is gone in 3.x. – Entomophagous
This takes 15 seconds for a string of length 75,000 on my machine, so this won't scale, but nice idea. – Recalcitrate
+1 for a very interesting problem! I can't see any obvious way to do this directly, but I was able to get a significant speedup (an order of magnitude for 100000 character strings) by using the following comparison function in place of yours:

def compare_offsets2(a, b):
    # compare cheap 10-character prefixes first; only on a tie (cmp == 0)
    # fall back to comparing the full suffixes
    return (cmp(content[a:a+10], content[b:b+10]) or
            cmp(content[a:], content[b:]))

In other words, start by comparing the first 10 characters of each suffix; only if the result of that comparison is 0, indicating that you've got a match for the first 10 characters, do you go on to compare the entire suffixes.

Obviously 10 could be anything: experiment to find the best value.

This comparison function is also a nice example of something that isn't easily replaced with a key function.

Phonoscope answered 17/2, 2010 at 18:38
I get even better results by first doing a key-based sort with key = lambda a: content[a:a+10] and then following up with the cmp-based sort above. Python's sort algorithm does especially well for lists that are already in almost-sorted order. – Phonoscope
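
A minimal sketch of that two-pass idea (Python 2, reusing content, suffix_array, and compare_offsets2 from above):

# Pass 1: near-sort on cheap 10-character prefix keys.
suffix_array.sort(key=lambda a: content[a:a+10])
# Pass 2: the stable sort finishes quickly on almost-sorted input.
suffix_array.sort(cmp=compare_offsets2)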
Good idea! The real use case is to find all documents matching some alphanumeric substring. The "content" is a concatenation of all documents, and since I know the offset of each document in content, I use binary search (in the offset list) to find the maximum number of characters I'll have to copy. But it's still pretty slow. Probably I'll have to do it in C... – Recalcitrate
One other possibility before you resort to C: it might be possible to use the ctypes module to get at C's strcmp, and use that from Python. Embedded null characters in Python strings would cause difficulties, but that might not be an issue in practice. But I agree that this seems like exactly the sort of task where rewriting the slow part in C is appropriate. – Phonoscope
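
For anyone who wants to experiment with that ctypes route, a rough sketch (Python 2 only; it relies on CPython str objects being NUL-terminated internally, the address arithmetic is illustrative, and embedded NUL bytes in content would break it):

import ctypes
from ctypes.util import find_library

libc = ctypes.CDLL(find_library("c"))
libc.strcmp.restype = ctypes.c_int
libc.strcmp.argtypes = [ctypes.c_char_p, ctypes.c_char_p]

# Address of the string's internal character buffer.
base = ctypes.cast(ctypes.c_char_p(content), ctypes.c_void_p).value

def cmp_offsets(a, b):
    # strcmp walks the two tails in C; nothing is copied on the Python side.
    return libc.strcmp(ctypes.c_char_p(base + a), ctypes.c_char_p(base + b))

suffix_array.sort(cmp=cmp_offsets)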
You could use the blist extension type that I wrote. A blist works like the built-in list, but (among other things) uses copy-on-write so that taking a slice takes O(log n) time and memory.

from blist import blist

content = "foobar baz foo"
content = blist(content)  # blist slices are copy-on-write: O(log n) time and memory
suffix_array = range(len(content))
suffix_array.sort(key=lambda a: content[a:])
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

I was able to create a suffix_array from a randomly generated 100,000-character string in under 5 seconds, and that includes generating the string.

Eyrie answered 5/3, 2010 at 0:04
Tested blist with a test string of 12 million characters. I gave up after one hour of CPU time; memory usage was 21 GB at that point and growing. The buffer solution uses 6.7 GB and finishes after 8 minutes. Since my real data has about 500 million characters, both solutions won't work. At the moment I use sites.google.com/site/yuta256/sais to get the sorted suffix_array and read it back into Python with array.array.fromfile... – Recalcitrate
I suggest you update your original question to mention that your real data contains 500 million characters. How much memory does it take just to store the suffix_array before sorting? For data that large, I'd definitely use C to cut down on the memory overhead. – Eyrie
