Fastest way to search a 1 GB+ string of data for the first occurrence of a pattern

There's a 1 gigabyte string of arbitrary data which you can assume to be equivalent to something like:

1_gb_string=os.urandom(1*gigabyte)

We will be searching this string, 1_gb_string, for an infinite number of fixed-width, 1 kilobyte patterns, 1_kb_pattern. The pattern will be different every time we search, so caching opportunities are not apparent. The same 1 gigabyte string will be searched over and over. Here is a simple generator to describe what's happening:

def findit(gb_string):                    # gb_string stands for the 1_gb_string above;
    while True:                           # Python names can't start with a digit
        kb_pattern = get_next_pattern()
        yield gb_string.find(kb_pattern)

Note that only the first occurrence of the pattern needs to be found. After that, no other major processing should be done.

What can I use that's faster than Python's built-in find() for matching 1 KB patterns against 1 GB or greater data strings, memory requirements limited to 16 GB?

(I am already aware of how to split up the string and search it in parallel, so you can disregard that basic optimization.)

Guyguyana answered 17/11, 2009 at 17:11 Comment(3)
Is 1_gb_string likely to change?Starkey
It doesn't sound like it, but are you only searching along the fixed-width chunks? As in, if it were bytes and megabytes instead of kilobytes and gigabytes, would a string that contains the four bytes "49FA 32D1" match the two-byte pattern "FA32"?Parthen
> Is 1_gb_string likely to change? No, it remains the same throughout all the runs. > Are you only searching along the fixed-width chunks? Unfortunately no.Guyguyana
12

As you clarify that long-ish preprocessing is acceptable, I'd suggest a variant of Rabin-Karp: "an algorithm of choice for multiple pattern search", as wikipedia puts it.

Define a "rolling hash" function, i.e., one such that, when you know the hash for haystack[x:x+N], computing the hash for haystack[x+1:x+N+1] is O(1).
(Normal hashing functions such as Python's built-in hash do not have this property, which is why you have to write your own, otherwise the preprocessing becomes exhaustingly long rather than merely long-ish;-).
A polynomial approach is fruitful, and you could use, say, 30-bit hash results (by masking if needed, i.e., you can do the computation w/more precision and just store the masked 30 bits of choice).
Let's call this rolling hash function RH for brevity.
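
For concreteness, here is a minimal sketch of such a polynomial rolling hash over bytes, masked to 30 bits; the base and the helper names are illustrative choices, not anything prescribed by this answer:

MASK = (1 << 30) - 1              # keep only 30 bits of the polynomial hash
BASE = 257                        # illustrative odd base larger than any byte value
N = 1024                          # window length = pattern length

def rh_init(window):
    # hash of the first N-byte window: sum of byte * BASE**(N-1-i), mod 2**30
    h = 0
    for byte in window:           # assumes a Python 3 bytes object
        h = (h * BASE + byte) & MASK
    return h

TOP = pow(BASE, N - 1, 1 << 30)   # weight of the byte that leaves the window

def rh_roll(h, out_byte, in_byte):
    # hash of haystack[x+1:x+N+1], given h = hash of haystack[x:x+N]: O(1)
    h = (h - out_byte * TOP) & MASK
    return (h * BASE + in_byte) & MASK

Rolling along the haystack is then one rh_init call followed by roughly a billion rh_roll calls, each fed the byte leaving and the byte entering the window.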

So, compute 1 G of RH results as you roll along the haystack 1 GB string; if you just stored these it would give you an array H of 1 G 30-bit values (4 GB) mapping index-in-haystack -> RH value. But you want the reverse mapping, so instead use an array A of 2**30 entries (1 G entries) that for each RH value gives you all the indices of interest in the haystack (the indices at which that RH value occurs). For each entry you store the index of the first possibly-interesting haystack index into another array B of 1 G indices into the haystack, which is ordered so that all haystack indices with identical RH values ("collisions" in hashing terms) are adjacent.
H, A and B each have 1 G entries of 4 bytes, so 12 GB total.
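
A hedged sketch of that construction, done as a counting sort over the precomputed RH values (plain Python lists for clarity; at this scale you would want packed int32 arrays, e.g. NumPy, to stay near the 12 GB figure; rh_values here is assumed to be the array H just described):

def build_reverse_index(rh_values, hash_bits=30):
    n_buckets = 1 << hash_bits

    # pass 1: count how many haystack positions fall into each hash bucket
    counts = [0] * n_buckets
    for h in rh_values:
        counts[h] += 1

    # A[k] = index into B of the first haystack position whose RH value is k
    # (built with one extra sentinel entry, so A[k+1] bounds bucket k)
    A = [0] * (n_buckets + 1)
    for k in range(n_buckets):
        A[k + 1] = A[k] + counts[k]

    # pass 2: scatter the haystack positions into B, grouped by RH value
    B = [0] * len(rh_values)
    fill = A[:n_buckets]          # working copy of the bucket start positions
    for i, h in enumerate(rh_values):
        B[fill[h]] = i
        fill[h] += 1
    return A, B

The sentinel entry also makes the range(A[k], A[k+1]) form mentioned in the comment below work directly.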

Now for each incoming 1 K needle, compute its RH, call it k, and use it as an index into A; A[k] gives you the first index ib into B at which it's worth comparing. So, do:

def lookup(needle, k):                      # k = RH(needle)
    ib = A[k]                               # first slot in B for hash value k
    b = B[ib]
    while b <= len(haystack) - 1024:
        if H[b] != k:                       # ran past the bucket for k
            return "not found"
        if needle == haystack[b:b+1024]:    # verify against the actual data
            return "found at", b
        ib += 1
        b = B[ib]
    return "not found"

With a good RH you should have few collisions, so the while loop should execute very few times before returning one way or another.
Each needle search should therefore be really, really fast.

Sip answered 17/11, 2009 at 18:19 Comment(1)
I think you can do without H once A and B are set up: for ib in range(A[k], A[k+1]): …. Almost a pity 4 bytes isn't all it takes to have an int in a Python list… a NumPy ndarray(…, int32) should work.Praetorian
5

There are a number of string matching algorithms used in the field of genetics to find substrings. You might try this paper or this paper.

Evaleen answered 17/11, 2009 at 17:19 Comment(2)
Don't see a perfect fit yet, but something along these lines may be the way to go. Generic caching and dynamic programming as others have suggested are not practical, since they would quickly exhaust all memory.Guyguyana
Alas, both links are broken.Praetorian
1

Are you willing to spend a significant time preprocessing the string?

If you are, what you can do is build a list of n-grams with offsets.

Suppose your alphabet is hex bytes and you are using 1-grams.

Then for 00-ff, you can create a dictionary that looks like this (Perl-ese, sorry):

$offset_list{'00'} = \@array_of_offsets;
$offset_list{'01'} = ...;  # etc.

where you walk down the string and build the @array_of_offsets from all points where each n-gram occurs. You can do this for arbitrary n-grams.

This provides a "start point for search" that you can use to walk.

Of course, the downside is that you have to preprocess the string, so that's your tradeoff.
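
Since the question is in Python, here is the same idea as a hedged Python sketch (the 1-gram case; the helper names are mine):

from collections import defaultdict

def build_ngram_offsets(data, n=1):
    # map every n-gram to the (increasing) list of offsets at which it occurs
    offsets = defaultdict(list)
    for i in range(len(data) - n + 1):
        offsets[data[i:i+n]].append(i)
    return offsets

def candidate_starts(offsets, pattern, n=1):
    # the "start points for search": offsets where the pattern's first n-gram occurs
    return offsets.get(pattern[:n], [])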

edit:


The basic idea here is to match prefixes. This may bomb out badly if the information is super-similar, but if it has a fair amount of divergence between n-grams, you should be able to match prefixes pretty well.

Let's quantify divergence, since you've not discussed the kind of info you're analyzing. For the purposes of this algorithm, we can characterize divergence as a distance function: you need a decently high Hamming distance. If the Hamming distance between n-grams is, say, 1, the above idea won't work. But if it's n-1, the above algorithm will be much easier.

To improve on my algorithm, let's build an algorithm that does some successive elimination of possibilities:

We can invoke Shannon Entropy to define information of a given n-gram. Take your search string and successively build a prefix based upon the first m characters. When the entropy of the m-prefix is 'sufficiently high', use it later.

  1. Define p to be an m-prefix of the search string
  2. Search your 1 GB string and create an array of offsets that match p.
  3. Extend the m-prefix to be some k-prefix, k > m, entropy of k-prefix higher than m-prefix.
  4. Keep the elements of the offset array defined above that match the k-prefix string. Discard the non-matching elements.
  5. Go to step 3 until the entire search string has been matched.

In a sense, this is like reversing Huffman encoding.
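
A hedged Python sketch of steps 1-5 above; it grows the prefix by a fixed step rather than by the entropy criterion, which this sketch leaves out, and the function name is mine:

def successive_elimination(haystack, needle, m=4, step=4):
    # steps 1-2: offsets at which the initial m-byte prefix of the needle occurs
    prefix_len = min(m, len(needle))
    candidates = [i for i in range(len(haystack) - len(needle) + 1)
                  if haystack[i:i + prefix_len] == needle[:prefix_len]]
    # steps 3-5: extend the prefix and discard offsets that stop matching
    while candidates and prefix_len < len(needle):
        prefix_len = min(prefix_len + step, len(needle))
        candidates = [i for i in candidates
                      if haystack[i:i + prefix_len] == needle[:prefix_len]]
    return candidates[0] if candidates else -1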

Racoon answered 17/11, 2009 at 17:23 Comment(2)
I am willing to spend significant time preprocessing, but only up to 16GB of overhead memory for this 1GB string.Guyguyana
Well, that means your n-grams probably have to be on the order of 32-grams (8GB of memory storing n-grams).Racoon
1

As far as I know, the standard find algorithm is the naive algorithm, with complexity of about n*m comparisons, because it checks the pattern against every possible offset. There are some more effective algorithms, requiring about n+m comparisons. If your string is not a natural language string, you can try the Knuth–Morris–Pratt algorithm. The Boyer–Moore search algorithm is fast and simple enough too.
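
For reference, a minimal Knuth–Morris–Pratt sketch in Python (it guarantees roughly n+m comparisons, although a pure-Python loop will usually still lose to the C-level str.find on raw speed):

def kmp_find(haystack, needle):
    # build the failure table: fail[i] = length of the longest proper
    # prefix of needle[:i+1] that is also a suffix of it
    fail = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k and needle[i] != needle[k]:
            k = fail[k - 1]
        if needle[i] == needle[k]:
            k += 1
        fail[i] = k
    # scan the haystack, never re-examining matched characters
    k = 0
    for i, ch in enumerate(haystack):
        while k and ch != needle[k]:
            k = fail[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            return i - len(needle) + 1      # first occurrence
    return -1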

Guglielma answered 17/11, 2009 at 17:41 Comment(0)
1

One efficient but complex way is full-text indexing with the Burrows-Wheeler transform. It involves performing a BWT on your source text, then using a small index on that to quickly find any substring in the text that matches your input pattern.

The time complexity of this algorithm is roughly O(n) in the length of the pattern you're matching, and independent of the length of the string being searched! Further, the size of the index is not much larger than the input data, and with compression it can even be reduced below the size of the source text.
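
A full FM-index is a fair amount of machinery; as a rough stand-in that shows the same query pattern (index once, then answer substring queries without rescanning the text), here is a naive suffix-array sketch. The construction below is only a toy, far too slow and memory-hungry for a 1 GB string, but the lookup shows why a query touches only a handful of suffixes instead of scanning the whole text:

def build_suffix_array(text):
    # naive O(n^2 log n) construction: fine for a toy, hopeless at 1 GB
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_find_first(text, sa, pattern):
    # binary search for the first suffix whose prefix is >= pattern
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # suffixes starting with pattern are contiguous from lo onward;
    # the question wants the smallest text offset among them
    best = -1
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        best = sa[lo] if best < 0 else min(best, sa[lo])
        lo += 1
    return best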

Tedesco answered 19/11, 2009 at 10:19 Comment(0)
0

With infinite memory, you can hash every 1 KB window along with its position in the 1 GB string.

With less than infinite memory, you will be bounded by how many memory pages you touch when searching.
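
A hedged sketch of that idea, with a 32-bit CRC standing in for the hash and a verification step to rule out collisions (as plain Python objects this index would be far larger than the 16 GB budget, which is the point about being bounded by memory):

from collections import defaultdict
import zlib

def build_window_index(haystack, k=1024):
    # map a 32-bit checksum of every k-byte window to the offsets where it starts
    index = defaultdict(list)
    for i in range(len(haystack) - k + 1):
        index[zlib.crc32(haystack[i:i + k])].append(i)
    return index

def find_first(haystack, index, needle):
    for i in index.get(zlib.crc32(needle), []):     # offsets were stored in order
        if haystack[i:i + len(needle)] == needle:   # verify: checksums can collide
            return i
    return -1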

Thoria answered 17/11, 2009 at 17:18 Comment(0)
0

I don't know definitively if the find() method for strings is faster than the search() method provided by Python's re (regular expressions) module, but there's only one way to find out.

If you're just searching a string, what you want is this:

import re
def findit(gb_string, kb_pattern):
    # escape the pattern: it's arbitrary bytes, not a regular expression
    yield re.search(re.escape(kb_pattern), gb_string)

However, if you really only want the first match, you might be better off using finditer(), which returns an iterator; for operations this large, that may actually work out better.

Smashing answered 17/11, 2009 at 17:28 Comment(0)
0

http://www.youtube.com/watch?v=V5hZoJ6uK-s will be of most value to you. It's an MIT lecture on dynamic programming.

Cooperate answered 17/11, 2009 at 17:30 Comment(0)
0

If the patterns are fairly random, you can precompute the locations of n-byte prefixes in the string.

Instead of going over all options for n-prefixes, just use the actual ones in the 1GB string - there will be fewer than 1 Gig of those. Use as big a prefix as fits in your memory; I don't have 16GB of RAM to check, but a prefix of 4 could work (at least with a memory-efficient data structure); if not, try 3 or even 2.

For a random 1GB string and random 1KB patterns, you should get a few tens of locations per prefix if you use 3-byte prefixes, but 4-byte prefixes should get you an average of 0 or 1, so lookup should be fast.

Precompute Locations

def find_all(pattern, string):
    cur_loc = 0
    while True:
        next_loc = string.find(pattern, cur_loc)
        if next_loc < 0: return
        yield next_loc
        cur_loc = next_loc + 1

big_string = ...
CHUNK_SIZE = 1024
PREFIX_SIZE = 4
precomputed_indices = {}
for i in xrange(len(big_string) - CHUNK_SIZE + 1):
    prefix = big_string[i:i+PREFIX_SIZE]
    if prefix not in precomputed_indices:
        precomputed_indices[prefix] = tuple(find_all(prefix, big_string))

Look up a pattern

def find_pattern(pattern):
    prefix = pattern[:PREFIX_SIZE]
    # optimization - big prefixes will result in many misses
    if prefix not in precomputed_indices:
        return -1
    for loc in precomputed_indices[prefix]:
        if big_string[loc:loc+CHUNK_SIZE] == pattern:
            return loc
    return -1
Freedwoman answered 17/11, 2009 at 18:7 Comment(0)
0

Someone hinted at a possible way to index this thing if you have abundant RAM (or possibly even disk/swap) available.

Imagine if you performed a simple 32-bit CRC on a 1 K block extending from each character in the original Gig string. This would result in 4 bytes of checksum data for each byte offset from the beginning of the data.

By itself this might give a modest improvement in search speed. The checksum of each 1 K search target could be checked against each CRC, with each collision tested for a true match. That should still be a couple of orders of magnitude faster than a normal linear search.

That, obviously, costs us 4 GB of RAM for the CRC array (plus the original Gig for the original data and a little more overhead for the environment and our program).

If we have ~16 GB we could sort the checksums and store a list of offsets where each is found. That becomes an indexed search (an average of about 16 probes per target search, worst case around 32 or 33 - there might be a fence post there).

It's possible that a 16 GB file index would still give better performance than a linear checksum search and it would almost certainly be better than a linear raw search (unless you have extremely slow filesystems/storage).
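
A hedged sketch of the sorted-checksum variant described above, with zlib.crc32 as the CRC and an ordinary sorted list of (checksum, offset) pairs standing in for the packed multi-gigabyte index (as Python objects this would not actually fit the budget; it is only meant to show the probe logic):

import bisect
import zlib

def build_crc_index(haystack, k=1024):
    # one (checksum, offset) pair per byte offset, sorted by checksum then offset
    return sorted((zlib.crc32(haystack[i:i + k]), i)
                  for i in range(len(haystack) - k + 1))

def crc_find_first(haystack, index, needle):
    key = zlib.crc32(needle)
    pos = bisect.bisect_left(index, (key, 0))            # first entry with this checksum
    while pos < len(index) and index[pos][0] == key:
        offset = index[pos][1]
        if haystack[offset:offset + len(needle)] == needle:   # test collision for a true match
            return offset     # entries with equal checksums are ordered by offset
        pos += 1
    return -1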

(Adding): I should clarify that this strategy is only beneficial given that you've described a need to do many searches on the same one gigabyte data blob.

You might use a threaded approach to building the index (one thread reading the data while multiple threads perform the checksumming). You might also offload the indexing into separate processes or a cluster of nodes (particularly if you use a file-based index - the ~16 GB option described above). With a simple 32-bit CRC you might be able to perform the checksums/indexing as fast as your reader thread can get the data (but we are talking about 1024 checksums for each 1 K of data, so perhaps not).

You might further improve performance by coding a Python module in C for actually performing the search ... and/or possibly for performing the checksumming/indexing.

The development and testing of such C extensions entail other trade-offs, obviously enough. It sounds like this would have near zero re-usability.

Faith answered 17/11, 2009 at 18:39 Comment(0)
-1

In this problem, we have a 1-gigabyte string of arbitrary data named 1_gb_string, and we need to search for an infinite number of fixed-width 1-kilobyte patterns (1_kb_pattern) within this string. The goal is to find the first occurrence of each pattern efficiently, considering memory limitations of up to 16 GB.

To accomplish this, we can use the following approach:

  1. Step 1: Define a function search_pattern_in_large_string that takes the pattern and large_string as input.

    import re
    
    def search_pattern_in_large_string(pattern, large_string):
    
  2. Step 2: Set the chunk size to read from the large string. In this case, we choose a chunk size of 1024 bytes (1 kilobyte).

    chunk_size = 1024
    
  3. Step 3: Open the large string file and iterate through it in chunks until the end of the file is reached.

    with open(large_string, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break  # Reached the end of the file
    
  4. Step 4: Use regular expressions (re) to search for the pattern within each chunk of the large string.

            match = re.search(pattern, chunk)
    
  5. Step 5: If a match is found, return the matched pattern. This will provide the first occurrence of the pattern in the large string.

            if match:
                return match.group()  # Return the matched pattern
    
  6. Step 6: If the pattern is not found after searching through the entire large string, return None.

    return None  # Pattern not found
    
  7. Step 7: Now, you can use the search_pattern_in_large_string function by passing the pattern to search (pattern_to_search) and the file path of the large string (large_string_file_path). The function will return the first occurrence of the pattern or None if the pattern is not found.

    pattern_to_search = r'example pattern'
    large_string_file_path = 'path/to/large_string.txt'
    result = search_pattern_in_large_string(pattern_to_search, large_string_file_path)
    if result:
        print("Pattern found:", result)
    else:
        print("Pattern not found.")
    

By implementing this approach, you can efficiently search for 1-kilobyte patterns within a 1-gigabyte or larger data string while considering memory limitations. The function reads the string in manageable chunks, utilizing regular expressions to find the pattern's first occurrence.

Richelle answered 12/6, 2023 at 10:59 Comment(3)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Ricarda
How long will this take when pattern does not occur in large_string? What use is the return value? What if an occurrence "straddles a chunk boundary"? (The limit is 16 GB, not 16 MB.)Praetorian
You should add explanation.Kierkegaard
