There's a 1 gigabyte string of arbitrary data, which you can assume is equivalent to something like:

    import os
    one_gb_string = os.urandom(2**30)   # 1 GB of random bytes
We will be searching this string, one_gb_string, for an effectively unlimited stream of fixed-width, 1 kilobyte patterns, one_kb_pattern. The pattern is different on every search, so there are no obvious caching opportunities, but the same 1 gigabyte string is searched over and over. Here is a simple generator to describe what's happening:
    def findit(one_gb_string):
        while True:
            # each search uses a brand-new 1 KB pattern
            one_kb_pattern = get_next_pattern()
            yield one_gb_string.find(one_kb_pattern)
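For concreteness, a throwaway harness to exercise findit could look like the sketch below. Here get_next_pattern and ONE_KB are stand-ins of my own for illustration: the fake pattern source just pulls random 1 KB slices out of one_gb_string so that every search is guaranteed to hit, whereas the real pattern source is external and unpredictable.

    import random

    ONE_KB = 1024

    def get_next_pattern():
        # Stand-in pattern source: a random 1 KB slice of one_gb_string
        # (defined above), so every search finds a match somewhere.
        start = random.randrange(len(one_gb_string) - ONE_KB)
        return one_gb_string[start:start + ONE_KB]

    searches = findit(one_gb_string)    # findit as defined above
    for _ in range(10):                 # a handful of searches, e.g. for timing
        print(next(searches))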
Note that only the first occurrence of each pattern needs to be found; after that, no other major processing should be done.
What can I use that's faster than Python's built-in find() for matching 1 KB patterns against data strings of 1 GB or more, with memory limited to 16 GB?
(I am already aware of how to split up the string and search it in parallel, so you can disregard that basic optimization.)
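For reference, the basic parallel split I mean is roughly the sketch below (parallel_find and _search_chunk are just illustrative names). The chunks have to overlap by len(pattern) - 1 bytes so a match straddling a chunk boundary isn't missed, and in a real run the haystack would be shared via shared memory or mmap rather than being pickled to every worker the way it is here.

    from multiprocessing import Pool

    def _search_chunk(args):
        # Search one chunk; 'base' is the chunk's offset in the full string.
        chunk, pattern, base = args
        idx = chunk.find(pattern)
        return base + idx if idx != -1 else -1

    def parallel_find(haystack, pattern, workers=8):
        overlap = len(pattern) - 1
        step = len(haystack) // workers
        # Overlapping chunks so boundary-straddling matches are still found.
        jobs = [(haystack[base:base + step + overlap], pattern, base)
                for base in range(0, len(haystack), step)]
        with Pool(workers) as pool:
            hits = [h for h in pool.map(_search_chunk, jobs) if h != -1]
        return min(hits) if hits else -1    # earliest occurrence overall

    if __name__ == "__main__":
        import os
        data = os.urandom(64 * 2**20)       # smaller haystack for a quick test
        pattern = data[5_000_000:5_000_000 + 1024]
        print(parallel_find(data, pattern))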