The insight behind Boyer-Moore is that if you start searching for a pattern in a string starting with the last character in the pattern, you can jump your search forward multiple characters when you hit a mismatch.
Let's say our pattern p
is the sequence of characters p1
, p2
, ..., pn
and we are searching a string s
, currently with p
aligned so that pn
is at index i
in s
.
E.g.:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
The B-M paper makes the following observations:
(1) if we try matching a character that is not in p
then we can jump forward n
characters:
'F' is not in p
, hence we advance n
characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
(2) if we try matching a character whose last position is k
from the end of p
then we can jump forward k
characters:
' 's last position in p
is 4 from the end, hence we advance 4 characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
Now we scan backwards from i
until we either succeed or we hit a mismatch.
(3a) if the mismatch occurs k
characters from the start of p
and the mismatched character is not in p
, then we can advance (at least) k
characters.
'L' is not in p
and the mismatch occurred against p6
, hence we can advance (at least) 6 characters:
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
However, we can actually do better than this.
(3b) since we know that at the old i
we'd already matched some characters (1 in this case). If the matched characters don't match the start of p
, then we can actually jump forward a little more (this extra distance is called 'delta2' in the paper):
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
At this point, observation (2) applies again, giving
s = WHICH FINALLY HALTS. AT THAT POINT...
p = AT THAT
i = ^
and bingo! We're done.