Do I have to take the encoding into account when performing Boyer-Moore pattern matching?

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?

Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?

In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.

So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?

Recommended topics

Hot tags