Do I have to take the encoding into account when performing Boyer-Moore pattern matching?
Asked Answered
E

1

3

I'm about to implement a variation of the Boyer-Moore pattern matching algorithm (the Sunday algorithm to be specific) and I was asking myself: What is my alphabet size?

Does it depend on the encoding (= number of possible characters) or can I just assume my alphabet consists of 256 symbols (= number of symbols which can be represented by a byte)?

In many other situations treating characters as bytes would be a problem because depending on the encoding a character can consist of multiple bytes, but if in my case both strings have the same encoding then equal characters are represented by equal byte sequences, so I would assume it doesn't matter.

So: Do I have to take the encoding into account and assume an alphabet consisting of the actual characters (> 90000 for Unicode) or can I just handle the text and the pattern as a stream of bytes?

Electrothermics answered 19/5, 2011 at 8:41 Comment(0)
P
4

A multi-byte encoding can be used with a byte-oriented search routine IF it is self-synchronizing.

So, you can safely use Boyer-Moore with:

  • CESU-8
  • UTF-8
  • UTF-EBCDIC

But can NOT use it with:

  • Big5
  • EUC-JP
  • GBK / GB18030
  • ISO 2022
  • Johab
  • Punycode
  • Shift-JIS
  • UTF-7
  • UTF-16
  • UTF-32
Privity answered 20/5, 2011 at 5:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.