In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. Control words end with either a space or a non-alphabetic character. A control word can have a numeric parameter following it, without any delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.
I think I have managed to make a pattern that takes care of most of the cases.
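To illustrate the control word syntax described above, here is a small sketch in Python 3. The `CONTROL_WORD` pattern and the `control_words` helper are my own names for illustration, not part of any library, and the sketch only looks at alphabetic control words with an optional numeric parameter.

```python
import re

# Hypothetical pattern: backslash, letters, optional signed number, optional delimiter space.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")

def control_words(rtf):
    """Return (name, parameter) pairs; parameter is None when the word has none."""
    return [(m.group(1), int(m.group(2)) if m.group(2) else None)
            for m in CONTROL_WORD.finditer(rtf)]

sample = r"{\rtf1\ansi{\b\fs24 Hello}\par}"
print(control_words(sample))
# [('rtf', 1), ('ansi', None), ('b', None), ('fs', 24), ('par', None)]
```

Note how `\fs24` is split into the word `fs` and the parameter `24` even though nothing separates them.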
\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?
It leaves a few stray spaces when run on your sample, though.
Going through the RTF specification (some of it), I see that there are a lot of pitfalls for purely regex-based strippers. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).
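For reference, this is roughly how the pattern could be applied in Python 3 with `re.sub`. The function name `strip_with_regex` and the sample string are made up for this sketch.

```python
import re

# The pattern from above, unchanged, as a raw string.
RTF_PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

def strip_with_regex(rtf):
    """Remove everything the pattern matches and keep the rest."""
    return RTF_PATTERN.sub("", rtf)

sample = r"{\rtf1\ansi\deff0 {\fonttbl{\f0 Arial;}}\f0\fs24 Hello, \b world\b0 !\par}"
print(strip_with_regex(sample))  # Hello, world!
```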
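To make that pitfall concrete, here is a small Python 3 check using the regex from above (the inputs are invented for the example). The pattern cannot tell a formatting group from a header group: it drops bold text that should be kept, and it leaks header text as soon as the header contains a nested group.

```python
import re

RTF_PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# A formatting group: the text should be rendered, but the whole group is removed.
bold = RTF_PATTERN.sub("", r"{\b bold text}")

# A header group with a nested group: the header should be ignored,
# but the text outside the inner group survives.
header = RTF_PATTERN.sub("", r"{\header Draft {\i v2} copy}")

print(repr(bold))    # ''
print(repr(header))  # 'Draft  copy'
```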
I have written a Python script that should work better than my regex above:
import re

def striprtf(text):
    pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
    # Control words which specify a "destination".
    destinations = frozenset((
        # ...
    ))
    # Translation of some special characters.
    specialchars = {
        'par': '\n',
        'sect': '\n\n',
        'page': '\n\n',
        'line': '\n',
        'tab': '\t',
        'emdash': u'\u2014',
        'endash': u'\u2013',
        'emspace': u'\u2003',
        'enspace': u'\u2002',
        'qmspace': u'\u2005',
        'bullet': u'\u2022',
        'lquote': u'\u2018',
        'rquote': u'\u2019',
        'ldblquote': u'\u201C',
        'rdblquote': u'\u201D',
    }
    stack = []
    ignorable = False  # Whether this group (and all inside it) are "ignorable".
    ucskip = 1         # Number of ASCII characters to skip after a unicode character.
    curskip = 0        # Number of ASCII characters left to skip.
    out = []           # Output buffer.
    for match in pattern.finditer(text):
        word, arg, hex, char, brace, tchar = match.groups()
        if brace:
            curskip = 0
            if brace == '{':
                # Push state
                stack.append((ucskip, ignorable))
            elif brace == '}':
                # Pop state
                ucskip, ignorable = stack.pop()
        elif char:  # \x (not a letter)
            curskip = 0
            if char == '~':
                # \~ is a non-breaking space.
                if not ignorable:
                    out.append(u'\xA0')
            elif char in '{}\\':
                # Escaped brace or backslash: emit it literally.
                if not ignorable:
                    out.append(char)
            elif char == '*':
                ignorable = True
        elif word:  # \foo
            curskip = 0
            if word in destinations:
                ignorable = True
            elif ignorable:
                pass
            elif word in specialchars:
                out.append(specialchars[word])
            elif word == 'uc':
                ucskip = int(arg)
            elif word == 'u':
                c = int(arg)
                if c < 0: c += 0x10000
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
                curskip = ucskip
        elif hex:  # \'xx
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                c = int(hex, 16)
                if c > 127: out.append(unichr(c))
                else: out.append(chr(c))
        elif tchar:
            if curskip > 0:
                curskip -= 1
            elif not ignorable:
                out.append(tchar)
    return ''.join(out)
It works by parsing the RTF code and skipping any group that has a "destination" specified, as well as all "ignorable" groups ({\*...}). I also added handling of some special characters.
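The group-skipping idea on its own can be sketched in a few lines of Python 3. This is a stripped-down illustration, not the script above: `visible_text`, `TOKEN` and `DESTINATIONS` are names I made up, the destination set is a tiny assumed subset, and `\u`, `\'xx` and most special characters are left out.

```python
import re

# Token kinds: control word, control symbol, brace, line break, plain character.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\(.)|([{}])|[\r\n]+|(.)", re.I | re.S)

# Assumption: a small illustrative subset of destination control words.
DESTINATIONS = frozenset(("fonttbl", "colortbl", "stylesheet", "info", "header", "footer"))

def visible_text(rtf):
    stack = []         # Saved "ignorable" flag of each enclosing group.
    ignorable = False  # True while inside a group that must not be rendered.
    out = []
    for m in TOKEN.finditer(rtf):
        word, _arg, symbol, brace, char = m.groups()
        if brace == "{":
            stack.append(ignorable)    # Entering a group: remember the outer state.
        elif brace == "}":
            ignorable = stack.pop()    # Leaving a group: restore the outer state.
        elif symbol:
            if symbol == "*":
                ignorable = True       # {\*...} marks an ignorable group.
            elif symbol in "{}\\" and not ignorable:
                out.append(symbol)     # Escaped brace or backslash.
        elif word:
            if word in DESTINATIONS:
                ignorable = True       # Destination group: skip its contents.
            elif word == "par" and not ignorable:
                out.append("\n")
        elif char and not ignorable:
            out.append(char)
    return "".join(out)

doc = r"{\rtf1{\fonttbl{\f0 Arial;}}{\*\generator Foo;}{\header Draft}{\b Hello} world\par}"
print(repr(visible_text(doc)))  # 'Hello world\n'
```

Because the flag is pushed on `{` and restored on `}`, everything nested inside a skipped group is skipped too, while a formatting group such as `{\b Hello}` is rendered normally.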
There are lots of features missing to make this a full parser, but it should be enough for simple documents.
UPDATED: This URL has the script updated to run on Python 3.x: