Regular Expression for extracting text from an RTF string
T

11

47

I was looking for a way to extract the text from an RTF string and I found the following regex:

({\\)(.+?)(})|(\\)(.+?)(\b)

However, the resulting string still contains two closing curly braces "}":

Before: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }

After: } can u send me info for the call pls }

Any thoughts on how to improve the regex?

Edit: A more complicated string such as this one does not work: {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }

Thrifty answered 9/10, 2008 at 18:24 Comment(1)
It looks like using the RichTextBox is Microsoft's official answer to this problem! - Athwartships
A
65

In RTF, { and } mark a group. Groups can be nested. \ marks the beginning of a control word. Control words end with either a space or a non-alphabetic character. A control word can be followed by a numeric parameter, without any delimiter in between. Some control words also take text parameters, separated by ';'. Those control words are usually in their own groups.
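As a rough illustration of the control-word rule described above (a sketch, not a full tokenizer), the following Python snippet pulls the control words and their numeric parameters out of a fragment of the question's sample. The names CONTROL_WORD and control_words are made up for this example and do not come from the RTF specification.

```python
import re

# Hypothetical helper for illustration: a control word is a backslash, a run of
# letters, an optional signed number, and one optional delimiting space.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")

def control_words(rtf):
    """Return (name, parameter) pairs; parameter is None when absent."""
    return [(name, int(num) if num else None)
            for name, num in CONTROL_WORD.findall(rtf)]

fragment = r"\pard\tx720\cf1\f0\fs20 can u send me info\par"
print(control_words(fragment))
# [('pard', None), ('tx', 720), ('cf', 1), ('f', 0), ('fs', 20), ('par', None)]
```

Note that the space after \fs20 is consumed as the delimiter, so it is not part of the text that follows.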

I think I have managed to make a pattern that takes care of most of the cases.

\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?

It leaves a few spaces when run on your sample string, though.
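A minimal way to try the pattern from Python, assuming the first sample string from the question. The variable names are only illustrative, and the result still needs a strip() because of the leftover spaces:

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# First sample string from the question, split over several raw literals.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)

stripped = PATTERN.sub("", SAMPLE)
print(repr(stripped))          # the text, preceded by leftover spaces
print(repr(stripped.strip()))  # 'can u send me info for the call pls'
```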


Going through the RTF specification (some of it), I see that there are a lot of pitfalls for pure regex-based strippers. The most obvious one is that some groups should be ignored (headers, footers, etc.), while others should be rendered (formatting).
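Another pitfall shows up with the second string from the question: the simple pattern above cannot tell an escaped backslash (\\) from the start of a control word, so the components of the registry path are treated as control words and removed. A small Python check of that behaviour, with illustrative variable names:

```python
import re

# The simple pattern from earlier in this answer, unchanged.
PATTERN = re.compile(r"\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?")

# Second sample string from the question (the one with the registry path).
SAMPLE2 = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 "
    r"HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps"
    r"\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }"
)

result = PATTERN.sub("", SAMPLE2).strip()
print(result)
# "\\SOFTWARE" is read as a literal "\" followed by the control word "\SOFTWARE",
# so the path components after HKEY_LOCAL_MACHINE are lost.
print("SOFTWARE" in result)  # False
```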

I have written a Python script that should work better than my regex above:

import re

def striprtf(text):
   pattern = re.compile(r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", re.I)
   # control words which specify a "destination".
   destinations = frozenset((
      'aftncn','aftnsep','aftnsepc','annotation','atnauthor','atndate','atnicn','atnid',
      'atnparent','atnref','atntime','atrfend','atrfstart','author','background',
      'bkmkend','bkmkstart','blipuid','buptim','category','colorschememapping',
      'colortbl','comment','company','creatim','datafield','datastore','defchp','defpap',
      'do','doccomm','docvar','dptxbxtext','ebcend','ebcstart','factoidname','falt',
      'fchars','ffdeftext','ffentrymcr','ffexitmcr','ffformat','ffhelptext','ffl',
      'ffname','ffstattext','field','file','filetbl','fldinst','fldrslt','fldtype',
      'fname','fontemb','fontfile','fonttbl','footer','footerf','footerl','footerr',
      'footnote','formfield','ftncn','ftnsep','ftnsepc','g','generator','gridtbl',
      'header','headerf','headerl','headerr','hl','hlfr','hlinkbase','hlloc','hlsrc',
      'hsv','htmltag','info','keycode','keywords','latentstyles','lchars','levelnumbers',
      'leveltext','lfolevel','linkval','list','listlevel','listname','listoverride',
      'listoverridetable','listpicture','liststylename','listtable','listtext',
      'lsdlockedexcept','macc','maccPr','mailmerge','maln','malnScr','manager','margPr',
      'mbar','mbarPr','mbaseJc','mbegChr','mborderBox','mborderBoxPr','mbox','mboxPr',
      'mchr','mcount','mctrlPr','md','mdeg','mdegHide','mden','mdiff','mdPr','me',
      'mendChr','meqArr','meqArrPr','mf','mfName','mfPr','mfunc','mfuncPr','mgroupChr',
      'mgroupChrPr','mgrow','mhideBot','mhideLeft','mhideRight','mhideTop','mhtmltag',
      'mlim','mlimloc','mlimlow','mlimlowPr','mlimupp','mlimuppPr','mm','mmaddfieldname',
      'mmath','mmathPict','mmathPr','mmaxdist','mmc','mmcJc','mmconnectstr',
      'mmconnectstrdata','mmcPr','mmcs','mmdatasource','mmheadersource','mmmailsubject',
      'mmodso','mmodsofilter','mmodsofldmpdata','mmodsomappedname','mmodsoname',
      'mmodsorecipdata','mmodsosort','mmodsosrc','mmodsotable','mmodsoudl',
      'mmodsoudldata','mmodsouniquetag','mmPr','mmquery','mmr','mnary','mnaryPr',
      'mnoBreak','mnum','mobjDist','moMath','moMathPara','moMathParaPr','mopEmu',
      'mphant','mphantPr','mplcHide','mpos','mr','mrad','mradPr','mrPr','msepChr',
      'mshow','mshp','msPre','msPrePr','msSub','msSubPr','msSubSup','msSubSupPr','msSup',
      'msSupPr','mstrikeBLTR','mstrikeH','mstrikeTLBR','mstrikeV','msub','msubHide',
      'msup','msupHide','mtransp','mtype','mvertJc','mvfmf','mvfml','mvtof','mvtol',
      'mzeroAsc','mzeroDesc','mzeroWid','nesttableprops','nextfile','nonesttables',
      'objalias','objclass','objdata','object','objname','objsect','objtime','oldcprops',
      'oldpprops','oldsprops','oldtprops','oleclsid','operator','panose','password',
      'passwordhash','pgp','pgptbl','picprop','pict','pn','pnseclvl','pntext','pntxta',
      'pntxtb','printim','private','propname','protend','protstart','protusertbl','pxe',
      'result','revtbl','revtim','rsidtbl','rxe','shp','shpgrp','shpinst',
      'shppict','shprslt','shptxt','sn','sp','staticval','stylesheet','subject','sv',
      'svb','tc','template','themedata','title','txe','ud','upr','userprops',
      'wgrffmtfilter','windowcaption','writereservation','writereservhash','xe','xform',
      'xmlattrname','xmlattrvalue','xmlclose','xmlname','xmlnstbl',
      'xmlopen',
   ))
   # Translation of some special characters.
   specialchars = {
      'par': '\n',
      'sect': '\n\n',
      'page': '\n\n',
      'line': '\n',
      'tab': '\t',
      'emdash': u'\u2014',
      'endash': u'\u2013',
      'emspace': u'\u2003',
      'enspace': u'\u2002',
      'qmspace': u'\u2005',
      'bullet': u'\u2022',
      'lquote': u'\u2018',
      'rquote': u'\u2019',
      'ldblquote': u'\u201C',
      'rdblquote': u'\u201D',
   }
   stack = []
   ignorable = False       # Whether this group (and all inside it) are "ignorable".
   ucskip = 1              # Number of ASCII characters to skip after a unicode character.
   curskip = 0             # Number of ASCII characters left to skip
   out = []                # Output buffer.
   for match in pattern.finditer(text):
      word,arg,hex,char,brace,tchar = match.groups()
      if brace:
         curskip = 0
         if brace == '{':
            # Push state
            stack.append((ucskip,ignorable))
         elif brace == '}':
            # Pop state
            ucskip,ignorable = stack.pop()
      elif char: # \x (not a letter)
         curskip = 0
         if char == '~':
            if not ignorable:
               out.append(u'\xA0')
         elif char in '{}\\':
            if not ignorable:
               out.append(char)
         elif char == '*':
            ignorable = True
      elif word: # \foo
         curskip = 0
         if word in destinations:
            ignorable = True
         elif ignorable:
            pass
         elif word in specialchars:
            out.append(specialchars[word])
         elif word == 'uc':
            ucskip = int(arg)
         elif word == 'u':
            c = int(arg)
            if c < 0: c += 0x10000
            if c > 127: out.append(unichr(c))
            else: out.append(chr(c))
            curskip = ucskip
      elif hex: # \'xx
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            c = int(hex,16)
            if c > 127: out.append(unichr(c))
            else: out.append(chr(c))
      elif tchar:
         if curskip > 0:
            curskip -= 1
         elif not ignorable:
            out.append(tchar)
   return ''.join(out)

It works by parsing the RTF code and skipping any groups that have a "destination" specified, as well as all "ignorable" groups ({\*...}). I also added handling of some special characters.

There are lots of features missing to make this a full parser, but it should be enough for simple documents.
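To see the group handling in isolation, here is a much-reduced sketch of the same idea: a stack remembers whether each enclosing group is ignorable, and text is only emitted outside ignorable groups. This is not the script above. The names TOKEN, SKIPPED and visible_text are hypothetical, the destination list is a tiny subset chosen for the question's sample, and \'xx and \u escapes are not handled. It also guards the pop against an unbalanced "}".

```python
import re

# One token per match: control word, control symbol, brace, or a run of text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|\\(.)|([{}])|([^\\{}]+)", re.I | re.S)

# Illustrative subset only; a real stripper needs a far longer list.
SKIPPED = frozenset(["fonttbl", "colortbl", "generator"])

def visible_text(rtf):
    stack = []          # saved "ignorable" flags of the enclosing groups
    ignorable = False   # True while inside a group that must not be rendered
    out = []
    for word, _arg, symbol, brace, text in TOKEN.findall(rtf):
        if brace == "{":
            stack.append(ignorable)                      # push state
        elif brace == "}":
            ignorable = stack.pop() if stack else False  # pop, tolerate stray "}"
        elif symbol == "*" or word in SKIPPED:
            ignorable = True                             # skip rest of this group
        elif ignorable:
            continue
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)                           # escaped literal character
        elif word == "par":
            out.append("\n")
        elif text:
            out.append(text.replace("\r", "").replace("\n", ""))
    return "".join(out)

# Second sample string from the question (the one with the registry path).
SAMPLE2 = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 "
    r"HKEY_LOCAL_MACHINE\\SOFTWARE\\Microsoft\\test\\myapp\\Apps"
    r"\\\{3423234-283B-43d2-BCE6-A324B84CC70E\}\par }"
)

print(visible_text(SAMPLE2).strip())
```

Because escaped characters are tokenized before control words are looked up, the registry path and the braces around the GUID survive here.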

UPDATE: A version of this script updated to run on Python 3.x is available here:

https://gist.github.com/gilsondev/7c1d2d753ddb522e7bc22511cfb08676

Adin answered 9/10, 2008 at 19:42 Comment(16)
Nice answer. \~ is the non-breaking space, so shouldn't char == '~' append u'\u00a0'? - Corelation
Cool. Also I think (but I'm not sure) that you can change the last group to [^{}\] and just eat all the text at once instead of one char at a time. I think that once the parser is reading text, it's safe to consume everything up to the next RTF metacharacter, i.e., curly braces or backslash. - Corelation
It depends on which destination is specified. It is probably possible to optimize it in some places, but it is easy to miss some edge-case. It is often better to make it more robust. - Adin
Another note to implementers: backslash followed by a carriage return or line feed is in fact a control word that should be handled equivalently to \par. So the first group might be: ([a-z]{1,32}|[\r\n])... I've been spending way too much time in the RTF spec... why couldn't Microsoft have used XML‽ - Corelation
Oh okay. Do you know an example of a destination where that would not work? I'm just curious as I'm not familiar with it. Thanks. - Corelation
Actually looking over your regular expression, the alternatives fall in only three categories: 1) begins with backslash, 2) is a brace, or 3) is one or more linebreak characters. I think this implies that the last capture could safely be changed to [^{}\\\r\n], since those characters are the only ones that could possibly cause the regex to match before reaching the last alternative. - Corelation
I just noticed that your fourth capture ("char") would grab the escaped linebreaks that should be treated as a \par. A very good answer, thanks again! - Corelation
Posted C#/.Net translation here: chrisbenard.net/2014/08/20/Extract-Text-from-RTF-in-.Net - Mahaffey
Thanks for this script. I noticed that on slightly deformed RTF, this script will not extract all text, example in a gist. OS X TextEdit had problems too, only unrtf seemed to work on this input, which is from an actual (probably broken ;) program. - Hentrich
I have modified @Chris Benard's C# translation, added non-unicode encoding support by font table processing and done some performance improvements like using HashSet<T> instead of List<T>: nthdeveloper/RichTextStripper - Royden
@GilsonFilho I tried to use the script above and for each line the first character is getting removed as well. When I open the rtf file in notepad it's those lines that start with \par. How could I resolve this issue? Thanks! - Fireplace
This worked well for my needs, though I did have to add a try to avoid some pop exceptions. # Pop state try: ucskip,ignorable = stack.pop() except: continue - Remontant
How does the python3 example work? I've copied the script to my computer. Now what? - Biocellate
with open('data.rtf', 'r') as file: rtf = file.read() text = striprtf(rtf) print(text) - Adin
Thank you! For my implementation in JavaScript, I added a specialchar for 'cell', otherwise there was no separation between the cells in my document. - Galatia
I also removed 'listtext' from the destinations list, otherwise there was no numbering in lists. - Galatia
H
7

I've used this before and it worked for me:

\\\w+|\{.*?\}|}

You will probably want to trim the ends of the result to get rid of the extra spaces left over.

Hoad answered 9/10, 2008 at 19:8 Comment(0)
L
7

So far, we haven't found a good answer to this either, other than using a RichTextBox control:

    /// <summary>
    /// Strip RichTextFormat from the string
    /// </summary>
    /// <param name="rtfString">The string to strip RTF from</param>
    /// <returns>The string without RTF</returns>
    public static string StripRTF(string rtfString)
    {
        string result = rtfString;

        try
        {
            if (IsRichText(rtfString))
            {
                // Put body into a RichTextBox so we can strip RTF
                using (System.Windows.Forms.RichTextBox rtfTemp = new System.Windows.Forms.RichTextBox())
                {
                    rtfTemp.Rtf = rtfString;
                    result = rtfTemp.Text;
                }
            }
            else
            {
                result = rtfString;
            }
        }
        catch
        {
            throw;
        }

        return result;
    }

    /// <summary>
    /// Checks testString for RichTextFormat
    /// </summary>
    /// <param name="testString">The string to check</param>
    /// <returns>True if testString is in RichTextFormat</returns>
    public static bool IsRichText(string testString)
    {
        if ((testString != null) &&
            (testString.Trim().StartsWith("{\\rtf")))
        {
            return true;
        }
        else
        {
            return false;
        }
    }

Edit: Added IsRichText method.

Lifesaving answered 26/8, 2010 at 16:17 Comment(0)
A
5

I made this helper function to do this in JavaScript. So far this has worked well for simple RTF formatting removal for me.

function stripRtf(str){
    var basicRtfPattern = /\{\*?\\[^{}]+;}|[{}]|\\[A-Za-z]+\n?(?:-?\d+)?[ ]?/g;
    var newLineSlashesPattern = /\\\n/g;
    var ctrlCharPattern = /\n\\f[0-9]\s/g;

    //Remove RTF Formatting, replace RTF new lines with real line breaks, and remove whitespace
    return str
        .replace(ctrlCharPattern, "")
        .replace(basicRtfPattern, "")
        .replace(newLineSlashesPattern, "\n")
        .trim();
}

Of Note:

  • I slightly modified the regex written by @Markus Jarderot above. It now removes slashes at the end of new lines in two steps to avoid a more complex regex.
  • .trim() is only supported in newer browsers. If you need to support older ones, see: Trim string in JavaScript?

EDIT: I've updated the regex to work around some issues I've found since posting this originally. I'm using this in a project, see it in context here: https://github.com/chrismbarr/LyricConverter/blob/865f17613ee8f43fbeedeba900009051c0aa2826/scripts/parser.js#L26-L37

Angularity answered 16/1, 2013 at 4:2 Comment(0)
M
4

Regex will never solve this problem 100%; you need a parser. Check this implementation on CodeProject (it's in C# though): http://www.codeproject.com/Articles/27431/Writing-Your-Own-RTF-Converter

Mag answered 15/3, 2012 at 10:0 Comment(0)
D
2

According to RegexPal, the two }'s left over are the second } of the "}}" that closes the \fonttbl group and the final } at the end of the string:

{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} {\colortbl ;\red0\green0\blue0;} {\*\generator Msftedit 5.41.15.1507;}\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }

I was able to fix the first curly brace by adding a plus sign to the regex:

({\\)(.+?)(}+)|(\\)(.+?)(\b)
            ^
     plus sign added here

And to fix the curly brace at the end, I did this:

({\\)(.+?)(})|(\\)(.+?)(\b)|}$
                            ^
         this checks if there is a curly brace at the end

I don't know the RTF format very well so this might not work in all cases, but it works on your example...
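A quick way to check both changes together is to run them from Python on the question's first sample. This is only a sketch: the two fixes are merged into one pattern, the braces are escaped for readability (which does not change what they match), and the variable names are made up for the example.

```python
import re

# First sample string from the question, split over several raw literals.
SAMPLE = (
    r"{\rtf1\ansi\ansicpg1252\deff0\deflang1033"
    r"{\fonttbl{\f0\fnil\fcharset0 MS Shell Dlg 2;}{\f1\fnil MS Shell Dlg 2;}} "
    r"{\colortbl ;\red0\green0\blue0;} "
    r"{\*\generator Msftedit 5.41.15.1507;}"
    r"\viewkind4\uc1\pard\tx720\cf1\f0\fs20 can u send me info for the call pls\f1\par }"
)

# Regex from the question.
ORIGINAL = re.compile(r"(\{\\)(.+?)(\})|(\\)(.+?)(\b)")
# Both fixes from this answer applied at once: "}+" and the trailing "|}$".
COMBINED = re.compile(r"(\{\\)(.+?)(\}+)|(\\)(.+?)(\b)|\}$")

before = ORIGINAL.sub("", SAMPLE)
after = COMBINED.sub("", SAMPLE).strip()

print(before.count("}"))  # 2
print(after)              # can u send me info for the call pls
```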

Dayna answered 9/10, 2008 at 18:53 Comment(0)
C
2

Late contributor but the regex below helped us with the RTF code we found in our DB (we're using it within an RDL via SSRS).

This expression removed it for our team. Although it may just resolve our specific RTF, it may be a helpful base for someone. Also, this website is incredibly handy for live testing:

http://regexpal.com/

{\*?\\.+(;})|\s?\\[A-Za-z0-9]+|\s?{\s?\\[A-Za-z0-9]+\s?|\s?}\s?

Hope this helps, K

Constitutionality answered 24/2, 2014 at 11:54 Comment(0)
T
1

None of the answers were sufficient, so my solution was to use the RichTextBox control (yes, even in a non-WinForms app) to extract text from RTF.

Thrifty answered 9/4, 2009 at 21:10 Comment(0)
R
1

The following solution allows you to extract text from an RTF string:

FareRule = Encoding.ASCII.GetString(FareRuleInfoRS.Data);
System.Windows.Forms.RichTextBox rtf = new System.Windows.Forms.RichTextBox();
rtf.Rtf = FareRule;
FareRule = rtf.Text;
Ragi answered 9/2, 2010 at 13:44 Comment(1)
Note: This will not work with partial RTF strings. The RichTextBox control requires complete, well-formed RTF input. - Penley
M
1

Here's an Oracle SQL statement that can strip RTF from an Oracle field:

SELECT REGEXP_REPLACE(
    REGEXP_REPLACE(
        CONTENT,
        '\\(fcharset|colortbl)[^;]+;', ''
    ),
    '(\\[^ ]+ ?)|[{}]', ''
) TEXT
FROM EXAMPLE WHERE CONTENT LIKE '{\rtf%';

This is designed for data from Windows rich text controls, not RTF files. Limitations are:

  • \{ and \} are not replaced with { and }
  • Headers and footers are not handled specially
  • Images and other embedded objects are not handled specially (no idea what will happen if one of these is encountered!)

It works by first removing the \fcharset and \colortbl tags, which are special because data follows them until ; is reached. Then it removes all the \xxx tags (including a single optional trailing space), followed by all the { and } characters. This handles most simple RTF such as what you get from the rich text control.

Merla answered 3/3, 2017 at 6:54 Comment(0)
G
1

If anyone is still looking for a solution, here it is: https://pypi.org/project/striprtf/

Graber answered 5/12, 2022 at 18:41 Comment(1)
Wow, thanks bro. I was trying pypi.org/project/rtfparse but it needs a file path, whereas the one you gave doesn't, and it is 3 lines of code. - Archive
