Is there a Python module for converting RTF to plain text? [closed]
O

10

40

Ideally, I'd like a module or library that doesn't require superuser access to install; I have limited privileges in my working environment.

Opsonin answered 26/8, 2009 at 20:56 Comment(2)
You can install Python packages with easy_install and the --user option without permissions.Pidgin
This answer is the best... it works for me like a charm!!Pyles
M
7

OpenOffice has an RTF reader. You can use Python to script OpenOffice; see here for more info.

You could probably try using the magic com-object on Windows to read anything that smells ms-binary. I wouldn't recommend that though.

Actually parsing the raw data probably won't be very hard; see this example written in .bat/QBasic.

DocFrac is a free open source converter between RTF, HTML and text. Windows, Linux, ActiveX and DLL platforms available. It will probably be pretty easy to wrap it up in Python.

RTF::TEXT::Converter - Perl extension for converting RTF into text (in case You have problems with DocFrac).

Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.

Good luck (with the limited privileges in Your working environment).

Meliorate answered 26/8, 2009 at 22:10 Comment(6)
Thanks. I opened the document in OpenOffice and saved it as a plain text file. This was probably the simplest approach. And thanks for reminding me that it's My work environment. I asked for sudo access.Opsonin
The link to RTF::TEXT::Converter is broken. So is the link to the discussion on the python mailing list. That is why link-answers are discouraged...Tantalize
thanks for pointing it out, I fixed one of the links. Sadly the other one had to be deleted.Carve
DocFrac still works, but does not support pt-br special chars.Parasol
Microsoft's RTF specification now lives at: download.microsoft.com/download/5/d/d/…Diastrophism
@JulianMehnle that appears to be just extensions, not the whole spec. That's at interoperability.blob.core.windows.net/files/Archive_References/…Condescending
V
51

I've been working on a library called Pyth, which can do this:

http://pypi.python.org/pypi/pyth/

Converting an RTF file to plaintext looks something like this:

from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter

doc = Rtf15Reader.read(open('sample.rtf'))

print PlaintextWriter.write(doc).getvalue()

Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for latex and pdf output. Its RTF support is pretty robust -- we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.

Vercelli answered 30/11, 2009 at 18:7 Comment(3)
Shame it's not Python 3 compatible ;-(Boylston
@Epoc, there is some work towards making it compatible with Python 3. I have a fork in my repo that you can install with pip install git+https://github.com/robertour/pyth@pyth-py3. You can see some of the discussion here.Noah
In 2022, pyth still is only available for Python 2, and has not seen a release since 2014Lamia
F
3

Have you checked out pyrtf-ng?

Update: The parsing functionality is available if you do a Subversion checkout, but I'm not sure how full-featured it is. (Look in the rtfng.parser.base module.)

Featherbedding answered 26/8, 2009 at 21:1 Comment(0)
G
3

If you are on a Mac, you can convert an RTF file file.rtf to TXT from the CLI like this:

textutil -convert txt file.rtf
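
If you want to call this from Python, a minimal sketch could wrap the same command with subprocess. The helper names textutil_command and rtf_to_text_macos are made up for illustration. It assumes textutil writes its result next to the input as file.txt and that the output is UTF-8, so check both on your system.

```python
import subprocess
from pathlib import Path


def textutil_command(rtf_path):
    """Build the same command line shown above for a given RTF file."""
    return ["textutil", "-convert", "txt", str(rtf_path)]


def rtf_to_text_macos(rtf_path):
    """Convert an RTF file with macOS textutil and return the text.

    Assumption: textutil creates <name>.txt beside the input file and
    encodes it as UTF-8.
    """
    rtf_path = Path(rtf_path)
    # check=True raises CalledProcessError if textutil reports a failure.
    subprocess.run(textutil_command(rtf_path), check=True)
    return rtf_path.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling rtf_to_text_macos("file.rtf") would leave file.txt on disk as a side effect, just like the plain command does.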
Gao answered 3/8, 2019 at 18:32 Comment(0)
V
2

Here's a link to a script that converts rtf to text using regex: Regular Expression for extracting text from an RTF string

Also, an updated link on GitHub: Github link
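
To give an idea of what a regex-based approach looks like, here is a rough sketch. It is not the linked script, just an illustration in the same spirit. The names TOKEN, SKIPPED, SPECIAL and strip_rtf are my own, the list of skipped destinations is partial, the default cp1252 code page is an assumption (the real one should come from the RTF header), and the fallback after \uN is assumed to be one character long.

```python
import re

# One alternative per kind of RTF token; the order matters.
TOKEN = re.compile(
    r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?"  # groups 1, 2: control word, optional number
    r"|\\'([0-9a-f]{2})"                 # group 3: hex-escaped byte, e.g. \'e9
    r"|\\([^a-z])"                       # group 4: control symbol, e.g. \* or \{
    r"|([{}])"                           # group 5: group open/close
    r"|[\r\n]+"                          # raw line breaks are not text in RTF
    r"|(.)",                             # group 6: a literal character
    re.IGNORECASE | re.DOTALL,
)

# Destinations whose contents are not document text (partial, assumed list).
SKIPPED = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}
# Control words that stand for a single character.
SPECIAL = {"par": "\n", "line": "\n", "tab": "\t"}


def strip_rtf(rtf, encoding="cp1252"):
    out = []
    stack = []        # saved "ignoring" flag for each open group
    ignoring = False  # True inside a destination we want to drop
    skip = 0          # fallback characters still to skip after \uN
    for match in TOKEN.finditer(rtf):
        word, arg, hexbyte, symbol, brace, char = match.groups()
        if brace == "{":
            stack.append(ignoring)
        elif brace == "}":
            ignoring = stack.pop() if stack else False
        elif symbol is not None:
            if symbol == "*":          # \* marks an optional destination
                ignoring = True
            elif symbol in "{}\\" and not ignoring:
                out.append(symbol)     # escaped literal brace or backslash
        elif word is not None:
            if word in SKIPPED:
                ignoring = True
            elif ignoring:
                continue
            elif word in SPECIAL:
                out.append(SPECIAL[word])
            elif word == "u" and arg is not None:
                code = int(arg)        # negative values wrap around 16 bits
                out.append(chr(code + 0x10000 if code < 0 else code))
                skip = 1               # assumes the default fallback length
        elif hexbyte is not None:
            if skip:
                skip -= 1
            elif not ignoring:
                out.append(bytes.fromhex(hexbyte).decode(encoding, "replace"))
        elif char is not None:
            if skip:
                skip -= 1
            elif not ignoring:
                out.append(char)
    return "".join(out)
```

This does not combine surrogate pairs or read \ansicpg from the header, so treat it as a starting point rather than a full RTF reader.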

Vender answered 28/6, 2016 at 20:57 Comment(0)
S
1

There is a good library, pyrtf-ng, for all-purpose RTF handling.

Sanson answered 26/8, 2009 at 21:1 Comment(2)
Thanks, but the problem with pyrtf-ng is that it's useful for generating RTF files, not parsing them. I downloaded it from its SourceForge page (there is nothing under the Download tab at Google Code), and this is the only functionality I could find.Opsonin
@tony, have you looked at code.google.com/p/pyrtf-ng/source/browse/#svn/trunk/rtfng/… ? When there are no downloads yet on a Google Code hosted project, browse the sources!-)Giffer
K
1

PyRTF-ng 0.9.1 could not parse either of my RTF documents; both failed with a ParsingException. The first document was generated with OpenOffice 3.4, the second with Mac TextEdit.

Pyth 0.5.6 parsed both documents without problems, but did not process Cyrillic characters properly.

But each editor opens the other editor's document correctly and without trouble, so all of these libraries seem to have weak RTF support.

So I'm writing my own parser with blackjack and hookers.

(I've uploaded both files, so you can check RTF libraries by yourself: http://yadi.sk/d/RMHawVdSD8O9 http://yadi.sk/d/RmUaSe5tD8OD)

Kane answered 15/8, 2012 at 8:22 Comment(1)
links dead, do you still have them?Cyano
L
1

I just came across pyrtflib - there's not much (any) documentation on it, it's kinda a case of installing it and then using the inbuilt help() function to find out what's available and what everything does.

Having said that, in my little trial run of its rtf.Rtf2Html.getHtml() function it went well enough. I haven't tried the Rtf2Txt function, but given the simpler nature of converting RTF to plaintext I'd expect it to do fine.

Levesque answered 24/4, 2015 at 8:24 Comment(1)
Have since given the Rtf2Txt.getText() function a go and it worked fine - my use of it was not an exhaustive edge-case torture test by any means but all the cases that I did test resulted in it giving me the expected outputLevesque
H
-2

I ran into the same thing and tried to code it myself. It's not that easy, but here is what I had when I decided to go for a command-line app instead. It's Ruby, but you can adapt it to Python very easily. There is some header garbage to clean up, but you can see more or less the idea.

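f = File.open('r.rtf', 'r')
b = 0      # current brace nesting depth
p = false  # true while inside a control word (after a backslash)
str = ''
begin
  while (char = f.readchar)
    if char.chr == '{'
      b += 1
      next
    end
    if char.chr == '}'
      b -= 1
      next
    end
    if char.chr == '\\'
      p = true
      next
    end
    # whitespace ends a control word (double quotes so "\n" etc. are real escapes)
    if p == true && (char.chr == ' ' or char.chr == "\n" or char.chr == "\t" or char.chr == "\r")
      p = false
      next
    end
    if p == true && (char.chr == '\'')
      # this is the source of my headaches. you need to read the code page from the header and encode this.
      p = false
      str << '#'
      next
    end
    next if b > 2  # skip deeply nested groups
    next if p      # skip the characters of the control word itself
    str << char.chr
  end
rescue EOFError
end
f.close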
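
For what it's worth, a direct Python adaptation of the loop above might look like the sketch below. The function name rtf_to_text_naive and the max_depth parameter are mine, not from any library. It keeps the same limits as the Ruby version: header garbage still leaks through, a control word is only ended by whitespace, and \'xx escapes are replaced by a '#' followed by the two hex digits instead of being decoded.

```python
def rtf_to_text_naive(rtf, max_depth=2):
    """Very rough RTF-to-text pass mirroring the Ruby loop above."""
    depth = 0           # current brace nesting depth
    in_control = False  # True while inside a control word
    out = []
    for ch in rtf:
        if ch == "{":
            depth += 1
            continue
        if ch == "}":
            depth -= 1
            continue
        if ch == "\\":
            in_control = True
            continue
        if in_control and ch in " \n\t\r":
            in_control = False  # whitespace ends the control word
            continue
        if in_control and ch == "'":
            # Hex escape: decoding needs the code page from the header,
            # so emit a placeholder as the Ruby version does.
            in_control = False
            out.append("#")
            continue
        if depth > max_depth or in_control:
            continue
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    with open("r.rtf", encoding="ascii", errors="replace") as handle:
        print(rtf_to_text_naive(handle.read()))
```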
Happiness answered 15/10, 2009 at 17:22 Comment(1)
pascal and python... In the SAME code!Preform
O
-2

Conversely, if you want to write RTFs easily from Python, you can use the third-party module rtflib. It's a fairly new and incomplete module but still very powerful and useful. Below is an example that writes "hello world" in rich text to an RTF called helloworld.rtf. This is a very primitive example, and the module can also be used to add colors, italics, tables, and many other aspects of rich text to RTF files.

from rtflib import *
file = RTF("helloworld.rtf")
file.startfile()
file.addstrict()
file.addtext("hello world")
file.writeout()
Oistrakh answered 15/6, 2011 at 5:55 Comment(0)
