Parsing RTF Documents with Java/JavaCC

Is anybody familiar with the RTF document format and with parsing it using any Java libraries? The standard way people have done this is by using the RTFEditorKit in the JDK Swing API:

Swing RTFEditorKit API

but it isn't that accurate when it comes to parsing RTF documents. In fact there's a comment in the API:

The RTF support was not written by the Swing team. In the future we hope to improve the support provided.

I don't think I'm going to wait for this to happen :)
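
For reference, a minimal sketch of the RTFEditorKit approach mentioned above: it reads an RTF stream into a Swing `Document` and pulls out the plain text. The class name `SwingRtfText` and the sample RTF string are made up for illustration, and how faithfully the text comes out depends on the limitations of the Swing RTF support noted above.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class SwingRtfText {

    // Reads RTF from the stream into a Swing Document and returns its plain text.
    static String extractText(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0);                      // parse starting at document offset 0
        return doc.getText(0, doc.getLength());    // formatting is dropped, text is kept
    }

    public static void main(String[] args) throws Exception {
        // Small sample document, made up for this example.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}\\f0\\pard Hello, RTF!\\par}";
        try (InputStream in = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.US_ASCII))) {
            System.out.println(extractText(in).trim());
        }
    }
}
```

Running a set of sample files through a helper like this and comparing the extracted text with what a word processor shows is one quick way to see where the Swing parser falls short.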

The other approach taken is to define a grammar using JavaCC and generate a parser. This works better, but I'm having trouble finding a complete grammar. I've tried:

PMD Applied JavaCC Grammar

which is OK, and the following (which is the best so far):

Koders RTFParserDelegate and ETranslate Grammar

There are various implementations of the ETranslate grammar about (I know the Nutch API may use this). Does anybody know which is the most accurate grammar or whether there is a better approach to this?

I could start ploughing through the JavaCC docs to understand the .jj files and test the generated parser against the RTF files. This is my current approach, but it's taking a while. Any help would be appreciated.

Clydesdale answered 12/5, 2009 at 18:55 Comment(5)
Can't answer your actual question, but it seems like a better validation approach (rather than working through the grammar) is to create test files and verify that they're properly parsed. However, as I recall, RTF parsers are permitted to ignore any constructs they don't understand, allowing for backwards compatibility. - Prow
The ETranslate parser actually does very well at extracting RTF documents (99% of the set I have), but it's unsupported and not available from a central source. I will try to get this up on Google Code somewhere. I'm not sure about licenses; I believe the grammar just needs some bug fixing. - Clydesdale
Did you make any progress with this? - Gamone
Ended up using the basic Swing RTF Editor and falling back to pmdapplied.com/RTFParser.jj. If you have the time, I suggest taking that and modifying the parsing logic. - Clydesdale
FWIW I've added a copy of the etranslate parser here: github.com/tmyroadctfig/com.etranslate.tm.processing.rtf - Callow

Does anybody know which is the most accurate grammar or whether there is a better approach to this?

Many years ago I spent some time reading RTF (Wikipedia) with C#. I say reading because, if you understand RTF in detail and use it the way it was designed, you will realize that RTF is not meant to be read and parsed as a whole over and over again when editing. The documentation gives the syntax for RTF, but don't be misled into believing that you should use a lexer/parser. The documentation also includes a sample reader for RTF.

Remember that RTF was created ages ago, when memory was measured in KB and not MB, and editing long documents of several hundred pages in a conventional way would tax system resources. So RTF has the ability to be edited in smaller subsections without loading or modifying the entire document. This is what gives it the ability to work on such large documents with limited memory. It is also why the syntax may seem odd at first.
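
To illustrate what a hand-written reader (as opposed to a generated lexer/parser) can look like, here is a rough Java sketch. It is not the sample reader from the RTF documentation, just a small state machine in the same spirit: it walks the input once, tracks groups with a stack, and keeps only body text. The class name `TinyRtfReader`, the short list of skipped destinations, and the treatment of `\'hh` escapes as Latin-1 code points are all assumptions made for the example. Unknown control words are simply ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Hypothetical single-pass reader that keeps body text and drops formatting.
class TinyRtfReader {

    // Destinations whose contents are not body text (a deliberately incomplete list).
    private static final Set<String> SKIPPED =
            Set.of("fonttbl", "colortbl", "stylesheet", "info", "pict");

    private static boolean isAsciiLetter(char ch) {
        return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
    }

    static String toPlainText(String rtf) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> saved = new ArrayDeque<>(); // skip flag saved at each '{'
        boolean skip = false;
        int i = 0;
        int n = rtf.length();
        while (i < n) {
            char c = rtf.charAt(i++);
            if (c == '{') {
                saved.push(skip);                  // entering a group
            } else if (c == '}') {
                skip = !saved.isEmpty() && saved.pop(); // leaving a group restores state
            } else if (c == '\\' && i < n) {
                char next = rtf.charAt(i);
                if (isAsciiLetter(next)) {
                    // Control word: letters, optional signed number, optional space.
                    int start = i;
                    while (i < n && isAsciiLetter(rtf.charAt(i))) i++;
                    String word = rtf.substring(start, i);
                    if (i < n && (rtf.charAt(i) == '-' || Character.isDigit(rtf.charAt(i)))) {
                        i++;
                        while (i < n && Character.isDigit(rtf.charAt(i))) i++;
                    }
                    if (i < n && rtf.charAt(i) == ' ') i++;
                    if (SKIPPED.contains(word)) {
                        skip = true;
                    } else if (!skip && (word.equals("par") || word.equals("line"))) {
                        out.append('\n');
                    } else if (!skip && word.equals("tab")) {
                        out.append('\t');
                    }
                } else if (next == '\'' && i + 2 < n) {
                    // Hex escape \'hh, assumed here to be a Latin-1 code point.
                    int code = Integer.parseInt(rtf.substring(i + 1, i + 3), 16);
                    if (!skip) out.append((char) code);
                    i += 3;
                } else {
                    // Control symbol: \* marks an optional destination, \\ \{ \} are literals.
                    i++;
                    if (next == '*') {
                        skip = true;
                    } else if (!skip && (next == '\\' || next == '{' || next == '}')) {
                        out.append(next);
                    }
                }
            } else if (c != '\r' && c != '\n' && !skip) {
                out.append(c);                     // raw line breaks carry no meaning in RTF
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}Hello {\\b bold}\\par caf\\'e9}";
        System.out.println(toPlainText(rtf));
    }
}
```

A real reader would also need code page handling, `\u` Unicode escapes, `\bin` data, and many more destinations, so treat this only as a starting point for experimenting with test files.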

Eudemon answered 11/3, 2013 at 12:59 Comment(0)

Presumably, the source of OpenOffice contains what you're looking for.

Antiphonal answered 13/5, 2009 at 11:46 Comment(1)
I've already looked at OpenOffice and at submitting documents to it with JODExtractor. It's a good way of parsing the documents, but a rather heavyweight solution, since you need a server with X libraries installed, etc. I haven't ruled it out yet and am still investigating, but I'm looking at more "lightweight" solutions. - Clydesdale
