How can you convert RTF text to Markdown-syntaxed plain text in Cocoa?

I need to be able to convert RTF or HTML to Markdown-syntaxed plain text for uploading to my server. I need to achieve this in Cocoa/Obj-C 2.0. Does anyone know how to do this?

Thanks so much —» Alex.


Edited Thu 4:53 PM

Umm. In answer to Yuji's comment, I'm trying to make an NSStatusItem droplet that accepts text. It doesn't matter what format the text is in, but I need to be able to format it either as plain text or plain text formatted with Markdown. I guess since I don't know what kind of text I'll be receiving...

Mayorga answered 20/5, 2010 at 18:28 Comment(1)
How much fidelity do you need? HTML and RTF have more features than a Markdown document... Kisung

Oooph, this is going to be tricky. As Yuji said, you can express a lot more in HTML/RTF than in markdown. That being the case...

I'd convert the content into an NSAttributedString. You can easily construct an NSAttributedString from RTF data (-initWithRTF:documentAttributes:); HTML is harder to import faithfully, although AppKit also offers -initWithHTML:documentAttributes:. Once you have the attributed string, it's a matter of inspecting the attributes on each run of text and applying the equivalent Markdown to a plain-text version of the content.
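
A minimal sketch of that idea in Swift, assuming a macOS target with AppKit (the same NSAttributedString calls are available from Obj-C). The function names markdown(fromRTF:) and markdown(from:) are made up for illustration, and only bold and italic font traits are mapped; links, headings, lists and everything else are dropped, so treat it as a starting point rather than a complete converter.

```swift
import AppKit

/// Walks the attribute runs of an attributed string and wraps bold/italic
/// runs in Markdown emphasis markers. All other attributes are ignored.
func markdown(from attributed: NSAttributedString) -> String {
    var output = ""
    let fullRange = NSRange(location: 0, length: attributed.length)

    attributed.enumerateAttributes(in: fullRange, options: []) { attributes, range, _ in
        let text = attributed.attributedSubstring(from: range).string
        let traits = (attributes[.font] as? NSFont)?.fontDescriptor.symbolicTraits ?? []

        // "**" for bold, "*" for italic, "***" for both (symmetric on each side).
        var marker = ""
        if traits.contains(.bold) { marker += "**" }
        if traits.contains(.italic) { marker += "*" }

        // Keep surrounding whitespace outside the markers, since "**bold **"
        // is not treated as emphasis by Markdown parsers.
        let leading = text.prefix(while: { $0.isWhitespace })
        let body = text.dropFirst(leading.count)
        let trailingCount = body.reversed().prefix(while: { $0.isWhitespace }).count
        let core = body.dropLast(trailingCount)
        let trailing = body.suffix(trailingCount)

        if marker.isEmpty || core.isEmpty {
            output += text
        } else {
            output += "\(leading)\(marker)\(core)\(marker)\(trailing)"
        }
    }
    return output
}

/// Hypothetical convenience wrapper: parses RTF data first, then converts.
/// Returns nil if the data cannot be read as RTF.
func markdown(fromRTF data: Data) -> String? {
    guard let attributed = NSAttributedString(rtf: data, documentAttributes: nil) else {
        return nil
    }
    return markdown(from: attributed)
}
```

For a droplet, the RTF data could come straight from the pasteboard or a dropped file and be passed to markdown(fromRTF:). A bold run that spans several paragraphs gets a single pair of markers here, which most Markdown renderers will not honour, so a fuller version would split runs at paragraph breaks.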

Researching a bit more:

  • Markdownify - converts HTML to Markdown in PHP
  • Pandoc - converts Markdown (and some other formats) to other rich text formats. It supports Markdown => RTF, so you could perhaps use that as a reference for writing the inverse conversion.
Jaconet answered 20/5, 2010 at 18:47 Comment(1)
Although this is correct, I decided to just stick with plain text. Thanks anyway! :) Mayorga

Here are the formats pandoc parses and writes:

> pandoc --help
pandoc [OPTIONS] [FILES]

Input formats:  native, markdown, markdown+lhs, rst, rst+lhs, html, 
latex, latex+lhs

Output formats:  native, html, html+lhs, s5, docbook, opendocument, odt, latex, 
latex+lhs, context, texinfo, man, markdown, markdown+lhs, plain, rst, rst+lhs, 
mediawiki, rtf

Unfortunately RTF isn't one of the formats it parses. Pandoc is a Haskell program, so it isn't convenient to get without installing the Haskell Platform. From a parsed document, it can write a sort of 'plain' sub-Markdown, standard Markdown, or its own enriched Markdown, as well as a pile of other formats. The internal ('native') representation is much richer than the standard Markdown spec requires, so less information will be lost, and you will be able to recover the HTML for your Markdown, or make a PDF via LaTeX, etc. It is fairly easy to hack on for special purposes.

I don't know whether any of them are stable, but there is a growing number of bindings to the Pandoc libraries from other languages. A search of GitHub suggests that the most relevant one for hooking up with Obj-C is the plain C libpandoc. Ruby seems to have the most activity (I guess because it's GitHub), with pandoku, pandoc-ruby, rails-pandoc and so forth.
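
Short of a binding, a Cocoa app could also just shell out to the pandoc executable for the HTML case. The following is a rough Swift sketch under a few assumptions: pandoc is installed and on the PATH, the app is allowed to launch subprocesses (a sandboxed app generally is not), and the HTML has already been written to a file. Both function names are hypothetical; --from and --to are pandoc's long forms of -f and -t.

```swift
import Foundation

/// Builds the command line for an HTML -> Markdown conversion.
/// The first element is the program name, resolved through /usr/bin/env.
func pandocArguments(forHTMLFileAt path: String) -> [String] {
    return ["pandoc", "--from", "html", "--to", "markdown", path]
}

/// Runs pandoc on an HTML file and returns whatever Markdown it prints to
/// standard output, or nil if pandoc exits with a non-zero status.
func markdownViaPandoc(htmlFileAt path: String) throws -> String? {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = pandocArguments(forHTMLFileAt: path)

    let stdout = Pipe()
    process.standardOutput = stdout

    try process.run()
    // Read before waiting so a large output cannot fill the pipe and stall pandoc.
    let data = stdout.fileHandleForReading.readDataToEndOfFile()
    process.waitUntilExit()

    guard process.terminationStatus == 0 else { return nil }
    return String(data: data, encoding: .utf8)
}
```

This only covers HTML input, since RTF is not in the list of formats pandoc parses above.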

Yardley answered 21/5, 2010 at 3:11 Comment(0)

There's an online form that does just this: MarkItDown

Tallulah answered 21/10, 2013 at 7:55 Comment(1)
Thanks, this did the trick for me where my first couple of runs through Pandoc failed, leaving a lot of extra junk from the Microsoft Office source file in the text. Chlorite
