What's the easiest way to convert an SO data dump from HTML back to Markdown?

2

6

I've just got my hands on a Stack Overflow data dump, and I'm disappointed to see that the Body field of the posts is in HTML rather than Markdown. I suspect the original database stores Markdown, because that's what I see when I edit an answer.

I want to recover Markdown from a large set of answers. I will be processing hundreds of entries in batch mode, using either command-line tools or some kind of Lua or C library, so an interactive tool like the wmd Markdown editor is not suitable. What tools are available to help me recover Markdown from a Stack Overflow data dump?


(Related question, not a duplicate: Convert HTML back to Markdown within wmd.)

Whitman answered 20/8, 2009 at 17:23 Comment(0)
5

Markdownify converts HTML to Markdown.

See Also: MetaSO / Can Markdown be recovered from the SO data dump?

Gandhi answered 20/8, 2009 at 17:26 Comment(2)
When it comes to using PHP on the command line, I am a troglodyte. I can't figure out from the manual whether there is a library function to read the entire contents of a file. Is dio_read(STDIN) on the right track? - Whitman
If you want to read the contents of a file, there are many ways; a simple function that does it is file_get_contents(). - Gandhi
2

Take a look at pandoc: http://johnmacfarlane.net/pandoc/

There is an html2markdown tool included with pandoc that works pretty well, and because the program runs from the command line, batch conversion is straightforward.

Here is the man page: http://johnmacfarlane.net/pandoc/html2markdown.1.html
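
For batch use, a rough sketch of how the conversion might be driven from a script. It is only an illustration under some assumptions: that the dump is an XML file whose posts are `<row>` elements with `Id` and `Body` attributes, and that the installed pandoc reads HTML directly via `--from html --to markdown` (rather than through the html2markdown wrapper). The function names `iter_bodies`, `pandoc_command`, and `html_to_markdown` are made up for this example.

```python
import subprocess
import xml.etree.ElementTree as ET


def iter_bodies(xml_source):
    """Yield (post id, HTML body) pairs from <row Id="..." Body="..."/> elements.

    Assumption: posts are stored as attributes on <row> elements.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row" and elem.get("Body") is not None:
            yield elem.get("Id"), elem.get("Body")
        # Drop the element's attributes and children once it has been handled.
        elem.clear()


def pandoc_command():
    """Command line for an HTML-to-Markdown conversion over stdin/stdout."""
    return ["pandoc", "--from", "html", "--to", "markdown"]


def html_to_markdown(html, run=subprocess.run):
    """Pipe one HTML fragment through pandoc and return its stdout.

    `run` is injectable so the function can be exercised without pandoc.
    """
    result = run(
        pandoc_command(),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Looping over `iter_bodies("posts.xml")` (a placeholder filename) and calling `html_to_markdown` on each body starts one pandoc process per post, which should be tolerable for a few hundred entries. The result is Markdown that renders to similar HTML, not necessarily the text the author originally typed.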

Josie answered 15/9, 2009 at 16:37 Comment(1)
Looks awesome! I will definitely check it out. - Whitman
