How can doc/docx files be converted to markdown or structured text?

12

123

Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?

PS: Ideally, text set in a specific font (e.g. Consolas) in the MS Word document would be rendered as code: ```....```.

Agonize answered 5/5, 2013 at 9:41 Comment(4)
Looks like pandoc now supports direct conversion from .docx to .md including math formulas. Take a look here, example 35. - Jedjedd
Check out wordtomarkdown.com. There is a useful app in the Windows store. It does way more than Pandoc, including tables, images, and code. - Conscience
wordtomarkdown.com has a ransom virus at time of writing. - Mononucleosis
Pandoc is often cited, and is open source. See its page on Wikipedia. - Bhayani
158

Pandoc supports conversion from docx to markdown directly:

pandoc -f docx -t markdown foo.docx -o foo.markdown

Several markdown formats are supported:

-t gfm (GitHub-Flavored Markdown)  
-t markdown_mmd (MultiMarkdown)  
-t markdown (pandoc’s extended Markdown)  
-t markdown_strict (original unextended Markdown)  
-t markdown_phpextra (PHP Markdown Extra)  
-t commonmark (CommonMark Markdown)  
Cursive answered 15/10, 2015 at 13:31 Comment(9)
Tested and working on OS X El Capitan using homebrew (brew install pandoc). - Supertax
Word tables did not convert properly - just ended up plain text in MD. - Corazoncorban
Any way for it to save the images? - Rimini
Regarding the question about saving images out of a Word file: Save the Word document as HTML. Word places all of the document's images in a separate folder. There are options to save as either JPG or PNG... - Discoloration
To save the images, add the option --extract-media=./ to the command above. It will create a folder media with all the images and they will be correctly shown in the markdown file. - Steady
This answer does it recursively in a directory. - Columbary
@WestCoastProjects, when I used -t gfm, the tables converted for me. - Jeremiad
@Cursive @noraj noob here - where would I run that pandoc -f docx -t markdown foo.docx -o foo.markdown command? When I type into my RStudio console, I get Error: unexpected symbol in "pandoc -f docx" - Micaelamicah
With PowerShell, simply: dir *.docx -Recurse | % {pandoc -f docx -t markdown $_ -o "$($_.BaseName).md"} - Coyle
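The recursive conversion and the image extraction mentioned in the comments can be combined in a small Bash function. This is only a sketch: the helper name convert_tree and the "<name>_media" folder naming are my own assumptions, not anything pandoc prescribes. It runs pandoc from inside each document's directory so that the image links written into the Markdown stay relative to the output file.

```shell
#!/usr/bin/env bash
# Sketch: convert every .docx below a directory to GitHub-Flavored Markdown
# and extract embedded images next to each output file.
# convert_tree is a hypothetical helper name, not part of pandoc.
convert_tree() {
  local root="${1:-.}"
  # -print0 with read -d '' keeps file names containing spaces intact
  find "$root" -type f -name '*.docx' -print0 |
    while IFS= read -r -d '' src; do
      local dir base
      dir="$(dirname "$src")"
      base="$(basename "$src" .docx)"
      # Subshell: the cd does not leak into later iterations
      (
        cd "$dir" &&
          pandoc -f docx -t gfm \
            --extract-media="./${base}_media" \
            "$base.docx" -o "$base.md"
      )
    done
}
```

Usage would be something like `convert_tree ./docs`, which writes foo.md beside each foo.docx.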
33

docx -> markdown

Specifically regarding the question (docx --> markdown), use the Writage plugin for Microsoft Word. It also works the other way round (markdown --> docx).

More Options

  1. Use a Conversion Tool for multi-file conversion.
  2. Use a WYSIWYG Editor for single files and superior fonts.

Which Conversion Tools?

I've tested these three: (1) Pandoc (2) Mammoth (3) w2m


Pandoc

By far the superior tool for conversions with support for a multitude of file types (see Pandoc's man page for supported file types):

pandoc -f docx -t gfm somedoc.docx -o somedoc.md

NB
  • To get pandoc to export markdown tables ('pipe_tables' in pandoc) use multimarkdown or gfm output formats.

  • If converting to PDF, pandoc uses LaTeX for this, so you may need to install a LaTeX distribution for your OS if the command does not work out of the box. Instructions at LaTeX Installation


Which WYSIWYG Editors?

For docx, use Writage.


Maintaining Superior Fonts

If you wish to preserve Unicode characters and emojis and maintain superior fonts, you'll get some mileage from the editors below when using copy-and-paste operations between file formats. Note, these do not read or write natively to docx.

Programmatic Equivalent

For a programmatic equivalent, you might get some results by calling a different pdf-engine and its respective options, but I haven't tested this. Pandoc defaults to 'pdflatex'.

pandoc --pdf-engine=
pandoc --pdf-engine-opt=STRING

Update: A4 vs US Letter

Outside the US, set the geometry variable:

pandoc -s -V geometry:a4paper -o outfile.pdf infile.md
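As an illustration of switching the PDF engine together with the paper-size variable, here is a minimal Bash sketch. It assumes xelatex is installed, which handles Unicode text and system fonts; the helper name md_to_pdf and the default font "DejaVu Serif" are assumptions chosen for the example, so substitute a font that exists on your system.

```shell
#!/usr/bin/env bash
# Sketch: Markdown to A4 PDF through xelatex instead of the default pdflatex.
# md_to_pdf is a hypothetical helper; the default font is an assumption.
md_to_pdf() {
  local src="$1"
  local font="${2:-DejaVu Serif}"
  # mainfont is a pandoc template variable honoured by xelatex and lualatex
  pandoc -s "$src" \
    --pdf-engine=xelatex \
    -V geometry:a4paper \
    -V mainfont="$font" \
    -o "${src%.md}.pdf"
}
```

Calling `md_to_pdf notes.md` would write notes.pdf next to the input.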

Footnote

It's worth mentioning something that's not obvious when discovering Markdown: MultiMarkdown is by far the most feature-rich Markdown format.

MultiMarkdown supports, amongst other things, metadata, table of contents, footnotes, maths, tables and YAML.

But GitHub's default format is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.

Vern answered 4/11, 2018 at 10:4 Comment(1)
Check out wordtomarkdown.com. There is a useful app in the Windows store. It does way more than Pandoc, including tables, images, and code. - Conscience
12

Mammoth is best known as a Word to HTML converter but it now supports a Markdown writer module. When I last checked, Mammoth Markdown support was still in its early stages, so you may find some features are unsupported. As usual ... check the website for the latest details.

Install

To use the JavaScript version, install Node.js and then install Mammoth:

npm install -g mammoth

Command line

Command line to convert a Word document to Markdown ...

mammoth document.docx --output-format=markdown

API

NodeJS API to convert to Markdown ...

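var mammoth = require("mammoth");
// convertToMarkdown returns a promise; the Markdown text is in result.value
mammoth.convertToMarkdown({path: "path/to/document.docx"})
    .then(function(result) { console.log(result.value); });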

Features:

Mammoth Markdown writer currently supports:

  • Lists (numbered and bulleted)
  • Links
  • Font styles such as bold, italic
  • Images

The Mammoth command line tools and API have been ported to several languages:

With NO Markdown (May 2016):

With Markdown:

Corella answered 21/5, 2016 at 4:30 Comment(2)
mammoth document.docx --output-format=markdown > document.md worked for me to generate a converted file, as it seems there is still no support for doing so directly. - Bren
Attention: "Markdown support is deprecated." github.com/mwilliamson/mammoth.js#markdown - Liguria
12

Given that you asked this question on Stack Overflow, you're probably after a programmatic or command-line solution, which I've covered in another answer.

However, an alternative solution might be to use the Writage Markdown plugin for Microsoft Word.

Writage turns Word into your Markdown WYSIWYG editor, so you will be able to open a Markdown file and edit it like you normally edit any document in Microsoft Word. Also it will be possible to save your Word document as a Markdown file without any other converters.

Under the covers, Writage uses Pandoc, which you'll also need to install for the plugin to work.

It currently supports the following Markdown elements:

  • Headings
  • Lists (numbered and bulleted)
  • Links
  • Font styles such as bold, italic
  • Tables
  • Footnotes

This might be the ideal solution for many end users, as they won't need to install or run any command line tools and can just stick with what they are most familiar with.

Corella answered 21/5, 2016 at 4:59 Comment(2)
Worth noting that Writage is Windows-only. I've emailed the author to ask about OS X. - Imagination
Also worth noting that it is a paid application (as of this writing, at least). - Pythagoreanism
8

You can use Word to Markdown (Ruby Gem) to convert it in one step. Conversion can be as simple as:

$ gem install word-to-markdown
$ w2m path/to/document.docx

It routes the document through LibreOffice, but also does its best to infer semantic headings from their relative font size.

There's also a hosted version which would be as simple as drag-and-drop to convert.

Uxorious answered 12/1, 2015 at 17:12 Comment(2)
Thanks for sharing the hosted version; I like that versus installing binaries on my computer. - Prisca
The hosted version seems gone :-( - Bhayani
5

Word to Markdown might be worth a shot. Alternatively, follow the procedure described here, which uses Calibre and Pandoc via HTMLZ. Here's the bash script they use:

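#!/bin/bash
# Convert a Word document ($1) to Markdown via Calibre (HTMLZ) and Pandoc.
set -e  # stop at the first failing step
mkdir temp
cp "$1" temp
cd temp
# The copy sits directly inside temp/, so refer to it by base name
ebook-convert "$(basename "$1")" output.htmlz
unzip output.htmlz
cd ..
pandoc -f html -t markdown -o output.md temp/index.html
rm -R temp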
Eamon answered 17/11, 2014 at 9:9 Comment(5)
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - Unkenned
@EduardLuca Sorry, but do you want me to actually post the source codes? Oh, you mean the latter, yes, I could summarize this... - Eamon
The idea behind link-only answers is that the content of those sites may change, or may be removed in the future, so it's best to include any relevant information from those sites in your answer, so that it helps people in the future, even if the links change. This relevant information can be either code, or a summary of the content of the sites you're linking to. - Unkenned
@EduardLuca I'm aware of that (and have in fact flagged/downvoted many link-only-answers myself), though I was hoping linking to the tools would provide enough of a start here. There really isn't more information conveyed... - Eamon
I think the code you posted helps a lot. I was referring to networkcultures.org/digitalpublishing/2013/08/30/… which contains detailed instructions, but it's good you posted the summarized bash :) - Unkenned
4

You can convert Word documents from within MS Word to Markdown using this Visual Basic Script:

https://gist.github.com/hawkrives/2305254

Follow the instructions under "To use the code" to create a new Macro in Word.

Note: This converts the currently open Word document to Markdown, which removes all the Word formatting (headings, lists, etc.). First save the Word document you plan to convert, and then save it again as a new document before running the macro. This way you can always go back to the original Word document to make changes.

There are more examples of Word to markdown VB scripts here:

https://www.mediawiki.org/wiki/Microsoft_Word_Macros

Frequentative answered 1/6, 2015 at 14:8 Comment(0)
3

From here:

unoconv -f html test.docx
pandoc -f html -t markdown -o test.md test.html
Burgee answered 18/6, 2015 at 14:28 Comment(0)
1

Here's an open-source web application built in Ruby to do this exact thing: https://word2md.com

Lac answered 7/9, 2019 at 18:56 Comment(0)
0

If you're using Linux, try Pandoc (first convert the .doc/.docx into HTML with LibreOffice or a similar tool, and then run Pandoc on the result).
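A two-step route like this can be sketched in Bash. As a variation, the sketch asks LibreOffice for .docx rather than HTML, since pandoc can read .docx directly; that substitution is mine, not the answer's. It assumes the LibreOffice binary is on the PATH as soffice (some systems call it libreoffice), and doc_to_md is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: legacy .doc -> .docx (LibreOffice, headless) -> Markdown (pandoc).
# doc_to_md is a hypothetical helper; soffice on PATH is an assumption.
doc_to_md() {
  local src="$1"
  local base tmp
  local status=0
  base="$(basename "${src%.*}")"
  tmp="$(mktemp -d)"
  # LibreOffice writes <base>.docx into the --outdir directory
  soffice --headless --convert-to docx --outdir "$tmp" "$src" >/dev/null &&
    pandoc -f docx -t gfm "$tmp/$base.docx" -o "${src%.*}.md" ||
    status=$?
  rm -rf "$tmp"   # clean up the intermediate file either way
  return "$status"
}
```

For example, `doc_to_md old_report.doc` would leave old_report.md beside the original file.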

On Windows (or if Pandoc doesn't work), you can try this website (online demo, you can download it): Markdownify

Py answered 5/5, 2013 at 9:54 Comment(1)
Markdownify spews "Strict Standards:"-PHP messages, and pandoc -f html -t markdown -s mydoc.html -o mydoc.md resulted in pure/non-restructured text (i.e. same as copy&paste to a text editor). What is your experience with these two? - Agonize
0

For bulleted lists, you can paste a list into Sublime Text and use multiselect (tested) or find and replace (not tested) to replace, e.g., the proprietary MS Word bullet characters with -, --, etc.

This doesn't work with headings but it may be possible to use a similar technique with other elements.

Tannate answered 11/9, 2015 at 17:53 Comment(0)
0

For .doc Word files:

antiword -f some_file.doc

antiword's homepage: http://www.winfield.demon.nl/

Ubangi answered 16/10, 2021 at 12:32 Comment(0)
