How to avoid img size tags on markdown when converting docx to markdown?
I'm converting docx files using pandoc 1.16.0.2 and everything works great, except that right after each image the size attributes show up as text in the output:

![](./media/media/image4.png){width="3.266949912510936in"
height="2.141852580927384in"}

So it shows the image fine in the md but also the size tag as plain text right behind/after/below each image. The command I'm using is:

pandoc --extract-media ./media2 -s word.docx -t markdown -o exm_word2.md

I've read the manual as best I can but don’t see any flags to control this. Most searches turn up people who want to keep the attributes and control them.

Any suggestions to kill the size attributes or is my markdown app (MarkdownPad2 - v-2.5.x) reading this md wrong?

Tinware answered 27/1, 2017 at 21:36 Comment(0)
S
6

There are two ways to do this: either remove all image attributes with a Lua filter or choose an output format that doesn't support attributes on images.

Output format

The easiest (and most standard-compliant) method is to convert to commonmark. However, CommonMark allows raw HTML snippets, so pandoc tries to be helpful and creates an HTML <img> element for images with attributes. We can prevent that by disabling the raw_html format extension:

pandoc --to=commonmark-raw_html ...

If you intend to publish the document on GitHub, then GitHub Flavored Markdown (gfm) is a good choice.

pandoc --to=gfm-raw_html ...

For pandoc's Markdown, we have to also disable the link_attributes extension:

pandoc --to=markdown-raw_html-link_attributes ...

This last method is the only one that works with older (pre-2.0) pandoc versions; all other suggestions here require newer versions.
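
As a rough sketch, the three format strings above could be selected from a small Python wrapper. The helper names `pandoc_args` and `convert` and the `flavor` parameter are assumptions made for illustration and are not part of pandoc; only the `--to`, `--extract-media`, and `-o` options are real pandoc options.

```python
import subprocess

# Output formats from the answer above that keep image attributes out of the result.
NO_ATTR_FORMATS = {
    "commonmark": "commonmark-raw_html",
    "gfm": "gfm-raw_html",
    "markdown": "markdown-raw_html-link_attributes",
}


def pandoc_args(src, dest, flavor="markdown", media_dir=None):
    """Build the pandoc argument list (hypothetical helper, not a pandoc API)."""
    args = ["pandoc", src, "--to=" + NO_ATTR_FORMATS[flavor], "-o", dest]
    if media_dir is not None:
        args.append("--extract-media=" + media_dir)
    return args


def convert(src, dest, flavor="markdown", media_dir=None):
    """Run pandoc; requires the pandoc executable on PATH."""
    subprocess.run(pandoc_args(src, dest, flavor, media_dir), check=True)
```

Calling `convert("word.docx", "exm_word2.md", media_dir="./media2")` would then run the pandoc-Markdown variant with the file names used in the question.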

Lua filter

The filter is straightforward: it simply removes all attributes from all images.

-- Replace each image's attributes (id, classes, width, height, ...) with an empty set.
function Image (img)
  img.attr = pandoc.Attr{}
  return img
end

To apply the filter, we need to save the above into a file no-img-attr.lua and pass that file to pandoc with

pandoc --lua-filter=no-img-attr.lua ...
Slipsheet answered 9/12, 2022 at 10:45 Comment(2)
For my needs, the pandoc --to=gfm-raw_html ... worked perfectly. - Pave
I created the lua filter, and it works well. This should be the accepted answer. Thank you! - Fbi
S
5

Pass -w gfm on the command line to omit the image dimensions.

Seraphic answered 24/7, 2019 at 12:35 Comment(2)
This flag took care of it for me. - Michellemichels
It would be great if you linked some docs. What this actually does is switch the writer to --write=gfm. The manual describes it as "gfm (GitHub-Flavored Markdown), or the deprecated and less accurate markdown_github; use markdown_github only if you need extensions not supported in gfm." Also, it doesn't work anymore: images are converted to HTML, as in <img src="./images/media/image1.png" style="width:6.5in;height:3.73611in" />, instead of Markdown image syntax. - Farfamed
G
4

You could write a filter to do this. You'll need to install panflute. Save this as remove_img_size.py:

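import panflute as pf


def remove_img_size(elem, doc):
    """Drop the width and height attributes from every image."""
    if isinstance(elem, pf.Image):
        # pop() with a default avoids a KeyError when an attribute is absent
        elem.attributes.pop('width', None)
        elem.attributes.pop('height', None)
    return elem


if __name__ == "__main__":
    pf.run_filter(remove_img_size)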

Then run pandoc with the filter:

pandoc word.docx -F remove_img_size.py -o exm_word2.md
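
If installing panflute is not an option, a dependency-free alternative is to post-process the generated Markdown. This is not the filter approach above but a swapped-in regex pass, shown only as a minimal sketch: the name `strip_image_attrs` is made up for illustration, the pattern assumes the alt text contains no `]` and the URL no `)`, and it would also rewrite matching text inside code blocks.

```python
import re

# An inline image followed directly by a {...} attribute block.
# The negated character class also matches newlines, so a block that
# wraps onto the next line (as in the question) is covered.
IMG_ATTRS = re.compile(r'(!\[[^\]]*\]\([^)]*\))\{[^{}]*\}')


def strip_image_attrs(markdown_text):
    """Return the text with the attribute block after each image removed."""
    return IMG_ATTRS.sub(r'\1', markdown_text)
```

The function takes the Markdown as a string, so it can be applied to the contents of exm_word2.md after pandoc has written it.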
Gingerich answered 29/1, 2017 at 16:44 Comment(3)
I keep getting "pandoc: Error running filter remove_img_size.py fd:4: hPutBuf: resource vanished (Broken pipe)" despite being able to run that filter directly in python3. It's on a docker container that may be hosed, so I will rebuild and test. Thank you! - Tinware
@Tinware looks like a Haskell error, so it might have to do with your pandoc version (I tested with pandoc 17.1). - Gingerich
Error:
Traceback (most recent call last):
  File "remove_img_size.py", line 1, in <module>
    import panflute as pf
ModuleNotFoundError: No module named 'panflute'
Error running filter remove_img_size.py: Filter returned error status 1
- Farfamed
