Pandoc - HTML to Markdown - remove all attributes

This would seem like a simple thing to do, but I've been unable to find an answer. I'm converting from HTML to Markdown using Pandoc and I would like to strip all attributes from the HTML such as "class" and "id".

Is there an option in Pandoc to do this?

Bouzoun answered 6/2, 2017 at 14:52 Comment(2)
You can write a Pandoc filter to do that. If you use panflute, do something like elem.identifier = '', elem.classes = [], elem.attributes = {} within the filter. Since only a few elements have attributes, you can wrap it in a try clause (or check the element's slots to find out whether it has attributes). – Jaella
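
The panflute suggestion in this comment could look roughly like the sketch below. It assumes the third-party panflute package is installed; the names strip_attrs and main are illustrative, not part of panflute.

```python
def strip_attrs(elem, doc=None):
    """Filter action: blank out identifier, classes and key-value attributes."""
    try:
        elem.identifier = ""
        elem.classes = []
        elem.attributes = {}
    except AttributeError:
        # Element types without attributes (plain text, for instance)
        # reject the assignment, so they are left untouched.
        pass


def main(doc=None):
    # Imported here so the action above can be exercised without panflute.
    import panflute as pf  # third-party: pip install panflute
    return pf.run_filter(strip_attrs, doc=doc)
```

To use this as a filter, add a call to main() at the end of the file and pass the script to pandoc with --filter.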
You can try disabling extensions: pandoc -t markdown-header_attributes-link_attributes-native_divs-native_spans and so forth, or yes, write a pandoc filter. – Fatuity
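
As a rough sketch of the extension-disabling approach, the command line can be assembled programmatically. The extension names are the four from the comment above; the function names and file names are placeholders, and convert() assumes pandoc is on the PATH.

```python
import subprocess

# Extensions named in the comment above; extend the list as needed.
DISABLED_EXTENSIONS = [
    "header_attributes",
    "link_attributes",
    "native_divs",
    "native_spans",
]


def build_pandoc_args(source, target, base_format="markdown",
                      disabled=DISABLED_EXTENSIONS):
    """Return the pandoc argument list with each extension disabled via '-ext'."""
    output_format = base_format + "".join(f"-{ext}" for ext in disabled)
    return ["pandoc", source, "-f", "html", "-t", output_format, "-o", target]


def convert(source, target):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_args(source, target), check=True)
```

For example, convert("input.html", "output.md") runs pandoc with -t markdown-header_attributes-link_attributes-native_divs-native_spans.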

Consider input.html:

<h1 class="test">Hi!</h1>
<p><strong id="another">This is a test.</strong></p>

Then, pandoc input.html -t gfm-raw_html -o output.md

produces output.md:

# Hi!

**This is a test.**

Without -t gfm-raw_html, you would get:

# Hi! {#hi .test}

**This is a test.**

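This question is actually similar to this one. Note that the #hi identifier above is generated by pandoc from the heading text rather than taken from the HTML, and that the id attribute on the <strong> element is not preserved in this example.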

Fawcett answered 18/7, 2020 at 1:21 Comment(0)

You can use a Lua filter to remove all attributes and classes. Save the following to a file remove-attr.lua and call pandoc with --lua-filter=remove-attr.lua.

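-- Replace the attributes (identifier, classes, key-value pairs)
-- of any element that has them with an empty Attr.
function remove_attr (x)
  if x.attr then
    x.attr = pandoc.Attr()
    return x
  end
  -- Elements without an attr field are left unchanged (nil is returned).
end

-- Apply the function to every inline and block element.
return {{Inline = remove_attr, Block = remove_attr}}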
Massasauga answered 19/2, 2021 at 12:57 Comment(1)
Trying to discard artifacts from HTML that originated from a Microsoft Office product, containing <span style="font-family:&quot;Arial&quot;,sans-serif;color:black">value</span> wrappers around the value of every cell in a table. This worked as well as -t gfm-raw_html. Thanks! – Lottie

I was also surprised that a web search for this seemingly simple operation turned up nothing. I ended up writing the following, based on the BeautifulSoup documentation and example usages from other SO answers.

The code below also removes the script and style HTML tags, and it preserves any src and href attributes. Both behaviors can be adapted to your needs; you can then use pandoc to convert the returned HTML to Markdown.

# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString

def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()
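
If installing BeautifulSoup is not an option, the same idea can be sketched with only the standard library's html.parser. This is a different technique from the code above, not a drop-in replacement: it re-emits the tags as it parses them, drops comments and doctype declarations, and does not repair malformed markup. The names AttrStripper and strip_attributes are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"src", "href"}     # attributes to preserve, as in the code above
DROP_TAGS = {"script", "style"}  # elements removed together with their content


class AttrStripper(HTMLParser):
    """Re-emit HTML with every attribute removed except those in KEEP_ATTRS."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _open_tag(self, tag, attrs, closing=">"):
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        return f"<{tag}{kept}{closing}"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS and not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_attributes(html):
    """Return html with attributes stripped and script/style elements removed."""
    parser = AttrStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The returned string can then be handed to pandoc, for example on standard input with -f html.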
Heilbronn answered 18/2, 2021 at 22:3 Comment(0)
