I am also surprised that a web search for this seemingly simple operation turned up nothing. I ended up writing the following by referring to the BeautifulSoup docs and example usages from other SO answers.
Besides stripping attributes, the code below also removes the `script` and `style` HTML tags entirely. On top of that, it preserves any `src` and `href` attributes. These two behaviours should leave enough flexibility to fit your needs (i.e. adapt the code as needed, then use pandoc to convert the returned HTML to Markdown).

    # https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
    from bs4 import BeautifulSoup, NavigableString

    def unstyle_html(html):
        soup = BeautifulSoup(html, features="html.parser")

        # remove all attributes except for `src` and `href`
        for tag in soup.descendants:
            keys = []
            if not isinstance(tag, NavigableString):
                for k in tag.attrs.keys():
                    if k not in ["src", "href"]:
                        keys.append(k)
                for k in keys:
                    del tag[k]

        # remove all script and style tags
        for tag in soup.find_all(["script", "style"]):
            tag.decompose()

        # return html text
        return soup.prettify()
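
For the pandoc step mentioned above, here is a minimal sketch of piping the cleaned HTML through pandoc from Python. It assumes pandoc is installed and on your `PATH`. The helper names `pandoc_command` and `html_to_markdown` are made up for illustration, and the list of disabled extensions is only an assumed starting point that you will probably want to tune.

```python
import shutil
import subprocess

# Pandoc Markdown extensions to switch off so the output carries no
# attribute blocks or raw <div>/<span> wrappers.
# This list is an assumption, not a definitive set; adjust to taste.
DISABLED_EXTENSIONS = [
    "header_attributes",
    "link_attributes",
    "native_divs",
    "native_spans",
]


def pandoc_command(disabled=DISABLED_EXTENSIONS):
    """Build the pandoc argument list for an HTML -> Markdown conversion."""
    # Each "-name" suffix disables one extension of the markdown writer.
    target = "markdown" + "".join(f"-{ext}" for ext in disabled)
    return ["pandoc", "-f", "html", "-t", target]


def html_to_markdown(html):
    """Pipe an HTML string through pandoc (via stdin) and return Markdown."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=html,
        capture_output=True,
        encoding="utf-8",  # pandoc reads and writes UTF-8
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

With that in place, something like `markdown = html_to_markdown(unstyle_html(raw_html))` should do the whole conversion, where `raw_html` is whatever HTML string you started with.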
`pandoc -t markdown-header_attributes-link_attributes-native_divs-native_spans` and so forth... or yes, write a pandoc filter – Fatuity