What Worked For Me On Python 3.10 With BS4 And Unwrap
I initially liked Jesse Dhillon's answer a lot. However, I kept running into issues with its recursive calls, because the BS4 parser gets called again on each one. I tried changing the recursion depth, but I kept running into problems with that too.
Then I looked into applying Bishwas Mishra's answer. Due to changes in BS4, I had to modify his code a bit, and I finally was able to develop a piece of code that would remove tags and maintain content.
I hope this helps some others.
    from bs4 import BeautifulSoup

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    soup = BeautifulSoup(html, "html5lib")

    # Tags to strip while keeping their contents
    for c in ["html", "head", "body", "b", "i", "u"]:
        # Unwrap one match at a time until none are left
        while (tag := soup.find(c)) is not None:
            tag.unwrap()

    print(soup)
NOTE: It is necessary to add "html", "head", and "body" to the invalid tags list, because with the html5lib parser BS4 adds those tags to your HTML if they were not originally there, and I did not want them for my specific case.
The output I got from the above code was ...
    <p>Good, bad, and ugly</p>
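If you would rather not list "html", "head", and "body" at all, one possible variation is to parse with Python's built-in "html.parser", which does not wrap a fragment in those tags. Below is a minimal sketch of that idea, not the code I actually ran; the helper name `strip_tags` and its parameters are made up for illustration. It unwraps every match from a single `find_all` call instead of looping with `find`.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove the named tags but keep their contents (hypothetical helper)."""
    # Assumption: html.parser is acceptable for your input; unlike html5lib,
    # it does not add <html>, <head>, or <body> around a fragment.
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and returns all matches up front,
    # so unwrapping inside the loop does not disturb the iteration.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()
    return str(soup)


result = strip_tags(
    "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>",
    ["b", "i", "u"],
)
print(result)
```

Keep in mind that html.parser and html5lib can treat malformed markup differently, so if your input is messy the two approaches may not give identical results.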