BeautifulSoup: Strip specified attributes, but preserve the tag and its contents

I'm trying to 'defrontpagify' the html of a MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it.

However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet:

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone see the problem here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.

Continent answered 28/1, 2012 at 9:3 Comment(2)
For me, it works if soup.findAll(attribute=True) is changed to simply soup.findAll(). – Cronyism
Nice catch, that does indeed work. Pretty obvious in hindsight, don't need to check the attribute value twice. Only problem is it checks all the attributes of every tag in the doc, and takes twice as long to run, but 5s vs 2.5s for ~15 pages isn't a big deal here. – Continent
R
12

The line

for tag in soup.findAll(attribute=True):

does not find any tags. There might be a way to use findAll, I'm not sure.

However, this works (as of beautifulsoup 4.8.1):

import bs4
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
# Name the parser explicitly so bs4 does not warn about guessing one
soup = bs4.BeautifulSoup(doc, 'html.parser')
for tag in soup.descendants:
    if isinstance(tag, bs4.element.Tag):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in REMOVE_ATTRIBUTES}
print(soup.prettify())

This is previous code that may have worked with an older version of beautifulsoup:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code will only work in Python 3. If you need it to work in Python 2, see Nóra's answer below.

Repulsive answered 28/1, 2012 at 13:48 Comment(2)
Good enough, thanks! As for findAll, I'm sure I'm just referencing the attribute variable wrong somehow, since hardcoding the attribute name in its place does work. Will dig into that more on the next pass, after I get the whole script working. – Continent
Apparently recursiveChildGenerator is deprecated: beautiful-soup-4.readthedocs.io/en/latest/#generators. The docs say descendants. Also, it looks like attributes are now a dictionary: beautiful-soup-4.readthedocs.io/en/latest/#attributes. – Alithea
M
7

Here's a Python 2 version of unutbu's answer:

# Assumes the bs4 package, where tag.attrs is a dict (the original omitted the import)
import bs4

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = bs4.BeautifulSoup(doc, 'html.parser')

for tag in soup.recursiveChildGenerator():
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}
Manikin answered 11/10, 2016 at 11:16 Comment(0)
A
6

Just for the record: the problem here is that when you pass HTML attributes as keyword arguments, the keyword itself is the name of the attribute. So your code is searching for tags that have an attribute literally named attribute, because the loop variable is never substituted into the keyword.

This is why

  1. hard-coding your attribute name worked[0]
  2. the code does not fail. The search just doesn't match any tags

To fix the problem, pass the attribute you are looking for as a dict:

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

Hth someone in the future, dtk

[0]: Although it needs to be find_all(style=True) in your example, without the quotes, because a quoted keyword raises SyntaxError: keyword can't be an expression
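
To make the difference concrete, here is a small sketch, assuming bs4 with the built-in html.parser. The sample HTML and the variable name attr_name are made up for illustration; one tag deliberately carries an attribute that is literally called attribute.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first <p> has an attribute literally named "attribute"
html = '<p align="center" attribute="x">one</p><p style="color:red">two</p>'
soup = BeautifulSoup(html, "html.parser")

attr_name = "style"  # hypothetical loop variable

# Keyword form: the keyword text is the attribute name, so this looks for
# an HTML attribute called "attribute", whatever any variable holds.
by_keyword = soup.find_all(attribute=True)

# Dict form: the variable's value is used as the attribute name.
by_dict = soup.find_all(attrs={attr_name: True})

print([t.get_text() for t in by_keyword])  # ['one']
print([t.get_text() for t in by_dict])     # ['two']
```

The dict form is the one that works inside a loop over attribute names.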

Alexei answered 13/7, 2018 at 12:4 Comment(0)
T
4

I use this one:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://mcmap.net/q/912896/-removing-style-from-specific-tags-beautifulsoup-python
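
A possible shortcut, assuming tag.attrs behaves like a regular dict (as it does in bs4): dict.pop accepts a default, so the membership check can be dropped. The div markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div align="left" id="box">text</div>', "html.parser")
div = soup.div

div.attrs.pop("align", None)  # removes the attribute
div.attrs.pop("align", None)  # already gone: no KeyError thanks to the default

print(div)  # <div id="box">text</div>
```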

Transeunt answered 16/11, 2018 at 15:3 Comment(0)
G
2

I use this method to remove a list of attributes; it is very compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
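
The loop above walks the whole document once per attribute name. A single-pass variant might look like the sketch below, assuming bs4; the helper name strip_attributes and the sample table are hypothetical, not part of any library.

```python
from bs4 import BeautifulSoup

def strip_attributes(soup, names):
    """Remove every attribute listed in `names` from every tag, in place."""
    unwanted = set(names)  # set lookup instead of scanning a list each time
    for tag in soup.find_all(True):  # True matches every tag, skipping text nodes
        tag.attrs = {k: v for k, v in tag.attrs.items() if k not in unwanted}
    return soup

html = '<table border="1" id="t"><tr><td align="left" class="x">cell</td></tr></table>'
soup = strip_attributes(BeautifulSoup(html, "html.parser"), ["border", "align"])
print(soup)  # <table id="t"><tr><td class="x">cell</td></tr></table>
```

Tags and their contents are left untouched; only the listed attributes disappear.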


Gambier answered 16/5, 2020 at 17:58 Comment(0)
