Beautiful Soup - Get all text, but preserve link html?

I have to process a large archive of extremely messy HTML full of extraneous tables, spans and inline styles into markdown.

I am trying to use Beautiful Soup to accomplish this task, and my goal is basically the output of the get_text() function, except to preserve anchor tags with the href intact.

As an example, I would like to convert:

<td>
    <font><span>Hello</span><span>World</span></font><br>
    <span>Foo Bar <span>Baz</span></span><br>
    <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
</td>

Into:

Hello World
Foo Bar Baz
Example Link: <a href="https://google.com">Google</a>

My thought process so far was to simply grab all the tags and unwrap them all if they aren't anchors, but this causes the text to be repeated several times as soup.find_all(True) returns recursively nested tags as individual elements:

#!/usr/bin/env python

from bs4 import BeautifulSoup

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(True)

for tag in tags:
    if (tag.name == 'a'):
        print("<a href='{}'>{}</a>".format(tag['href'], tag.get_text()))
    else:
        print(tag.get_text())

Which returns multiple fragments/duplicates as the parser moves down the tree:

HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorldFoo Bar BazExample Link: Google
HelloWorld
Hello
World

Foo Bar Baz
Baz

Example Link: Google
<a href='https://google.com'>Google</a>
Foliate answered 26/8, 2018 at 12:30 Comment(1)
Do you want to remove styles and other link attributes too? Your input and output suggest that. – Fayfayal
S
7

One way to tackle this is to add special handling for a elements when the text of an element is collected.

You can do that by overriding the _all_strings() method so that it yields the string representation of each a descendant and skips the navigable strings inside a elements. Note that _all_strings() is a private method, so its signature may differ between Beautiful Soup versions. Something along these lines:

from bs4 import BeautifulSoup, NavigableString, CData, Tag


class MyBeautifulSoup(BeautifulSoup):
    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            # return "a" string representation if we encounter it
            if isinstance(descendant, Tag) and descendant.name == 'a':
                yield str(descendant)

            # skip an inner text node inside "a"
            if isinstance(descendant, NavigableString) and descendant.parent.name == 'a':
                continue

            # default behavior
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

Demo:

In [1]: data = """
   ...: <td>
   ...:     <font><span>Hello</span><span>World</span></font><br>
   ...:     <span>Foo Bar <span>Baz</span></span><br>
   ...:     <span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span>
   ...: </td>
   ...: """

In [2]: soup = MyBeautifulSoup(data, "lxml")

In [3]: print(soup.get_text())

HelloWorld
Foo Bar Baz
Example Link: <a href="https://google.com" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;" target="_blank">Google</a>
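If relying on the private _all_strings() hook is a concern, a similar result can be sketched with public API only. The function name text_with_links is hypothetical, and turning br into a newline and keeping only the href attribute are assumptions based on the output the question asks for. The href value is inserted as-is, without escaping.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_with_links(root):
    """Collect the text under root, keeping <a href=...> as minimal HTML.

    Hypothetical helper: the name and the <br> -> newline rule are assumptions.
    """
    parts = []
    for node in root.descendants:
        if isinstance(node, Tag):
            if node.name == "a" and node.has_attr("href"):
                # Emit the anchor once, with only its href attribute.
                parts.append('<a href="{}">{}</a>'.format(node["href"], node.get_text()))
            elif node.name == "br":
                parts.append("\n")
        elif type(node) is NavigableString:
            # Plain text only; comments and other string subclasses are skipped.
            # Text inside an anchor was already emitted with the anchor itself.
            if node.find_parent("a", href=True) is None:
                parts.append(str(node))
    return "".join(parts)


# Shortened version of the question's example (style attribute trimmed).
example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank" '
    'style="color: #395c99;">Google</a></span></td>'
)

result = text_with_links(BeautifulSoup(example_html, "html.parser"))
print(result)
```

Because this walks the tree from the outside, it works on any Tag as well as on the soup object, and it does not depend on how get_text() is implemented internally.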
Stool answered 26/8, 2018 at 13:53 Comment(5)
Amazing, thank you. This is a flexible, elegant solution that I never would have thought of. I made a small adjustment to the handling of the a tag to get the output I wanted, and it's perfect. – Foliate
How could it return only the href attribute? Like: Example Link: <https://google.com> – Horrocks
Answering my own question, just change: if isinstance(descendant, Tag) and descendant.name == 'a': yield str(descendant) To: if isinstance(descendant, Tag) and descendant.name == 'a': yield str('<{}> '.format(descendant.get('href', ''))) – Horrocks
I wonder whether the library version has changed. Running the example code fails with: File "/Users/xhuang9/pros/station1/t1.py", line 20, in _all_strings (types is not None and type(descendant) not in types)): TypeError: argument of type 'object' is not iterable – Adulterate
I have the same issue as @alextre. It would be great if someone could share an update. – Carbazole
C
2

To consider only direct children, set recursive=False. You then need to process each 'td' and extract the text and the anchor link individually.

#!/usr/bin/env python
from bs4 import BeautifulSoup

example_html = '<td><font><span>Some Example Text</span></font><br><span>Another Example Text</span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'lxml')
tags = soup.find_all(recursive=False)
for tag in tags:
    print(tag.text)
    print(tag.find('a'))

If you want the text printed on separate lines you will have to process the spans individually.

for tag in tags:
    spans = tag.find_all('span')
    for span in spans:
        print(span.text)
    print(tag.find('a'))
Coagulate answered 26/8, 2018 at 13:45 Comment(0)
J
1

The accepted solution does not work for me (I hit the same error as @alextre, probably because of a version change). However, I managed to resolve it by overriding the get_text() method instead of _all_strings().

from bs4 import BeautifulSoup, NavigableString, CData, Tag
class MyBeautifulSoup(BeautifulSoup):
    def get_text(self, separator='', strip=False, types=(NavigableString,)):
        text_parts = []

        for element in self.descendants:
            if isinstance(element, NavigableString):
                text_parts.append(str(element))
            elif isinstance(element, Tag):
                if element.name == 'a' and 'href' in element.attrs:
                    text_parts.append(element.get_text(separator=separator, strip=strip))
                    text_parts.append('(' + element['href'] + ')')
                elif isinstance(element, types):
                    text_parts.append(element.get_text(separator=separator, strip=strip))

        return separator.join(text_parts)
Jovial answered 22/6, 2023 at 13:36 Comment(1)
Your last elif seems to be impossible to reach, because you already checked that it's of type Tag before entering the last if/elif block? – Sola
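Following up on the comment above, here is a sketch of the same get_text() override with the unreachable branch removed. It also skips strings inside an anchor, so the link text is emitted only once. The class name LinkPreservingSoup is hypothetical, and the Markdown-style [text](href) output is an assumption based on the question's stated goal of producing markdown.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


class LinkPreservingSoup(BeautifulSoup):  # hypothetical name
    def get_text(self, separator="", strip=False, **kwargs):
        # **kwargs swallows the `types` argument, which this sketch ignores.
        parts = []
        for element in self.descendants:
            if isinstance(element, Tag):
                if element.name == "a" and element.has_attr("href"):
                    label = element.get_text(separator=separator, strip=strip)
                    parts.append("[{}]({})".format(label, element["href"]))
                continue
            if type(element) is not NavigableString:
                continue  # skip comments and other special strings
            if element.find_parent("a", href=True) is not None:
                continue  # already emitted as part of the link above
            text = element.strip() if strip else str(element)
            if text:
                parts.append(text)
        return separator.join(parts)


# Shortened version of the question's example (style attribute trimmed).
example_html = (
    '<td><font><span>Hello</span><span>World</span></font><br>'
    '<span>Foo Bar <span>Baz</span></span><br>'
    '<span>Example Link: <a href="https://google.com" target="_blank">Google</a></span></td>'
)

soup = LinkPreservingSoup(example_html, "html.parser")
plain = soup.get_text()
spaced = soup.get_text(" ", strip=True)
print(spaced)
```

Only the soup object itself is an instance of the subclass, so the override applies to soup.get_text() but not to get_text() called on tags found inside it.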
L
0

In case someone wants to avoid overriding or decorating classes: a good enough approach, in my opinion, is to iterate (recursively) through all the descendants of a root element and, for each <a> link, append a child element (a span, for example) containing the link reference, before doing whatever get_text() operations are needed. So, using the OP's example:

from bs4 import BeautifulSoup, Tag

example_html = '<td><font><span>Hello</span><span>World</span></font><br><span>Foo Bar <span>Baz</span></span><br><span>Example Link: <a href="https://google.com" target="_blank" style="mso-line-height-rule: exactly;-ms-text-size-adjust: 100%;-webkit-text-size-adjust: 100%;color: #395c99;font-weight: normal;text-decoration: underline;">Google</a></span></td>'

soup = BeautifulSoup(example_html, 'html.parser')

for el in soup.descendants:
    if isinstance(el, Tag):
        if el.name == 'a' and 'href' in el.attrs:
            new_span = soup.new_tag('span')
            new_span.string = '<a href="' + el.attrs['href'] + '">' + el.get_text() + '</a>'
            el.clear()  # if we want to "replace" and not just append
            el.append(new_span)  # append() avoids insert() keyword names, which differ between bs4 versions

print(soup.get_text())  # HelloWorldFoo Bar BazExample Link: <a href="https://google.com">Google</a>

Note that in real life you might find different kinds of links (href): #hello (anchors), javascript: (tricky stuff), /hello-world (relative URLs, i.e. without protocol and domain)... So you might want to do something about it.
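One way to handle those cases is sketched below using the standard library's urllib.parse. The helper name normalize_href, the base URL, and the policy itself (drop fragment-only and non-web links, resolve relative ones) are assumptions for illustration; adjust them to what your archive needs.

```python
from urllib.parse import urljoin, urlsplit

# Assumed policy: schemes worth keeping; anything else (e.g. javascript:) is dropped.
ALLOWED_SCHEMES = ("http", "https", "mailto")


def normalize_href(href, base_url):
    """Return an absolute URL for href, or None if the link should be dropped.

    Hypothetical helper; base_url is the address the page was served from.
    """
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # empty or in-page anchor
    scheme = urlsplit(href).scheme.lower()
    if scheme and scheme not in ALLOWED_SCHEMES:
        return None  # javascript:, data:, and similar
    # Relative URLs are resolved against the base; absolute ones pass through.
    return urljoin(base_url, href)


base = "https://example.com/docs/page.html"  # placeholder base URL
for raw in ("#hello", "javascript:void(0)", "/hello-world", "https://google.com"):
    print(repr(raw), "->", normalize_href(raw, base))
```

The result can be used in place of el.attrs['href'] when building the replacement text, skipping the link markup entirely when the helper returns None.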

Lenard answered 28/8 at 8:49 Comment(0)
