BS4 replace_with result is no longer in tree

I need to replace multiple words in an HTML document. At the moment I am doing this by calling replace_with once for each replacement. Calling replace_with twice on the same NavigableString raises a ValueError (see the example below) because the replaced element is no longer in the tree.

Minimal example

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re
def test1():
  html = \
  '''
    Identify
  '''
  soup = BeautifulSoup(html,features="html.parser")
  for txt in soup.findAll(text=True):
    if re.search('identify',txt,re.I) and txt.parent.name != 'a':
      newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
      txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
      txt.replace_with(BeautifulSoup(newtext, features="html.parser"))
      # I called it twice here to make the code as small as possible.
      # Usually it would be a different newtext ..
      # which was created using the replaced txt looking for a different word to replace.        

  return soup
print(test1())

Expected Result:

txt is replaced in the tree by the content of newtext.

Result:

ValueError: Cannot replace one element with another when the element to be replaced is not
part of the tree.

An easy solution would be to build up newtext first and replace everything at once at the end, but I would like to understand why this happens.

Conger answered 15/8, 2020 at 8:42

The first txt.replace_with(...) removes the NavigableString (stored here in the variable txt) from the document tree, which sets txt.parent to None.

The second txt.replace_with(...) checks the parent property, finds None (because txt has already been removed from the tree), and raises a ValueError.
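The two steps above can be observed directly. This is a minimal sketch, assuming a one-paragraph document made up for illustration (the `<p>` wrapper and the replacement strings are not from the question):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Identify</p>", features="html.parser")
txt = soup.p.string                     # the NavigableString inside <p>
parent_before = txt.parent.name         # txt is still attached to <p>

txt.replace_with("first replacement")   # detaches txt from the tree
parent_after = txt.parent               # txt no longer has a parent

try:
    txt.replace_with("second replacement")
    second_call_failed = False
except ValueError:
    # txt is not part of the tree any more, so there is nothing to replace
    second_call_failed = True

print(parent_before, parent_after, second_call_failed)
print(soup)
```

The tree keeps the first replacement; only the variable txt is left pointing at the detached string.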

As you said at the end of your question, one solution is to call .replace_with() only once:

import re
from bs4 import BeautifulSoup

def test1():
    html = \
    '''
    word1 word2 word3 word4
    '''
    soup = BeautifulSoup(html,features="html.parser")

    for txt in soup.findAll(text=True):
        if re.search('word1', txt, flags=re.I) and txt.parent.name != 'a':
            newtext = re.sub('word1', '<a href="test.html"> test1 </a>', txt.lower())
            
            # ...some computations

            newtext = re.sub('word3', '<a href="test.html"> test2 </a>', newtext)

            # ...some more computations

            # and at the end, replace txt only once:
            txt.replace_with(BeautifulSoup(newtext, features="html.parser"))

    return soup
print(test1())

Prints:

<a href="test.html"> test1 </a> word2 <a href="test.html"> test2 </a> word4
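If one replacement per word is still preferred, an alternative is to fetch the text nodes again on every pass, so that each replace_with call acts on a string that is currently in the tree. The following is only a sketch: the helper name link_words, the replacements mapping, and the sample HTML are hypothetical and not part of the original code.

```python
import re
from bs4 import BeautifulSoup

def link_words(html, replacements):
    """Hypothetical helper: wrap each word in a link, one pass per word."""
    soup = BeautifulSoup(html, features="html.parser")
    for word, href in replacements.items():
        pattern = re.compile(re.escape(word), flags=re.I)
        # Query the tree again for every word, so we never reuse a
        # NavigableString that an earlier pass has already detached.
        for txt in soup.find_all(string=True):
            if txt.parent.name == "a":
                continue  # skip text that is already inside a link
            if pattern.search(txt):
                new_html = pattern.sub(f'<a href="{href}">{word}</a>', txt)
                txt.replace_with(BeautifulSoup(new_html, features="html.parser"))
    return soup

soup = link_words("<p>word1 word2 word3</p>",
                  {"word1": "a.html", "word3": "b.html"})
links = [a["href"] for a in soup.find_all("a")]
print(soup)
```

This parses more fragments than the single-replacement version, so it is a trade of efficiency for keeping each replacement independent.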
Cutaway answered 15/8, 2020 at 10:29 Comment(2)
Thank you very much! Could you explain where the replacement ends up afterwards? I always imagined txt.replace_with(new) would replace the spot in the tree where txt stands with new. Is it that txt refers not to a position in the tree but to the content that has now been removed, so the variable txt doesn't refer to the new replacement? - Conger
@Conger Yes, txt is the content that gets replaced with the new content. After that, txt refers to content that is outside the tree; it doesn't refer to the new content. - Cutaway
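To keep working with the replacement, hold a reference to the new node rather than the old one. A minimal sketch, assuming made-up sample text and the variable names old and new:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>word1 word2</p>", features="html.parser")
old = soup.p.string
new = NavigableString("word1 replaced")

returned = old.replace_with(new)    # returns the element that was removed

old_is_detached = old.parent is None    # old now lives outside the tree
new_parent = new.parent.name            # new sits where old used to be

new.replace_with("final")           # works, because new is in the tree
print(soup)
```

Here replace_with hands back the removed string, not the inserted one, which is why reusing txt in the question fails.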