Find next siblings until a certain one using beautifulsoup

Asked 25/7, 2012 at 10:5 Answered 11/1, 2020 at 16:48

Solved python web-scraping beautifulsoup find siblings

The webpage is something like this:

<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>

<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>

How can I find each section together with the articles in it? That is, after finding an h2, collect its next siblings until the next h2.

If the webpage looked like this (which is normally the case):

<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

I could write code like:

for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...

But what should I do with the first webpage if I want to get the same result?

Queensland answered 25/7, 2012 at 10:5 Comment(1)

Is it a Wikipedia page? – Apfel 22/12, 2018 at 9:27

I think you can do something like this:

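# soup is the BeautifulSoup object for the page
for section in soup.findAll('h2'):
    nextNode = section
    while True:
        nextNode = nextNode.nextSibling
        # Text nodes (e.g. whitespace between tags) and None have no tag name
        tag_name = getattr(nextNode, 'name', None)
        if tag_name == "p":
            print(nextNode.string)
        elif nextNode is not None and tag_name is None:
            continue  # skip text nodes between tags
        else:
            # Reached the next heading or the end of the siblings
            print("*****")
            break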

Given:

<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>

Output:

article1
article2
article3
*****
article4
article5
article6
*****

Pisces answered 25/7, 2012 at 11:35 Comment(4)

Thank you. This does indeed separate the sections, but it doesn't seem to make the articles belong to a particular section. I would like something that gives the same result as the first example I gave. – Queensland 27/7, 2012 at 3:45

@Queensland Please check the solution again. I think this solution separates articles according to their section. – Pisces 27/7, 2012 at 4:13

I'm actually using Calibre to make a recipe for a webpage I want to download. This involves identifying the section and articles within the section (after which they are converted into an e-book). The solution you gave seems to treat section name and articles as the same. – Queensland 27/7, 2012 at 4:22

@Queensland "soup.findAll('h2')" gives you all the sections, which are not the same as the articles. "section.nextSibling" then gives you the next node, which you check to see whether it is an article (a <p> tag) or a section. I assumed the structure is the same as the one you provided (only <h2> and <p>). You get both sections and articles; it is up to you whether to treat them the same or separately. I hope this clears up the confusion. – Pisces 27/7, 2012 at 6:33
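To get the articles attached to their section, as the nested div loop in the question does, one option is a single pass over the h2 and p tags in document order. The following is only a sketch: it assumes bs4 (BeautifulSoup 4) with the built-in "html.parser", the function name group_articles is made up for illustration, and it picks up every <p> in the document, not only direct siblings of the headings.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (flat structure, no wrapping <div>).
html = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def group_articles(soup):
    """Return a list of (section_title, [article_text, ...]) pairs.

    Hypothetical helper, not part of BeautifulSoup.
    """
    sections = []
    # find_all with a list of names returns the matches in document order.
    for tag in soup.find_all(["h2", "p"]):
        if tag.name == "h2":
            sections.append((tag.get_text(), []))
        elif sections:  # ignore any <p> that appears before the first <h2>
            sections[-1][1].append(tag.get_text())
    return sections


soup = BeautifulSoup(html, "html.parser")
grouped = group_articles(soup)
for title, articles in grouped:
    print(title, articles)
```

Each pair can then be handled like one of the div blocks in the second layout: the title is the section, the list holds its articles.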

The next_siblings iterator can be helpful here as well:

for i in soup.find_all('h2'):
    for sib in i.next_siblings:
        if sib.name == 'p':
            print(sib.text)
        elif sib.name == 'h2':
            print("*****")
            break
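If the "siblings until the next h2" step is needed in several places, it could be wrapped in a small generator built on itertools.takewhile. This is a sketch under assumptions: siblings_until is a hypothetical helper name, the sample markup is shortened from the question, and text nodes between tags are skipped with an isinstance check against bs4's Tag class.

```python
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import Tag


def siblings_until(start, stop_name):
    """Yield the Tag siblings after `start`, stopping before the first
    sibling tag named `stop_name`. Hypothetical helper for illustration."""

    def not_stop(node):
        return not (isinstance(node, Tag) and node.name == stop_name)

    for node in takewhile(not_stop, start.next_siblings):
        if isinstance(node, Tag):  # skip whitespace and other text nodes
            yield node


html = (
    "<h2>section1</h2>\n<p>article1</p>\n<p>article2</p>\n"
    "<h2>section2</h2>\n<p>article3</p>"
)
soup = BeautifulSoup(html, "html.parser")

# Map each heading's text to the text of the <p> tags that follow it.
result = {
    h2.get_text(): [p.get_text() for p in siblings_until(h2, "h2") if p.name == "p"]
    for h2 in soup.find_all("h2")
}
print(result)
```

Unlike the loop above, this stops silently at the end of the siblings as well as at the next h2, so the last section needs no special handling.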

Antibiotic answered 11/1, 2020 at 16:48 Comment(0)
