Find next siblings until a certain one using beautifulsoup

The webpage is something like this:

<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>

<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>

How can I find each section together with the articles in it? That is, after finding an h2, collect its next siblings until the next h2.

If the webpage were structured like this (which is normally the case):

<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

I could write code like:

for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...

But what should I do with the first webpage if I want to get the same result?

Queensland answered 25/7, 2012 at 10:5 Comment(1)
Is it a Wikipedia page? - Apfel

I think you can do something like this:

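for section in soup.findAll('h2'):
    nextNode = section.nextSibling
    while nextNode is not None:
        # Text nodes (such as the newlines between tags) have no tag name
        tag_name = getattr(nextNode, "name", None)
        if tag_name == "p":
            print(nextNode.string)
        elif tag_name is not None:
            # Any other tag, such as the next h2, ends the section
            break
        nextNode = nextNode.nextSibling
    print("*****")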

Given:

<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>

Output:

article1
article2
article3
*****
article4
article5
article6
*****
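
To keep each article attached to its section instead of only printing them, the same sibling walk could collect the results into a dictionary. This is a minimal sketch, assuming BeautifulSoup 4 (bs4) with the built-in html.parser; the function name group_by_section and the HTML constant are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the example above (illustrative constant).
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""


def group_by_section(soup):
    """Map the text of each h2 to the texts of the p tags that follow it."""
    sections = {}
    for heading in soup.find_all("h2"):
        articles = []
        node = heading.next_sibling
        while node is not None:
            # Only real tags matter; whitespace text nodes are skipped.
            if isinstance(node, Tag):
                if node.name != "p":
                    break  # the next h2 (or any other tag) ends this section
                articles.append(node.get_text())
            node = node.next_sibling
        sections[heading.get_text()] = articles
    return sections


soup = BeautifulSoup(HTML, "html.parser")
sections = group_by_section(soup)
for title, articles in sections.items():
    print(title, articles)
```

Each key is a section title and each value is the list of its articles, which is the same grouping the div-based loop in the question would give.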
Pisces answered 25/7, 2012 at 11:35 Comment(4)
Thank you. This does separate the sections, but it doesn't seem to make the articles belong to a particular section. I would like something that gives the same result as the first example I gave. - Queensland
@Queensland Please check the solution again. I think it does separate the articles according to their section. - Pisces
I'm actually using Calibre to make a recipe for a webpage I want to download. This involves identifying each section and the articles within it (after which they are converted into an e-book). The solution you gave seems to treat the section name and the articles as the same thing. - Queensland
@Queensland "soup.findAll('h2')" gives you all the sections, which are distinct from the articles, while "section.nextSibling" gives you the next node, which you then check to see whether it is an article (a <p> tag) or a section. I assumed the structure is the same as the one you provided (only <h2> and <p>). You get both sections and articles; it is up to you whether to treat them the same or separately. I hope this clears up the confusion. - Pisces

The next_siblings iterator can be helpful here as well:

for i in soup.find_all('h2'):
    for sib in i.next_siblings:
        if sib.name == 'p':
            print(sib.text)
        elif sib.name == 'h2':
            print("*****")
            break
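
If the goal is to reuse the nested-loop shape from the question, next_siblings could be wrapped in a small generator that yields each h2 together with its p tags. This is only a sketch, assuming BeautifulSoup 4 (bs4) with html.parser; iter_sections is a hypothetical helper name, not part of BeautifulSoup, and the sample markup is illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative markup in the flat h2/p layout from the question.
HTML = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>

<h2>section2</h2>
<p>article3</p>
"""


def iter_sections(soup):
    """Yield (h2_tag, [p_tags]) pairs, stopping each group at the next h2."""
    for heading in soup.find_all("h2"):
        paragraphs = []
        for sib in heading.next_siblings:
            # Text nodes between tags have no tag name, so default to None.
            name = getattr(sib, "name", None)
            if name == "h2":
                break
            if name == "p":
                paragraphs.append(sib)
        yield heading, paragraphs


soup = BeautifulSoup(HTML, "html.parser")
result = []
for section, posts in iter_sections(soup):
    # Same shape as looping over div sections and their p children.
    result.append((section.get_text(), [post.get_text() for post in posts]))
print(result)
```

Because the generator yields real Tag objects, the inner loop can do whatever the div-based version would have done with each p.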
Antibiotic answered 11/1, 2020 at 16:48 Comment(0)
