How can I parse a website using Selenium and BeautifulSoup in Python? [closed]

Asked 19/12, 2012 at 20:6 Answered 19/12, 2012 at 20:19

I'm new to programming and have figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now, but I'm not sure where to start. Can someone hold my hand for a sec and point me in the right direction?

Any help appreciated.

Backspace answered 19/12, 2012 at 20:6 Comment(0)

165

Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:

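from bs4 import BeautifulSoup
from selenium import webdriver

# Start a browser session and load the page
driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')

# Selenium exposes the current page's HTML as a string
html = driver.page_source

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('title'):
    print(tag.text)

Output:

Hacker News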

Blandina answered 19/12, 2012 at 20:19 Comment(8)

@root Haha, a nice holiday exchange. – Blandina 19/12, 2012 at 20:23

@Blandina - This is the error I get at soup = BeautifulSoup(html): NameError: name 'html' is not defined. Any suggestions? – Backspace 19/12, 2012 at 21:5

@twitchaftercoffee So in the code above, html refers to the source of the page. Whenever you reach your page, your driver object will have an attribute called page_source, and the code above assigns that value to html. Note that this step isn't really necessary as you could just pass driver.page_source directly to BeautifulSoup (as root did above). – Blandina 19/12, 2012 at 21:7

@Blandina - Worked, doesn't toss up errors, but doesn't actually print anything – Backspace 19/12, 2012 at 21:15

@twitchaftercoffee So the example up there looks for a title tag, so in the odd case the page doesn't have one then nothing will show. Try running print soup.prettyify() - do you see anything? – Blandina 19/12, 2012 at 21:19

Make that soup.prettify()... – Blandina 19/12, 2012 at 22:0
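A minimal sketch of the check suggested in the two comments above, run on literal HTML strings so that it needs no browser. The helper name `first_title` and the sample markup are assumptions made for illustration; in the thread's setup the string would come from `driver.page_source`.

```python
from bs4 import BeautifulSoup


def first_title(html):
    """Return the text of the first <title> tag, or None if there is none."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('title')
    return tag.get_text() if tag is not None else None


with_title = '<html><head><title>Hacker News</title></head><body></body></html>'
without_title = '<html><body><p>No title here</p></body></html>'

print(first_title(with_title))
print(first_title(without_title))

# When nothing is found, dumping the parsed tree shows what was actually received.
print(BeautifulSoup(without_title, 'html.parser').prettify())
```

If `first_title(driver.page_source)` returns `None`, the loop in the answer has nothing to iterate over, which would explain a run that finishes without printing anything.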

@Blandina I want to do the opposite: select an element using BeautifulSoup and then perform an action on it using the Chrome driver. How can I do this? – Whitacre 24/1, 2017 at 11:3

selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. – Sporty 12/4, 2020 at 1:46
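The error in the comment above means Selenium could not find the Firefox driver binary. Below is a small sketch, using only the standard library, for checking whether a driver executable is visible on PATH before starting the browser. The helper name `driver_on_path` is hypothetical, and newer Selenium releases may locate or download a driver on their own, so treat this as a diagnostic rather than a required step.

```python
import shutil


def driver_on_path(name='geckodriver'):
    """Return the full path of the executable if PATH contains it, else None."""
    return shutil.which(name)


location = driver_on_path()
if location is None:
    print("geckodriver not found on PATH; install it or add its folder to PATH")
else:
    print("geckodriver found at", location)
```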

As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.

from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(browser.page_source, 'html.parser')

# do something useful
# prints all the links with corresponding text
for link in soup.find_all('a'):
    print(link.get('href'), link.get_text())

Ferwerda answered 19/12, 2012 at 20:18 Comment(6)

+1, didn't see this come up as I was typing :) – Blandina 19/12, 2012 at 20:20

For this, I got soup=BeautifulSoup(browser.page_source) NameError: name 'browser' is not defined – Backspace 19/12, 2012 at 20:51

The code is OK: browser=webdriver.Firefox() defines browser. Just copy the code directly; you must have made a mistake. – Ferwerda 19/12, 2012 at 21:8

@Ferwerda - got it, but did not print anything. Running it outside of python by python xx.py – Backspace 19/12, 2012 at 21:12

soup=BeautifulSoup(browser.page_source) it's the same with chrome – Ferwerda 19/12, 2012 at 21:16

@Ferwerda I want to do the opposite: select an element using BeautifulSoup and then perform an action on it using the Chrome driver. How can I do this? – Whitacre 24/1, 2017 at 11:4
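One possible way to do what the last comment asks, sketched under assumptions. BeautifulSoup works on a copy of the HTML, so an element found in the soup cannot be clicked directly; instead, something that identifies it (here its `href`) is read from the soup and used to build a locator for the driver. The helper names `selector_for_link` and `click_link` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def selector_for_link(html, link_text):
    """Find an <a> by its exact text and return a CSS selector for it, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.find('a', string=link_text)
    if link is None or not link.get('href'):
        return None
    # Assumes the href contains no double quotes; otherwise it would need escaping.
    return 'a[href="%s"]' % link['href']


def click_link(driver, link_text):
    """Locate the link via BeautifulSoup, then click it through the live driver."""
    from selenium.webdriver.common.by import By  # imported lazily; needs selenium

    selector = selector_for_link(driver.page_source, link_text)
    if selector is None:
        return False
    driver.find_element(By.CSS_SELECTOR, selector).click()
    return True


sample = '<ul><li><a href="/new">new</a></li><li><a href="/jobs">jobs</a></li></ul>'
print(selector_for_link(sample, 'jobs'))
```

With a running session this would be called as `click_link(driver, 'jobs')`, and it should behave the same way with `webdriver.Chrome()`. If the page changes between reading `page_source` and the click, the selector may no longer match.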

Are you sure you want to use Selenium? For this reason I used PyQt4; it's very powerful, and you can do whatever you want.

Here is some sample code that I just wrote; just change the URL and you're good to go:

#! /usr/bin/env python2.7
# Written for Python 2.7 and PyQt4 (note the print statements and unicode())

from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup
import sys, signal

class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        # Report loading progress as a percentage
        print str(progress) + "%"

    def _loadFinished(self):
        # Called when the page has finished loading: parse and dump the HTML
        print "Load Finished"
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print soup.prettify()
        self.close()

if __name__ == "__main__":
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    # Restore the default SIGINT handler so Ctrl+C can stop the Qt event loop
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    sys.exit(app.exec_())

Humic answered 19/12, 2012 at 20:14 Comment(5)

I have found PyQt4 a humongous pain to use. Depending on OP's requirements, just using BeautifulSoup is probably a lot easier. – Nicker 19/12, 2012 at 20:14

What do you mean by "just using BeautifulSoup is probably a lot easier"? – Humic 19/12, 2012 at 20:17

OP here, Beautiful soup allowed me to nav to the section I want to parse very easy. I'd prefer to stick with it if possible. – Backspace 19/12, 2012 at 20:48

I'd love to use pyqt4 instead of selenium - it's so much faster. but when I install it via windows binary - and try and import it and run that code, it can't find the library. Please help – Illegitimate 19/5, 2014 at 4:52

@Humic I am looking for a way to port my CLI Selenium tool to a GUI. Can an embedded browser control in PyQt be accessed by Selenium? – Kermit 16/6, 2016 at 19:19
