I'm new to programming and have figured out how to navigate to where I need to go using Selenium. I'd like to parse the data now, but I'm not sure where to start. Can someone hold my hand for a sec and point me in the right direction?
Any help appreciated.
Assuming you are on the page you want to parse, Selenium stores the source HTML in the driver's page_source attribute. You would then load the page_source into BeautifulSoup as follows:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://news.ycombinator.com')

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for tag in soup.find_all('title'):
    print(tag.text)
Hacker News
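For reference, the same flow can be wrapped in two small functions. This is only a sketch under some assumptions: it targets Python 3 with Selenium 4, it names the parser explicitly ('html.parser' from the standard library), and the helper names fetch_page_source and extract_titles are made up for illustration. If Firefox fails to start with "'geckodriver' executable needs to be in PATH", the driver binary has to be on PATH or its location passed in explicitly, which is what the optional geckodriver_path argument is for.

```python
from bs4 import BeautifulSoup


def fetch_page_source(url, geckodriver_path=None):
    """Open url in Firefox and return the page HTML (hypothetical helper)."""
    # Imported here so the parsing helper below can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    if geckodriver_path:
        # Assumption: Selenium 4, where the driver location is passed via Service.
        driver = webdriver.Firefox(service=Service(executable_path=geckodriver_path))
    else:
        driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def extract_titles(html):
    """Return the text of every <title> tag found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]


# Example usage (needs Firefox and geckodriver installed):
#     html = fetch_page_source("http://news.ycombinator.com")
#     print(extract_titles(html))
```

Keeping the parsing step separate from the browser step means the parsing can be tried out on any saved HTML string before a browser is involved.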
html refers to the source of the page. Whenever you reach your page, your driver object will have an attribute called page_source, and the code above assigns that value to html. Note that this step isn't really necessary, as you could just pass driver.page_source directly to BeautifulSoup (as root did above). – Blandina
title tag, so in the odd case that the page doesn't have one, nothing will show. Try running print soup.prettify() - do you see anything? – Blandina
soup.prettify() ... – Blandina
selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH. – Sporty
As your question isn't particularly concrete, here's a simple example. To do something more useful, read the BS docs. You will also find plenty of examples of Selenium (and BS) usage here on SO.
from selenium import webdriver
from bs4 import BeautifulSoup

browser = webdriver.Firefox()
browser.get('http://webpage.com')

soup = BeautifulSoup(browser.page_source, 'html.parser')

# do something useful
# prints all the links with corresponding text
for link in soup.find_all('a'):
    print(link.get('href', None), link.get_text())
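As a possible next step after printing the links, the sketch below collects them as (URL, text) pairs and resolves relative hrefs against the page address. It assumes Python 3; extract_links and the sample HTML string are hypothetical names and data used only for illustration, not part of the answer above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (absolute_url, link_text) pairs for every <a> that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for a in soup.find_all("a", href=True):
        links.append((urljoin(base_url, a["href"]), a.get_text(strip=True)))
    return links


# Made-up sample markup standing in for browser.page_source.
sample = (
    '<a href="/news">News</a> '
    '<a name="top">no href</a> '
    '<a href="http://example.com/x">X</a>'
)
for url, text in extract_links(sample, "http://webpage.com/"):
    print(url, text)
```

With Selenium, the same function could be called as extract_links(browser.page_source, browser.current_url).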
browser=webdriver.Firefox() defines browser. Just copy the code directly... you must have made a mistake. – Ferwerda
soup=BeautifulSoup(browser.page_source) it's the same with Chrome – Ferwerda
Are you sure you want to use Selenium? For this reason I used PyQt4; it's very powerful, and you can do whatever you want.
I can give you some sample code that I just wrote; just change the URL and you're good to go:
#! /usr/bin/env python2.7
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *
from bs4 import BeautifulSoup
import sys, signal


class Browser(QWebView):
    def __init__(self):
        QWebView.__init__(self)
        self.loadProgress.connect(self._progress)
        self.loadFinished.connect(self._loadFinished)
        self.frame = self.page().currentFrame()

    def _progress(self, progress):
        print str(progress) + "%"

    def _loadFinished(self):
        # Called when the page has finished loading
        print "Load Finished"
        html = unicode(self.frame.toHtml()).encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        print soup.prettify()
        self.close()


if __name__ == "__main__":
    app = QApplication(sys.argv)
    br = Browser()
    url = QUrl('http://web site that can contain javascript.com')
    br.load(url)
    br.show()
    # Restore the default SIGINT handler so Ctrl+C can stop the Qt event loop
    signal.signal(signal.SIGINT, signal.SIG_DFL)
    sys.exit(app.exec_())