How do I scrape a page with Python's urlopen after it has finished loading all of its search results?

import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = 'http://flight.qunar.com/site/oneway_list.htm'
values = {'searchDepartureAirport':'北京', 'searchArrivalAirport':'丽江', 'searchDepartureTime':'2012-07-25'}
encoded_param = urllib.parse.urlencode(values)
full_url = url + '?' + encoded_param
response = urllib.request.urlopen(full_url)
soup = BeautifulSoup(response)
print(soup.prettify())

The problem is actually quite hard: the site loads its content dynamically via JavaScript, whereas urllib only gets what you would see in a browser with JavaScript disabled. So, what can we do?
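To make the limitation concrete, here is a minimal sketch using only the standard library. The markup in STATIC_HTML is made up for illustration (it is not the real qunar.com page); it stands in for what urlopen might return when the results are injected by a script. The parser sees the empty container, because nothing executes the script.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for what urlopen() might receive:
# the results container is empty and a script would fill it in later.
STATIC_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML = "<p>Flight CA1234</p>";
  </script>
</body></html>
"""


class ResultsText(HTMLParser):
    """Collect the text found inside <div id="results">.

    Simplification: nesting is tracked by counting start and end tags,
    so unclosed void elements such as <br> inside the div would skew it.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside the results div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif tag == "div" and ("id", "results") in attrs:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def results_text(html):
    """Return the list of text chunks inside the results container."""
    parser = ResultsText()
    parser.feed(html)
    return parser.chunks


# The script is never run, so the container stays empty.
print(results_text(STATIC_HTML))
```

A browser would run the script and show the flight; a plain HTTP fetch followed by parsing never does, which is why the scraped page looks empty.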

Use a headless, automated browser to fully render the page (these tools are built for testing and scraping).
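As one illustration of the headless-browser route, the sketch below uses Selenium with headless Chrome. Selenium is my assumption here, not necessarily one of the tools this answer had in mind, and it needs Chrome plus a matching driver installed. The helper names and the CSS selector in the usage comment are hypothetical; you would need to inspect the real page to find an element that only appears once the results have loaded.

```python
from urllib.parse import urlencode

BASE_URL = "http://flight.qunar.com/site/oneway_list.htm"


def build_search_url(departure, arrival, date):
    """Build the search URL with the same query parameters as the question."""
    params = {
        "searchDepartureAirport": departure,
        "searchArrivalAirport": arrival,
        "searchDepartureTime": date,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_rendered_html(url, wait_selector, timeout=30):
    """Load url in headless Chrome; return the DOM once wait_selector exists."""
    # Imported here so build_search_url works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the JavaScript has inserted a matching element,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


# Usage (".flight-result" is a made-up selector, replace it with a real one):
# url = build_search_url("北京", "丽江", "2012-07-25")
# html = fetch_rendered_html(url, ".flight-result")
```

The returned HTML can then be passed to BeautifulSoup exactly as in the question. Waiting for a specific element, rather than sleeping for a fixed time, is what addresses the "after the page has finished loading" part.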

Or, if you want a (semi-)pure Python solution, use PyQt4.QtWebKit to render the page. It works approximately like this:

import sys
import signal

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

url = "http://www.stackoverflow.com"

def page_to_file(ok):
    # loadFinished passes a bool (whether the load succeeded), not the page,
    # so the page is taken from the module-level variable below.
    with open("output", "w", encoding="utf-8") as f:
        f.write(page.mainFrame().toHtml())

app = QApplication(sys.argv)  # QApplication requires the argument list
page = QWebPage()
# Restore the default SIGINT handler so Ctrl+C stops the Qt event loop.
signal.signal(signal.SIGINT, signal.SIG_DFL)
page.loadFinished.connect(page_to_file)
page.mainFrame().load(QUrl(url))
sys.exit(app.exec_())

Edit: There's a nice explanation of how this works here.

PS: You may want to look into requests instead of using urllib :)
