Python web scraping - how to get resources with beautiful soup when page loads contents via JS?

So I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. I have tried the same code with tables from other websites, and it works fine. However, with this website the table comes back as None (a NoneType object). Can someone help me with this? I've tried looking for other answers online but I'm not having much luck.

Here's the code:

import urllib.request

from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read(), "html.parser")

table = soup.find("table", attrs={'class':'sortable'})

data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)

print(data)
Chelyuskin answered 20/4, 2015 at 16:47 Comment(3)
Have you looked at the html for this page? There is no table...Otho
If you right click on the table and hit "inspect element" it shows the html w/ the table. If you right click anywhere else on the page it won't show it.Chelyuskin
You have to make a full browser request with Selenium to get the content that is generated via AJAX/JS.Weixel

It looks like this data is loaded via an AJAX call; you can see the request in your browser's network panel.

You should target that url instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php

import urllib.parse
import urllib.request

from bs4 import BeautifulSoup


# parameter values taken from the request the page itself sends (visible in the web inspector)
params = {
    "type":"team-detail",
    "league":"ncb",
    "stat_id":"3083",
    "season_id":"312",
    "cat_type":"2",
    "view":"stats_v1",
    "is_previous":"0",
    "date":"04/06/2015"
}

# POST the form-encoded parameters; the endpoint returns HTML containing the table
content = urllib.request.urlopen(
    "http://www.teamrankings.com/ajax/league/v3/stats_controller.php",
    data=urllib.parse.urlencode(params).encode('utf8')).read()
soup = BeautifulSoup(content, "html.parser")

table = soup.find("table", attrs={'class':'sortable'})

data = []
rows = table.findAll("tr")
for tr in rows:
    cols = tr.findAll("td")
    for td in cols:
        text = ''.join(td.find(text=True))
        data.append(text)

print(data)

Using your web inspector you can also view the parameters that are passed along with the POST request.

Generally the server on the other end will check for these values and reject your request if you do not have some or all of them. The above snippet ran fine for me; the original version used urllib2, but it has been updated to urllib.request and urllib.parse so it runs on Python 3.

If the data loads in your browser it is possible to scrape it. You just need to mimic the request your browser sends.
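
As a comment below suggests, the same POST can also be made with the requests library, which handles the form encoding for you. A minimal sketch, reusing the endpoint and parameters shown above:

import requests
from bs4 import BeautifulSoup

url = "http://www.teamrankings.com/ajax/league/v3/stats_controller.php"
params = {
    "type": "team-detail",
    "league": "ncb",
    "stat_id": "3083",
    "season_id": "312",
    "cat_type": "2",
    "view": "stats_v1",
    "is_previous": "0",
    "date": "04/06/2015"
}

# requests form-encodes the dict passed via data= and sends it as a POST body
response = requests.post(url, data=params)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", attrs={"class": "sortable"})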

Colwell answered 20/4, 2015 at 16:56 Comment(8)
Note that you need to POST to this url, with the set of parameters used by their codeValladolid
@FarmerJoe Thanks, but unfortunately I can't use urllib2 because I'm working with Python3.4. Can I do it without urllib2? (urlencode is not an attribute for urllib so I'm not sure what to use)Chelyuskin
@Chelyuskin I changed the code, should run on python3 now.Colwell
@Chelyuskin I am unable to test on python3 right now, did it run alright?Colwell
@FarmerJoe I just tested it, and unfortunately no - threw the following error: "POST data should be bytes or an iterable of bytes. It cannot be of type str."Chelyuskin
@QwErTy99: Have you considered using the requests module?Valladolid
urllib.parse.urlencode(params).encode('utf8') should fix that crashValladolid
@Valladolid Success! Works like I wanted it to. Thanks!Chelyuskin

The table on that website is being created via JavaScript, and so does not exist when you simply throw the source code at BeautifulSoup.

Either you need to start poking around with your web inspector of choice and find out where the JavaScript is getting the data from, or you should use something like Selenium to run a complete browser instance, as sketched below.
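
A minimal sketch of the Selenium route, assuming Chrome and the standard selenium package; the wait condition and class selector simply mirror the table targeted above:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct")
    # wait until the JavaScript has actually inserted the table into the DOM
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table.sortable")))
    html = driver.page_source
finally:
    driver.quit()

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "sortable"})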

Valladolid answered 20/4, 2015 at 16:55 Comment(3)
How would I go about getting the table if it's coded in Javascript? Sorry for seeming naive, but I'm relatively new to coding.Chelyuskin
This is correct: when you request this page's resources via Beautiful Soup, you only get the boilerplate page. The code in that page later fetches the data from an API via JavaScript.Blackamoor
@Chelyuskin you need a headless browser for that. Check out jeanphix.me/Ghost.pyAsparagine

Since the table data is loaded dynamically, there can be some lag before the table is populated, for reasons such as network delay. So you can wait by adding a delay before reading the data. Check whether the table data is empty (i.e. its length is zero); if so, read the table data again after some delay (see the sketch below). This will help.

Also, I looked at the URL you are using: since you select the table by its class, check whether that class appears elsewhere in the HTML.
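
A minimal sketch of that wait-and-retry idea, assuming a hypothetical fetch_table() helper (for example, one built on the AJAX or Selenium approaches above) that returns the parsed table or None:

import time

def get_table_with_retry(fetch_table, attempts=5, delay=2.0):
    # call the (hypothetical) fetch_table() helper up to `attempts` times,
    # sleeping `delay` seconds between tries, until it yields a non-empty table
    for _ in range(attempts):
        table = fetch_table()
        if table is not None and table.find_all("tr"):
            return table
        time.sleep(delay)
    return None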

Hardball answered 20/4, 2015 at 17:3 Comment(0)
