Parse the html code for a whole webpage scrolled down

Asked 22/6, 2015 at 14:13 Answered 25/6, 2015 at 18:47

Solved python selenium web-scraping beautifulsoup urllib

from bs4 import BeautifulSoup
import urllib,sys
reload(sys)
sys.setdefaultencoding("utf-8")
r = urllib.urlopen('https://twitter.com/ndtv').read()
soup = BeautifulSoup(r)

This returns only the initially loaded part of the page, not the whole page as it appears after scrolling down to the end, which is what I want.

EDIT:

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import urllib,sys,requests
reload(sys)
sys.setdefaultencoding("utf-8")

class wait_for_more_than_n_elements_to_be_present(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            elements = EC._find_elements(driver, self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False

def return_html_code(url):
    driver = webdriver.Firefox()
    driver.maximize_window()
    driver.get(url)
    # initial wait for the tweets to load
    wait = WebDriverWait(driver, 10)
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))
    # scroll down to the last tweet until there is no more tweets loaded
    while True:
        tweets = driver.find_elements_by_css_selector("li[data-item-id]")
        number_of_tweets = len(tweets)
        print number_of_tweets
        driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])
        try:
            wait.until(wait_for_more_than_n_elements_to_be_present((By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))
        except TimeoutException:
            break
    html_full_source=driver.page_source
    driver.close()
    return html_full_source


url='https://twitter.com/thecoolstacks'
#using selenium browser
html_source=return_html_code(url)
soup_selenium = BeautifulSoup(html_source)
print soup_selenium
text_tweet=[]
alltweets_selenium = soup_selenium.find_all(attrs={'data-item-type' : 'tweet'})
for tweet in alltweets_selenium:
    #Text of tweet
    html_tweet= tweet.find_all("p", class_="TweetTextSize TweetTextSize--16px js-tweet-text tweet-text")
    text_tweet.append(''.join(html_tweet[0].findAll(text=True)))    
print text_tweet

Intended Output:

import requests
from bs4 import BeautifulSoup

url='https://twitter.com/thecoolstacks'
req = requests.get(url)
soup = BeautifulSoup(req.content)
alltweets = soup.find_all(attrs={'data-item-type' : 'tweet'})
print alltweets[0]

Violaceous asked 22/6, 2015 at 14:13 Comment(16)

Why not use the Twitter API instead? – Fanlight 22/6, 2015 at 14:21

I'm pretty sure twitter home pages are dynamically loaded as you scroll so I don't think BS is going to be able to do that. – Bootless 22/6, 2015 at 14:22

Using Chrome with DevTools, you can see new AJAX calls being processed when scrolling down. – Mow 22/6, 2015 at 14:22

@Fanlight Sorry for not mentioning that. I have tried it. Reason for not using: Twitter API doesn't allow access to historical data for a search query and has limit ~3200 tweets for a particular user. – Violaceous 22/6, 2015 at 14:28

@IanAuld Are there any other packages or workarounds you would suggest? – Violaceous 22/6, 2015 at 14:28

@Mow Sorry, I am new to this and not familiar with what you are suggesting. Can you please provide some code, an example, a link or a tutorial? – Violaceous 22/6, 2015 at 14:30

google chrome developer tools :) – Mow 22/6, 2015 at 14:33

@Mow Can you please elaborate on this? Are you suggesting using this just to extract the HTML code? I ask because I am inclined to use bs4, since it has good functionality for parsing HTML. – Violaceous 22/6, 2015 at 14:57

Got your point. Chrome DevTools is a browser-based JavaScript tool, not something used from Python. – Mow 22/6, 2015 at 15:1

@Mow Thanks for information. Any workaround for python you may suggest please? – Violaceous 22/6, 2015 at 15:14

Related questions: 1. #25871406 2. #19804463 3. #20796553 – Mow 22/6, 2015 at 15:36

@Mow Thanks! But none of them seem to work. – Violaceous 22/6, 2015 at 20:39

A selenium webdriver does seem like a good option here; but you don't want browser.page_source, that's the source html for the page, not the html of what is currently showing – Diphase 22/6, 2015 at 21:19

@Diphase What would you suggest as a workaround? I haven't been able to come up with a solution so far. – Violaceous 22/6, 2015 at 21:57

@AbhishekBhatia Use selenium to set up a browser, and control the scrollbar in your code – Mow 23/6, 2015 at 2:21

@Mow Thanks, for the suggestion! This seems to look like the only possible approach from stuff found on the web. But a starting code would be greatly helpful. I say this because the approaches don't work besides not rendering the full html code. – Violaceous 23/6, 2015 at 5:24

I would still insist on using the Twitter API.

Alternatively, here is how you can approach the problem with selenium:

use Explicit Waits and define a custom Expected Condition that waits for more tweets to load after each scroll
scroll to the last loaded tweet via scrollIntoView()

Implementation:

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class wait_for_more_than_n_elements_to_be_present(object):
    def __init__(self, locator, count):
        self.locator = locator
        self.count = count

    def __call__(self, driver):
        try:
            # use the public driver API rather than the private EC helper
            elements = driver.find_elements(*self.locator)
            return len(elements) > self.count
        except StaleElementReferenceException:
            return False


url = "https://twitter.com/ndtv"
driver = webdriver.Firefox()
driver.maximize_window()
driver.get(url)

# initial wait for the tweets to load
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "li[data-item-id]")))

# scroll down to the last tweet until no more tweets are loaded
while True:
    tweets = driver.find_elements(By.CSS_SELECTOR, "li[data-item-id]")
    number_of_tweets = len(tweets)

    driver.execute_script("arguments[0].scrollIntoView();", tweets[-1])

    try:
        wait.until(wait_for_more_than_n_elements_to_be_present((By.CSS_SELECTOR, "li[data-item-id]"), number_of_tweets))
    except TimeoutException:
        break

This scrolls down as many times as needed to load all of the existing tweets on this account's timeline.
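
One caveat: the loop stops at the first timeout, so a single slow response ends the scroll early. Here is a minimal sketch of a more tolerant variant, assuming that several consecutive timeouts mean the timeline is exhausted. The names scroll_until_exhausted, scroll_tweets and max_retries are hypothetical helpers for illustration, not part of selenium:

```python
def scroll_until_exhausted(count_items, scroll_to_last, wait_for_more, max_retries=3):
    """Keep scrolling until no new items show up for max_retries waits in a row.

    count_items()    -> number of items currently in the page
    scroll_to_last() -> scrolls the last loaded item into view
    wait_for_more(n) -> True if more than n items appeared before a timeout
    """
    retries = 0
    while retries < max_retries:
        seen = count_items()
        scroll_to_last()
        if wait_for_more(seen):
            retries = 0       # progress was made, reset the counter
        else:
            retries += 1      # timed out, try again a few more times
    return count_items()


def scroll_tweets(driver, wait, selector="li[data-item-id]", max_retries=3):
    """Wire the generic loop to a selenium driver and a WebDriverWait.

    Assumes at least one matching element is already present.
    """
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By

    def find():
        return driver.find_elements(By.CSS_SELECTOR, selector)

    def wait_for_more(n):
        try:
            wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, selector)) > n)
            return True
        except TimeoutException:
            return False

    return scroll_until_exhausted(
        count_items=lambda: len(find()),
        scroll_to_last=lambda: driver.execute_script(
            "arguments[0].scrollIntoView();", find()[-1]),
        wait_for_more=wait_for_more,
        max_retries=max_retries,
    )
```

With max_retries=1 this behaves like the original loop. Larger values trade a longer final wait for fewer premature stops.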

Here is the HTML-parsing snippet, extracting tweets:

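from bs4 import BeautifulSoup

# grab the page source after the scroll loop, before closing the browser
page_source = driver.page_source
driver.close()

# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page_source, "html.parser")
for tweet in soup.select("div.tweet div.content"):
    print(tweet.p.text)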

It prints:

Father's Day Facebook post by arrested cop Suhas Gokhale's son got nearly 10,000 likes http://goo.gl/aPqlxf  pic.twitter.com/JUqmdWNQ3c
#HWL2015 End of third quarter! Breathtaking stuff. India 2-2 Pakistan - http://sports.ndtv.com/hockey/news/244463-hockey-world-league-semifinal-india-vs-pakistan-antwerp …
Why these Kashmiri boys may miss their IIT dream http://goo.gl/9LVKfK  pic.twitter.com/gohX21Gibi
...
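
If the raw HTML of each tweet is wanted instead of its text, one selenium-only option is to read each element's outerHTML through get_attribute(). A small sketch, where the helper name elements_to_html is hypothetical and the call has to happen before driver.close():

```python
def elements_to_html(elements):
    """Return the outer HTML string of each WebElement-like object."""
    return [element.get_attribute("outerHTML") for element in elements]


# Usage with a live driver (assumes the scroll loop has already finished):
# from selenium.webdriver.common.by import By
# tweets = driver.find_elements(By.XPATH, "//li[@data-item-type='tweet']")
# html_snippets = elements_to_html(tweets)
```

Each returned string can then be handed to BeautifulSoup for further parsing.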

Fanlight answered 25/6, 2015 at 18:47 Comment(12)

Great! Thanks so much. I wished to convert the html_code read using driver.page_source to soup using soup = BeautifulSoup(driver.page_source). But soup doesn't contain the full page source. Can you please advise where I am going wrong. – Violaceous 25/6, 2015 at 19:46

please check the above doubt. – Violaceous 25/6, 2015 at 20:13

@AbhishekBhatia sure, are you calling the soup = BeautifulSoup(driver.page_source) after the while loop finished the job? – Fanlight 25/6, 2015 at 20:47

@AbhishekBhatia yeah, just get the page source before closing the browser - before calling driver.close(). – Fanlight 25/6, 2015 at 22:26

Thanks again for the help! Your suggestion worked, but I still face a weird issue while parsing the HTML code in soup. Please check the code above. Ideally it should return all the tweets, but right now it doesn't. text_tweet contains all the tweets; I use simple web scraping. – Violaceous 26/6, 2015 at 10:37

@AbhishekBhatia btw, you don't really need BeautifulSoup here. You can approach it with selenium itself - it is quite powerful in locating elements. Would you be okay with selenium-only option? – Fanlight 26/6, 2015 at 11:52

Thanks for the continued support. I would prefer BeautifulSoup, as most of my code uses it. Why do you think this happens? I tried your advice by replacing alltweets = soup.find_all(attrs={'data-item-type' : 'tweet'}) with alltweets=driver.find_elements_by_xpath("//li[@data-item-type='tweet']"). But it returns <class 'selenium.webdriver.remote.webelement.WebElement'> objects instead of a list with the HTML code of each element found. – Violaceous 26/6, 2015 at 15:18

@AbhishekBhatia okay, I've added an HTML parsing code snippet that extracts the texts of tweets. – Fanlight 26/6, 2015 at 15:23

Thanks for the help, but this code doesn't return all of them. If possible I wanted just code in the following form

import requests
from bs4 import BeautifulSoup
url='https://twitter.com/thecoolstacks'
req = requests.get(url)
soup = BeautifulSoup(req.content)
alltweets = soup.find_all(attrs={'data-item-type' : 'tweet'})
print alltweets[0]

I could parse the rest. – Violaceous 26/6, 2015 at 15:57

Hi! Thanks again for the help! I have one more question. I am scraping data, which seems to be against their rules, but I am not reselling it and am using it strictly for my research project. Should I acquire their permission in some way before doing so? If yes, can you guide me a bit? – Violaceous 29/6, 2015 at 19:11

@AbhishekBhatia I would still stay on the legal side. You can contact the web-site administration and explain your use-case. – Fanlight 29/6, 2015 at 19:13

Thanks for all the help! One problem: Sometimes it shows Loading is taking too much time while scrolling down and stops. Any workaround? – Violaceous 18/1, 2016 at 4:59
