Script Suddenly Stops Crawling Without Error or Exception

I'm not sure why, but my script always stops crawling once it hits page 9. There are no errors, exceptions, or warnings, so I'm kind of at a loss.

Can somebody help me out?

P.S. Here is the full script in case anybody wants to test it for themselves!

# Note: create_webdriver_instance(), already_scraped_product_titles, and the
# Selenium imports (By, EC, WebDriverWait) are defined elsewhere in the full script.
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(len(items))
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    break
            if count+1 is len(items):
                try:
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    print(error)
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

initiate_crawl()

Printing the length of items invokes some strange behaviour too. Instead of always returning 32, which would correspond to the number of items on each page, it prints 32 for the first page, 64 for the second, 96 for the third, and so forth. I fixed this by using //div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")] instead of //div[contains(@id, "100_dealView_")] as the XPath for the items variable. I'm hoping this is the reason why it runs into issues on page 9. I'm running tests right now.

Update: It is now scraping page 10 and beyond, so the issue is resolved.
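For reference, a minimal sketch of the corrected wait (same setup as in the script above; only the XPath changes):

items = WebDriverWait(ff, 15).until(
    EC.visibility_of_all_elements_located((
        By.XPATH,
        '//div[contains(@id, "100_dealView_")]/div[contains(@class, "dealContainer")]'
    ))
)
print(len(items))  # now 32 on every page instead of 32, 64, 96, ...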

Stringer answered 7/10, 2018 at 19:31 Comment(12)
Did you monitor the crawling process? Are there still buttons like ‘More’ on the 9th page?Reglet
@Reglet Everything is monitored. I've checked the XPath, everything; nothing seems to be broken.Stringer
Can you check with different browser versions?Protostele
@Protostele I'm not exactly sure how to do that. How can I do that?Stringer
I couldn't get your script to run but it seems likely that at some point you're getting items of length 0 and so the enumeration loop isn't happening. Try printing the length of items before the loop and see what happens before the code ends.Mendie
@AndrewMcDowell Good idea! I'm at the point where I believe it must be somewhere else in the script. I've currently set a bunch of time.sleep(n)s and am running a test with that. After that I'll print the length! Thanks for the input <3Stringer
@AndrewMcDowell You'd think, though, that it would throw an error if items failed to find anything. I mean, it should or would throw an error. Worth exploring at this point nonetheless.Stringer
@AndrewMcDowell So I've noticed something super strange: instead of len(items) returning the number of items on each page, it returns 32 for the first page, then 64 for the second page (33-64), 96 for the third (65-96), and so on. How can this be happening?Stringer
That is strange and not the behaviour I'd expect, but it may be a side effect of the way the site is coded. That could be what's causing the problem, but if so I imagine the problem would happen on page 2 already, so I'm afraid I'm out of ideas. Good luck!Mendie
@AndrewMcDowell It must have something to do with the site, since I quit the instance of the webdriver and then create another instance anew... Man, I've tested much longer scripts on Amazon, and it seems like they've done almost everything in their power to prevent scraping/DDoSing. Unfortunate.Stringer
@AndrewMcDowell So I changed up my script a bit and duplicated the if count+1... statement on the same level as the if product_title not in statement, and got the following error: HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')) which is strange, since I never received [cont.]Stringer
@AndrewMcDowell ...an error like this before. Obviously it indicates that Amazon is actively refusing my connection on that page. Not sure why only the additional if statement would invoke that.Stringer

As per the 10th revision of your question, the error message...

HTTPConnectionPool(host='127.0.0.1', port=58992): Max retries exceeded with url: /session/e8beed9b-4faa-4e91-a659-56761cb604d7/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D31378A58>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))

...implies that the get() method failed, raising an HTTPConnectionPool error with the message Max retries exceeded.


Solution

As per the Release Notes of Selenium 3.14.1:

* Fix ability to set timeout for urllib3 (#6286)

The corresponding merge is: repair urllib3 can't set timeout!

Conclusion

Once you upgrade to Selenium 3.14.1, you will be able to set the timeout, see canonical tracebacks, and take the required action.
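As a quick sanity check after upgrading (upgrade with pip install --upgrade selenium), you can verify the installed version from Python:

    import selenium
    # The urllib3 timeout fix (#6286) shipped with Selenium 3.14.1
    print(selenium.__version__)  # expect '3.14.1' or later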


This use case

I have taken your full script from codepen.io (A PEN BY Anthony). I had to make a few tweaks to your existing code, as follows:

  • As you have used:

      ua_string = random.choice(ua_strings)

    you have to import random:

      import random
  • You have created the variable next_button but haven't used it. I have combined the following four lines:

      next_button = WebDriverWait(ff, 15).until(
                      EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                  )
      ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
    

    As:

      WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
      ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()              
    
  • Your modified code block will be:

      # -*- coding: utf-8 -*-
      from selenium import webdriver
      from selenium.webdriver.firefox.options import Options
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
      from selenium.webdriver.support.ui import WebDriverWait
      import time
      import random
    
    
      """ Set Global Variables
      """
      ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
      already_scraped_product_titles = []
    
    
    
      """ Create Instances of WebDriver
      """
      def create_webdriver_instance():
          ua_string = random.choice(ua_strings)
          profile = webdriver.FirefoxProfile()
          profile.set_preference('general.useragent.override', ua_string)
          options = Options()
          options.add_argument('--headless')
          # Pass both the profile and the headless options to Firefox explicitly
          return webdriver.Firefox(firefox_profile=profile, options=options)
    
    
    
      """ Construct List of UA Strings
      """
      def fetch_ua_strings():
          ff = create_webdriver_instance()
          ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
          ua_strings_ff_eles = ff.find_elements(By.XPATH, '//td[@class="useragent"]')
          for ua_string in ua_strings_ff_eles:
              if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
                  ua_strings.append(ua_string.text)
          ff.quit()
    
    
    
      """ Log in to Amazon to Use SiteStripe in order to Generate Affiliate Links
      """
      def log_in(ff):
          ff.find_element(By.XPATH, '//a[@id="nav-link-yourAccount"] | //a[@id="nav-link-accountList"]').click()
          ff.find_element(By.ID, 'ap_email').send_keys('[email protected]')
          ff.find_element(By.ID, 'continue').click()
          ff.find_element(By.ID, 'ap_password').send_keys('lo0kyLoOkYig0t4h')
          ff.find_element(By.NAME, 'rememberMe').click()
          ff.find_element(By.ID, 'signInSubmit').click()
    
    
    
      """ Build Lists of Product Page URLs
      """
      def initiate_crawl():
          def refresh_page(url):
              ff = create_webdriver_instance()
              ff.get(url)
              ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
              ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
              items = WebDriverWait(ff, 15).until(
                  EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
              )
              for count, item in enumerate(items):
                  slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
                  active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
                  # For Groups of Items on Sale
                  # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
                  if len(slashed_price) > 0 and len(active_deals) > 0:
                      product_title = item.find_element(By.ID, 'dealTitle').text
                      if product_title not in already_scraped_product_titles:
                          already_scraped_product_titles.append(product_title)
                          url = ff.current_url
                          # Scrape Details of Each Deal
                          #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                          print(product_title[:10])
                          ff.quit()
                          refresh_page(url)
                          break
                  if count+1 is len(items):
                      try:
                          print('')
                          print('new page')
                          WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→'))
                          ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
                          time.sleep(10)
                          url = ff.current_url
                          print(url)
                          print('')
                          ff.quit()
                          refresh_page(url)
                      except Exception as error:
                          """
                          ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                          url = ff.current_url
                          ff.quit()
                          refresh_page(url)
                          """
                          print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                          print('Because of... {}'.format(error))
                          ff.quit()

          refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')
    
      #def extract_info(ff, url):
      fetch_ua_strings()
      initiate_crawl()
    
  • Console Output: With Selenium v3.14.0 and Firefox Quantum v62.0.3, I can extract the following output on the console:

      J.Rosée Si
      B.Catcher 
      Bluetooth4
      FRAM G4164
      Major Crim
      20% off Oh
      True Blood
      Prime-Line
      Marathon 3
      True Blood
      B.Catcher 
      4 Film Fav
      True Blood
      Texture Pa
      Westinghou
      True Blood
      ThermoPro 
      ...
      ...
      ...
    

Note: I could have optimized your code and performed the same web scraping operations by initializing the Firefox browser client only once and traversing through the various products and their details. But to preserve your logic and innovation I have suggested the minimal changes required to get you through.
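For illustration, a rough sketch of what that single-instance variant could look like (the function name crawl_single_instance is hypothetical; it reuses create_webdriver_instance() and already_scraped_product_titles from the script above and assumes the same Next→ link):

    def crawl_single_instance(start_url):
        # Sketch only: one Firefox instance, paginating via the Next link
        ff = create_webdriver_instance()
        ff.get(start_url)
        while True:
            items = WebDriverWait(ff, 15).until(
                EC.visibility_of_all_elements_located(
                    (By.XPATH, '//div[contains(@id, "100_dealView_")]')
                )
            )
            for item in items:
                titles = item.find_elements(By.ID, 'dealTitle')
                if titles and titles[0].text not in already_scraped_product_titles:
                    already_scraped_product_titles.append(titles[0].text)
            try:
                # Same browser session; no quit()/re-create between pages
                ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()
            except Exception:
                break  # no Next link found: last page reached
        ff.quit()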

Thrown answered 10/10, 2018 at 10:23 Comment(4)
That HTTPConnectionPool error in my question is an outlier. The script works perfectly fine until it suddenly stops on page 9 without an error or exception. The only reason I set but didn't use next_button is because I was trying to troubleshoot this, thought that might've had something to do with it, and never reset it. The question/issue here is: why does it stop crawling/scraping once it finishes page 9?Stringer
Oh, maybe that was why I never reset it. WebDriverWait(ff, 15).until(EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')).click() returns a bool object, which isn't clickable. You have an error in your code.Stringer
@Anthony As per your observation I have made some small modifications to my solution. I could have optimized your code and performed the same web scraping, opening the Firefox browser client only once and traversing through the various products. But to preserve your logic and innovation I have suggested the minimal changes required to get you through. Can you try the updated solution and let me know the status, please?Thrown
I actually fixed the issue on my own a number of days ago. I had updated my question and just updated it again for the sake of clarity. Yeah, there is a specific reason why I'm not traversing the website using a single instance, but instead creating multiple instances. Though now that I think about it, creating new instances may not even be necessary with ff.back() functionality, but it's still certainly a lot more straightforward. I will give your answer a read when I get some free time! Thanks for trying to solve my issue <3Stringer

I slightly adjusted the code and it seems to work. Changes:

Added the import random statement, because random is used and the script would not run without it.

Inside the product_title loop these lines are removed:

ff.quit(), refresh_page(url) and break

The ff.quit() call would kill the driver mid-loop, causing a fatal (connection) error that breaks the script.
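A minimal sketch of that failure mode (the URL is a hypothetical placeholder; any WebDriver call after quit() talks to a local driver server that is no longer running):

ff = create_webdriver_instance()
ff.get('https://example.com')  # hypothetical placeholder URL
ff.quit()                      # shuts down the local geckodriver server
ff.current_url                 # raises MaxRetryError: connection refused on 127.0.0.1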

Also, the identity check is is changed to the equality check == in if count + 1 == len(items):
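For what it's worth, is only appears to work here because CPython caches the integers -5 through 256; outside that range, count + 1 is len(items) can be False even when the values are equal. With the cumulative item counts described in the question (32 per page), page 9 is the first page where len(items) exceeds 256 (9 × 32 = 288), which would explain the silent stop on page 9. A quick CPython REPL sketch (newer versions also emit a SyntaxWarning for is against a literal):

>>> x = 255 + 1
>>> x is 256   # within CPython's small-int cache: same object
True
>>> y = 287 + 1
>>> y is 288   # outside the cache: distinct objects, so False
False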

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time
import random



""" Set Global Variables
"""
ua_strings = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36']
already_scraped_product_titles = []



""" Create Instances of WebDriver
"""
def create_webdriver_instance():
    ua_string = random.choice(ua_strings)
    profile = webdriver.FirefoxProfile()
    profile.set_preference('general.useragent.override', ua_string)
    options = Options()
    options.add_argument('--headless')
    # Pass both the profile and the headless options to Firefox explicitly
    return webdriver.Firefox(firefox_profile=profile, options=options)

""" Construct List of UA Strings
"""
def fetch_ua_strings():
    ff = create_webdriver_instance()
    ff.get('https://techblog.willshouse.com/2012/01/03/most-common-user-agents/')
    ua_strings_ff_eles = ff.find_elements(By.XPATH, '//td[@class="useragent"]')
    for ua_string in ua_strings_ff_eles:
        if 'mobile' not in ua_string.text and 'Trident' not in ua_string.text:
            ua_strings.append(ua_string.text)
    ff.quit()

""" Build Lists of Product Page URLs
"""
def initiate_crawl():
    def refresh_page(url):
        ff = create_webdriver_instance()
        ff.get(url)
        ff.find_element(By.XPATH, '//*[@id="FilterItemView_sortOrder_dropdown"]/div/span[2]/span/span/span/span').click()
        ff.find_element(By.XPATH, '//a[contains(text(), "Discount - High to Low")]').click()
        items = WebDriverWait(ff, 15).until(
            EC.visibility_of_all_elements_located((By.XPATH, '//div[contains(@id, "100_dealView_")]'))
        )
        print(items)
        for count, item in enumerate(items):
            slashed_price = item.find_elements(By.XPATH, './/span[contains(@class, "a-text-strike")]')
            active_deals = item.find_elements(By.XPATH, './/*[contains(text(), "Add to Cart")]')
            # For Groups of Items on Sale
            # active_deals = //*[contains(text(), "Add to Cart") or contains(text(), "View Deal")]
            if len(slashed_price) > 0 and len(active_deals) > 0:
                product_title = item.find_element(By.ID, 'dealTitle').text
                if product_title not in already_scraped_product_titles:
                    already_scraped_product_titles.append(product_title)
                    url = ff.current_url
                    # Scrape Details of Each Deal
                    #extract(ff, item.find_element(By.ID, 'dealImage').get_attribute('href'))
                    print(product_title[:10])
                    # This ff.quit() would break the connection, which breaks the script:
                    #ff.quit()
                    # ...and without it the recursive refresh_page(url) and break are unnecessary:
                    #refresh_page(url)
                    #break
            # 'is' tests for object identity; == tests for value equality:
            if count+1 == len(items):
                try:
                    print('')
                    print('new page')
                    next_button = WebDriverWait(ff, 15).until(
                        EC.text_to_be_present_in_element((By.PARTIAL_LINK_TEXT, 'Next→'), 'Next→')
                    )
                    ff.find_element(By.PARTIAL_LINK_TEXT, 'Next→').click()                    
                    time.sleep(3)
                    url = ff.current_url
                    print(url)
                    print('')
                    ff.quit()
                    refresh_page(url)
                except Exception as error:
                    """
                    ff.find_element(By.XPATH, '//*[@id="pagination-both-004143081429407891"]/ul/li[9]/a').click()
                    url = ff.current_url
                    ff.quit()
                    refresh_page(url)
                    """
                    print('cannot find ff.find_element(By.PARTIAL_LINK_TEXT, "Next→")')
                    print('Because of... {}'.format(error))
                    ff.quit()

    refresh_page('https://www.amazon.ca/gp/goldbox/ref=gbps_ftr_s-3_4bc8_dct_10-?gb_f_c2xvdC0z=sortOrder:BY_SCORE,discountRanges:10-25%252C25-50%252C50-70%252C70-&pf_rd_p=f5836aee-0969-4c39-9720-4f0cacf64bc8&pf_rd_s=slot-3&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A3DWYIK6Y9EEQB&pf_rd_r=CQ7KBNXT36G95190QJB1&ie=UTF8')

#def extract_info(ff, url):
fetch_ua_strings()
initiate_crawl()
Exhilarate answered 10/10, 2018 at 18:30 Comment(0)
