Selenium + Flask/Falcon in Python - 502 Bad Gateway Error

I'm using Selenium to scrape a website headlessly from inside an API endpoint built with Flask in Python. My scraping code works perfectly both as a standalone script and as an API on localhost. However, when I deploy it to a remote server, the requests always return a 502 Bad Gateway error. This is odd because the logs show that the scraping itself works correctly, but the server responds with 502 before the scraping finishes, as if it were trying to set up a proxy and failing. I also noticed that removing the time.sleep from my code makes it return a 200, although the result may be wrong because Selenium doesn't get enough time to load the whole page.

I also tried Falcon instead of Flask and got a similar error. This is a sample of my most recent code using Falcon:

class GetUrl(object):

    def on_get(self, req, resp):
        """
        Get Request
        :param req:
        :param resp:
        :return:
        """

        # read parameter
        req_body = req.bounded_stream.read()
        json_data = json.loads(req_body.decode('utf8'))
        url = json_data.get("url")

        # get the url
        options = Options()
        options.add_argument("--headless")
        driver = webdriver.Firefox(firefox_options=options)

        driver.get(url)
        time.sleep(5)
        result = False

        # check for outbound links
        content = driver.find_elements_by_xpath("//a[@class='_52c6']")
        if len(content) > 0:
            href = content[0].get_attribute("href")
            result = True

        driver.quit()

        # make the return
        return_doc = {"result": result}
        resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
        resp.content_type = 'text/string'
        resp.append_header('Access-Control-Allow-Origin', "*")
        resp.status = falcon.HTTP_200

I have seen other similar issues, but even though I can see that gunicorn is running on my server, I don't have nginx, or at least it is not running where it should be. And I don't think Falcon uses it. So, what exactly am I doing wrong? Any light on this issue is highly appreciated, thank you!
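
A 502 normally comes from some proxy or load balancer sitting in front of the application server, reporting that the upstream closed the connection or failed to answer. If gunicorn is what serves the app, one thing worth checking is its worker timeout: by default a worker that stays silent for more than 30 seconds is killed and restarted, which would abort a slow Selenium request. A minimal sketch of a `gunicorn.conf.py` that raises the limit follows; the bind address and all values are assumptions for illustration, not settings taken from the question.

```python
# gunicorn.conf.py (hypothetical values, tune them for your server)

# Address gunicorn listens on; assumed here, use whatever your deployment expects.
bind = "0.0.0.0:8000"

# Each headless browser is heavy, so keep the worker count small.
workers = 2

# Seconds a worker may stay silent before gunicorn kills and restarts it.
# The default is 30, which a browser start plus a page load can exceed.
timeout = 120

# Seconds a worker gets to finish in-flight requests after a restart signal.
graceful_timeout = 30
```

It would be loaded with something like `gunicorn -c gunicorn.conf.py app:api`, where `app:api` is a placeholder for the module and WSGI object of the actual project. If another proxy sits in front of gunicorn, its own timeout may need the same treatment.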

Boniface answered 3/9, 2021 at 2:50 Comment(2)
What's the timeout on the server set to? (Hanes)
Try to apply WebDriverWait rather than time.sleep. It will wait only as long as required, which will likely speed things up. (Gandhi)
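
To illustrate the second comment: an explicit wait polls for the element and returns as soon as it shows up, instead of always sleeping for a fixed time. Selenium ships this as `WebDriverWait(driver, timeout).until(...)`; the helper below is a hand-rolled stand-in for the same idea, written against the standard library only so the polling logic is visible. The function name and default values are made up for this sketch. It relies on the driver's `find_elements(by, value)` method, where the string `"xpath"` is the value of `By.XPATH`.

```python
import time


def wait_for_elements(driver, xpath, timeout=10.0, poll=0.25):
    """Poll until the XPath matches at least one element or the timeout expires.

    `driver` is any object with a Selenium-style find_elements(by, value) method.
    Returns the list of matches, which is empty if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements("xpath", xpath)
        # Stop as soon as something matched, or once the deadline has passed.
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(poll)
```

In the handler from the question, `time.sleep(5)` followed by the element lookup could then become `content = wait_for_elements(driver, "//a[@class='_52c6']", timeout=5)`, which returns early when the link is present and waits at most about 5 seconds when it is not.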

You're missing a few imports:

# Setup commands: the "!" prefix only works in IPython/Jupyter/Colab;
# drop it when running them in a regular shell.
!apt-get update
!apt install chromium-chromedriver
!which chromedriver
!pip install selenium
!pip install page_objects

# Standard library
import json
import time

# Third party
import falcon  # needed for falcon.HTTP_200 in the handler below
from IPython.display import clear_output
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from page_objects import PageObject, PageElement

time.sleep(1)
clear_output()

class GetUrl(object):

    def on_get(self, req, resp):
        """
        Get Request
        :param req:
        :param resp:
        :return:
        """

        # read parameter
        req_body = req.bounded_stream.read()
        json_data = json.loads(req_body.decode('utf8'))
        url = json_data.get("url")

        # get the url
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        driver = webdriver.Chrome('chromedriver', options=options)
        driver.implicitly_wait(3)

        driver.get(url)
        result = False

        # check for outbound links
        contentStorage = []
        content = driver.find_elements_by_tag_name('a')
        for i in content:
            contentStorage.append(i.get_attribute('text'))
            result = True

        #driver.quit()

        # make the return
        return_doc = {"result": result}
        resp.body = json.dumps(return_doc, sort_keys=True, indent=2)
        resp.content_type = 'text/string'
        resp.append_header('Access-Control-Allow-Origin', "*")
        resp.status = falcon.HTTP_200
Faxen answered 12/10, 2021 at 19:42 Comment(0)
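
Whichever browser is used, each request starts a full browser process, so it is worth making sure `quit()` runs even when the page load or the lookup raises; otherwise leftover processes pile up on the server. A minimal sketch with `try`/`finally` follows. The function name and the `make_driver` factory parameter are hypothetical, introduced only for this example, and the driver is assumed to expose the usual Selenium methods `get`, `find_elements(by, value)` and `quit`.

```python
def has_matching_link(make_driver, url, xpath):
    """Open `url` in a fresh browser and report whether `xpath` matches anything.

    `make_driver` is a zero-argument callable returning a Selenium-style driver.
    """
    driver = make_driver()
    try:
        driver.get(url)
        # "xpath" is the string value of selenium's By.XPATH
        return len(driver.find_elements("xpath", xpath)) > 0
    finally:
        # Runs on success and on error, so the browser process never leaks.
        driver.quit()
```

With Selenium installed, `make_driver` could be something like `lambda: webdriver.Chrome(options=options)`, and the Falcon handler would only need to call the function and serialize the boolean it returns.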
