Run Selenium in a Docker container without headless mode

Context

I would like to run my scraper in a Docker container without headless mode, because with headless it takes more time than without (I actually don't know why; it doesn't make sense to me, but that is not my main question). My scraper is built with Scrapy and Selenium (Python).

When I run my scraper on my laptop (not in a Docker container) it works perfectly both with and without headless, so I guess my problem is related to the Chrome I installed through my Dockerfile.

I need to run my scraper in a Docker container because I have to run it on an ECS instance on AWS to schedule it every day.

If you need any more resources, just tell me ;)

Thank you in advance!

Problem

It works with headless, but when I remove headless from the options I get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 41, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/scrapy_wouahome/middlewares.py", line 67, in process_request
    driver = uc.Chrome(
  File "/usr/local/lib/python3.8/site-packages/seleniumwire/undetected_chromedriver/v2.py", line 55, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 302, in __init__
    super(Chrome, self).__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
    RemoteWebDriver.__init__(
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 248, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 577, in start_session
    super(Chrome, self).start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 339, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 400, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 236, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:41619
from chrome not reachable
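
One lead (not verified yet): without --headless, Chrome needs an X display to draw its window on, and a plain Docker container has none, which would explain "chrome not reachable" appearing only in non-headless mode. Below is a minimal sketch of the workaround I'm testing with pyvirtualdisplay; that is an extra dependency which is not in my code and would have to be added to requirements.txt, but it just drives the xvfb binary my Dockerfile already installs.

from pyvirtualdisplay import Display

import seleniumwire.undetected_chromedriver.v2 as uc

# Start a virtual X server (Xvfb) so a non-headless Chrome has a display.
display = Display(visible=False, size=(1920, 1080))
display.start()

options = uc.ChromeOptions()
driver = uc.Chrome(options=options)
try:
    driver.get('https://example.com')
finally:
    driver.quit()
    display.stop()  # tear the virtual display down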

Resources

Chrome options:

import seleniumwire.undetected_chromedriver.v2 as uc

options = uc.ChromeOptions()

# options.add_argument("--headless")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')

options.add_argument(f'--user-agent={self.get_random_ua()}')
options.add_argument('--no-first-run')
options.add_argument('--no-service-autorun')
options.add_argument('--no-default-browser-check')
options.add_argument('--password-store=basic')

options.add_argument('--no-proxy-server')

seleniumwire_options = {
  # 'proxy': {
  #     'http': random_proxy,
  #     'https': random_proxy,
  # }
}

driver = uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)
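
(For reference: I kept --disable-dev-shm-usage because Docker's default /dev/shm is only 64 MB, which Chrome can exhaust and crash on; running the container with a larger --shm-size would be the alternative. I have also read that when the container runs as root, which is the default for this image, Chrome may additionally need --no-sandbox to start, though headless works for me without it.)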

Dockerfile:

# As Scrapy runs on Python, I choose the official Python 3 Docker image.
FROM --platform=linux/amd64 python:3.8

RUN apt-get update \
  && apt-get install -y --no-install-recommends wget xvfb unzip

#============================================
# Google Chrome
#============================================
# can specify versions by CHROME_VERSION;
#  e.g. google-chrome-stable=53.0.2785.101-1
#       google-chrome-beta=53.0.2785.92-1
#       google-chrome-unstable=54.0.2840.14-1
#       latest (equivalent to google-chrome-stable)
#       google-chrome-beta  (pull latest beta)
#============================================
ARG CHROME_VERSION="google-chrome-stable"
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
  && echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
  && apt-get update -qqy \
  && apt-get -qqy install \
  ${CHROME_VERSION:-google-chrome-stable} \
  && rm /etc/apt/sources.list.d/google-chrome.list \
  && rm -rf /var/lib/apt/lists/* /var/cache/apt/*

#============================================
# Chrome webdriver
#============================================
# can specify versions by CHROME_DRIVER_VERSION
# Latest released version will be used by default
#============================================
ARG CHROME_DRIVER_VERSION
RUN if [ -z "$CHROME_DRIVER_VERSION" ]; \
  then CHROME_MAJOR_VERSION=$(google-chrome --version | sed -E "s/.* ([0-9]+)(\.[0-9]+){3}.*/\1/") \
  && CHROME_DRIVER_VERSION=$(wget --no-verbose -O - "https://chromedriver.storage.googleapis.com/LATEST_RELEASE_${CHROME_MAJOR_VERSION}"); \
  fi \
  && echo "Using chromedriver version: "$CHROME_DRIVER_VERSION \
  && wget --no-verbose -O /tmp/chromedriver_linux64.zip https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip \
  && rm -rf /opt/selenium/chromedriver \
  && unzip /tmp/chromedriver_linux64.zip -d /opt/selenium \
  && rm /tmp/chromedriver_linux64.zip \
  && mv /opt/selenium/chromedriver /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION \
  && chmod 755 /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION \
  && ln -fs /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION /usr/bin/chromedriver
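
Note: the LATEST_RELEASE files on chromedriver.storage.googleapis.com only cover Chrome up to version 114; for newer Chrome releases Google publishes drivers through the "Chrome for Testing" endpoints instead, so this download step may need adapting.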

# Copy the file from the local host to the filesystem of the container at the working directory.
COPY requirements.txt ./

# Install the Python dependencies specified in requirements.txt.
RUN pip install -r requirements.txt

RUN pip install selenium-wire

# Copy the project source code from the local host to the filesystem of the container at the working directory.
COPY . .

# Run the crawler when the container launches.
CMD [ "python3", "./go-spider.py" ]
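
Since xvfb is already installed above, another option would be to wrap the entry point so that every process in the container gets a virtual display, e.g. (untested in this exact image):

CMD [ "xvfb-run", "python3", "./go-spider.py" ]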
Communistic asked 2/5, 2022 at 17:49

Comment: Did you ever find a solution to this? I'm dealing with the same challenge. – Pinder
