Context
I would like to run my scraper in a Docker container without headless mode, because headless runs take longer than non-headless ones (I actually don't know why; it doesn't make sense to me, but that is not my main question). My scraper is written with Scrapy and Selenium (Python).
When I run my scraper on my laptop (not in a Docker container) it works perfectly both with and without headless, so I guess my problem is related to the Chrome that I install through my Dockerfile.
I need to run my scraper in a Docker container because I have to run it on an ECS instance on AWS, scheduled every day.
If you need any more resources, just tell me ;)
Thank you in advance!
Problem
It works with headless, but when I remove headless from the options I get this error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
result = current_context.run(gen.send, result)
File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py", line 41, in process_request
response = yield deferred_from_coro(method(request=request, spider=spider))
File "/scrapy_wouahome/middlewares.py", line 67, in process_request
driver = uc.Chrome(
File "/usr/local/lib/python3.8/site-packages/seleniumwire/undetected_chromedriver/v2.py", line 55, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 302, in __init__
super(Chrome, self).__init__(
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
super(WebDriver, self).__init__(DesiredCapabilities.CHROME['browserName'], "goog",
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/chromium/webdriver.py", line 93, in __init__
RemoteWebDriver.__init__(
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 248, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.8/site-packages/undetected_chromedriver/v2.py", line 577, in start_session
super(Chrome, self).start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 339, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 400, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 236, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:41619
from chrome not reachable
Resources
Chrome options:
import seleniumwire.undetected_chromedriver.v2 as uc
options = uc.ChromeOptions()
# options.add_argument("--headless")
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')
options.add_argument(f'--user-agent={self.get_random_ua()}')
options.add_argument('--no-first-run')
options.add_argument('--no-service-autorun')
options.add_argument('--no-default-browser-check')
options.add_argument('--password-store=basic')
options.add_argument('--no-proxy-server')
seleniumwire_options = {
# 'proxy': {
# 'http': random_proxy,
# 'https': random_proxy,
# }
}
driver = uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)
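Side note: the Dockerfile below installs xvfb, but nothing in the container ever starts a display, and a non-headless Chrome needs one. In case it helps to see what I mean, here is a minimal sketch of launching the non-headless driver inside a virtual display; pyvirtualdisplay is not part of my project, it is only an assumption to illustrate the idea (see also the xvfb-run note after the Dockerfile).
import seleniumwire.undetected_chromedriver.v2 as uc
from pyvirtualdisplay import Display  # wraps the Xvfb binary that the Dockerfile installs

# Sketch only: start a virtual X server so non-headless Chrome has a display to attach to.
display = Display(visible=0, size=(1920, 1080))
display.start()

options = uc.ChromeOptions()  # same options as above, just without --headless
options.add_argument('--disable-gpu')
options.add_argument('--disable-dev-shm-usage')

driver = uc.Chrome(options=options, seleniumwire_options={})
# ... crawl ...
driver.quit()
display.stop()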
Dockerfile:
# As Scrapy runs on Python, I choose the official Python 3 Docker image.
FROM --platform=linux/amd64 python:3.8
RUN apt-get update \
&& apt-get install -y --no-install-recommends wget xvfb unzip
#============================================
# Google Chrome
#============================================
# can specify versions by CHROME_VERSION;
# e.g. google-chrome-stable=53.0.2785.101-1
# google-chrome-beta=53.0.2785.92-1
# google-chrome-unstable=54.0.2840.14-1
# latest (equivalent to google-chrome-stable)
# google-chrome-beta (pull latest beta)
#============================================
ARG CHROME_VERSION="google-chrome-stable"
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
&& echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
&& apt-get update -qqy \
&& apt-get -qqy install \
${CHROME_VERSION:-google-chrome-stable} \
&& rm /etc/apt/sources.list.d/google-chrome.list \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/*
#============================================
# Chrome webdriver
#============================================
# can specify versions by CHROME_DRIVER_VERSION
# Latest released version will be used by default
#============================================
ARG CHROME_DRIVER_VERSION
RUN if [ -z "$CHROME_DRIVER_VERSION" ]; \
then CHROME_MAJOR_VERSION=$(google-chrome --version | sed -E "s/.* ([0-9]+)(\.[0-9]+){3}.*/\1/") \
&& CHROME_DRIVER_VERSION=$(wget --no-verbose -O - "https://chromedriver.storage.googleapis.com/LATEST_RELEASE_${CHROME_MAJOR_VERSION}"); \
fi \
&& echo "Using chromedriver version: "$CHROME_DRIVER_VERSION \
&& wget --no-verbose -O /tmp/chromedriver_linux64.zip https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip \
&& rm -rf /opt/selenium/chromedriver \
&& unzip /tmp/chromedriver_linux64.zip -d /opt/selenium \
&& rm /tmp/chromedriver_linux64.zip \
&& mv /opt/selenium/chromedriver /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION \
&& chmod 755 /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION \
&& ln -fs /opt/selenium/chromedriver-$CHROME_DRIVER_VERSION /usr/bin/chromedriver
# Copy the file from the local host to the filesystem of the container at the working directory.
COPY requirements.txt ./
# Install Scrapy specified in requirements.txt.
RUN pip install -r requirements.txt
RUN pip install selenium-wire
# Copy the project source code from the local host to the filesystem of the container at the working directory.
COPY . .
# Run the crawler when the container launches.
CMD [ "python3", "./go-spider.py" ]