How to solve problems with urllib3.connectionpool used in Selenium when parallelizing automation?

Quick description

I process many pages with Selenium. Until now this ran sequentially, but to improve performance I decided to parallelize it by splitting the pages across several threads. This is possible because the pages are independent of one another.

Here is the simplified code:

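import threading
from concurrent.futures import ThreadPoolExecutor

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def process_page(driver, page, lock):
    # One wait object per driver; the 10-second timeout is an assumed value.
    wait = WebDriverWait(driver, 10)
    driver.get(page.url())
    driver.find_element_by_css_selector("some selector")
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "some selector")))
    .
    .
    .
    # Hold the lock so that log lines from different threads do not interleave.
    with lock:
        for item in result_tuple:
            logger.info(item)
    return result_tuple

def process_all_pages():
    def pages_processing(worker_id, lock):
        result = []
        # Each worker thread creates and owns its own driver instance.
        with MyWebDriver(webdriver_options) as driver:
            for i in range(50):
                result.append(process_page(driver, pages[worker_id * 50 + i], lock))
        return result

    lock = threading.Lock()

    with ThreadPoolExecutor(4) as executor:
        futures = [executor.submit(pages_processing, i, lock) for i in range(4)]
        # result() blocks until the worker finishes and re-raises its exception, if any.
        result = [future.result() for future in futures]

    return result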

MyWebDriver is just a simple context manager for the Chrome driver: entering the context spawns a new Chrome driver instance, and exiting the context quits that instance.
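For illustration, a context manager of this kind might look roughly like the sketch below. The names `managed_driver`, `driver_factory`, and `FakeDriver` are hypothetical and not part of the original code; the fake driver only exists so the sketch runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Create a driver on entry and always quit it on exit.

    `driver_factory` is a hypothetical zero-argument callable, for example
    `lambda: webdriver.Chrome(options=webdriver_options)`.
    """
    driver = driver_factory()
    try:
        yield driver
    finally:
        # quit() is called even if the body of the with-block raised.
        driver.quit()


class FakeDriver:
    """Stand-in used here so the sketch runs without a browser."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


with managed_driver(FakeDriver) as driver:
    inside = driver.quit_called  # still False while the context is open
after = driver.quit_called  # True once the context has exited
```

The `try`/`finally` matters in a threaded run: if a page raises, the driver is still shut down instead of leaving an orphaned browser process.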

This code spawns 4 Chrome drivers, one per thread, and each thread does its Selenium work only in its own Chrome driver.
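The one-driver-per-thread layout described above can be sketched without a browser as follows. `StubDriver`, `chunked`, `process_chunk`, and `process_all` are hypothetical names introduced only for this example; the stub asserts that it is used solely by the thread that created it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class StubDriver:
    """Hypothetical stand-in for a real WebDriver; remembers its creating thread."""

    def __init__(self):
        self.owner = threading.get_ident()

    def get(self, url):
        # A driver should only be used by the thread that created it.
        assert threading.get_ident() == self.owner
        return "visited " + url

    def quit(self):
        pass


def chunked(items, n_chunks):
    """Split items into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def process_chunk(urls):
    driver = StubDriver()  # created inside the worker, never shared
    try:
        return [driver.get(url) for url in urls]
    finally:
        driver.quit()


def process_all(urls, workers=4):
    with ThreadPoolExecutor(workers) as executor:
        # map() returns results in the same order as the input chunks.
        per_chunk = list(executor.map(process_chunk, chunked(urls, workers)))
    return [item for chunk in per_chunk for item in chunk]


urls = ["page-%d" % i for i in range(10)]
results = process_all(urls)
```

Splitting by chunk rather than hard-coding 50 pages per worker keeps the sketch valid when the page count is not a multiple of the worker count.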

The problem

For the first few seconds it works like a charm, but after a while warnings start to appear in the log and Selenium seems to stop communicating with the Chrome drivers.

  • The behavior appears with any number of threads greater than one; a single thread works fine.
  • The behavior is the same on Windows and Ubuntu.

I can also provide debug logs if needed, but I'm not sure they contain anything relevant.

The warnings in the logger:

...
# With these first warnings, Selenium stops communicating with some of the Chrome drivers; nothing more happens in them.
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
...
# These warnings come a bit later
WARNING - urllib3.connectionpool - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018343AB24A8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348854E10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348869710>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
...
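One way to see which worker thread emits each of these warnings is to include the thread name in the log format. The sketch below assumes only the standard `logging` module; the warning is simulated from a thread with the made-up name `worker-2`, whereas in a real run urllib3 itself writes to the `urllib3.connectionpool` logger.

```python
import io
import logging
import threading

stream = io.StringIO()  # in-memory target so the sketch is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s - %(threadName)s - %(name)s - %(message)s")
)

pool_logger = logging.getLogger("urllib3.connectionpool")
pool_logger.addHandler(handler)
pool_logger.setLevel(logging.DEBUG)  # DEBUG also lets urllib3's per-request lines through


def emit():
    # Simulated record; in a real run urllib3 produces this message itself.
    pool_logger.warning("Connection pool is full, discarding connection: %s", "127.0.0.1")


worker = threading.Thread(target=emit, name="worker-2")
worker.start()
worker.join()

output = stream.getvalue()
```

With the thread name in every record, it becomes possible to check whether the warnings for one session URL come from a single thread or from several.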

Tested workarounds

I tried the patches from https://mcmap.net/q/273284/-change-the-connection-pool-size-for-python-39-s-quot-requests-quot-module-when-in-threading to set a higher maxsize for HTTPConnectionPool and HTTPSConnectionPool. This didn't fix the problem, although the patches were executed.

Next I tried setting a higher num_pools in the PoolManager class, together with a higher maxsize in HTTPConnectionPool and HTTPSConnectionPool, by editing the library sources directly. This solved one issue: the warnings disappeared from the log. However, Selenium's communication with the drivers still froze.
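As background for why a larger maxsize silences the first warning: to my understanding, a non-blocking urllib3 pool keeps at most maxsize idle connections per host, opens extra ones on demand, and discards any returned connection that no longer fits. The following is a toy model of that behavior using only the standard library, not urllib3's actual code; `ToyPool` and its counters are hypothetical.

```python
import queue


class ToyPool:
    """Toy model of a non-blocking connection pool; not urllib3's actual code."""

    def __init__(self, maxsize=1):
        self._idle = queue.LifoQueue(maxsize)
        self.created = 0
        self.discarded = 0

    def get(self):
        try:
            return self._idle.get(block=False)
        except queue.Empty:
            # No idle connection available: open a fresh one instead of waiting.
            self.created += 1
            return "conn-%d" % self.created

    def put(self, conn):
        try:
            self._idle.put(conn, block=False)
        except queue.Full:
            # A real pool would log its "pool is full" warning at this point.
            self.discarded += 1


pool = ToyPool(maxsize=1)
a = pool.get()  # two requests in flight at the same time
b = pool.get()
pool.put(a)     # fits into the single idle slot
pool.put(b)     # no room left, so this connection is discarded
```

Under this model the warning only means that more than maxsize requests to the same host were in flight through one pool at once. Raising maxsize hides the message, which matches what was observed above, but it does not by itself explain the later connection refusals.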

Nosology answered 27/1, 2020 at 20:30 Comment(4)
This won't work. Use Puppeteer/Pyppeteer if you must have concurrency. - Betti
@Betti I think it is possible: in my example each driver runs separately in its own thread. Some excerpts from conversations confirming the idea: #30809106 groups.google.com/forum/#!msg/webdriver/cw_awztl-IM/… - Nosology
Confirming the idea that Selenium is non-blocking/thread-safe? Sorry, I think you misunderstood what you read. - Betti
@Betti No, confirming the idea that multiple instances can be run simultaneously on different threads. - Nosology
