Quick description
I'm processing many pages with Selenium sequentially. To improve performance, I've decided to parallelize the processing by splitting the pages across several threads (this is possible because the pages are independent of one another).
Here is the simplified code:
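import threading
from concurrent.futures import ThreadPoolExecutor

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


def process_page(driver, page, lock):
    driver.get("page.url()")
    driver.find_element_by_css_selector("some selector")
    # `wait` is assumed to be a WebDriverWait created elsewhere (omitted from this simplified code)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "some selector")))
    # ... (rest of the page processing omitted; it produces result_tuple)
    # Hold the lock so one page's log lines are not interleaved with another thread's
    with lock:
        for item in result_tuple:
            logger.info(item)
    return result_tuple


def process_all_pages():
    def pages_processing(id, lock):
        # Each thread owns its own driver and handles its own block of 50 pages
        result = []
        with MyWebDriver(webdriver_options) as driver:
            for i in range(50):
                result.append(process_page(driver, pages[id * 50 + i], lock))
        return result

    lock = threading.Lock()
    with ThreadPoolExecutor(4) as executor:
        futures = []
        for i in range(4):
            futures.append(executor.submit(pages_processing, i, lock))
        result = []
        for future in futures:
            result.append(future.result())
    return result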
MyWebDriver is a simple context manager for the Chrome driver: entering the context spawns a new Chrome driver instance, and exiting the context quits that instance.
This code spawns 4 Chrome drivers, one per thread, and each thread does its Selenium work in its own driver.
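For reference, the per-thread setup described above can be sketched without Selenium roughly as follows. This is only an illustration, not the actual MyWebDriver: `FakeDriver`, `managed_driver`, `process_chunk`, and `run_all` are hypothetical names, and `FakeDriver` stands in for a real Chrome driver so that the snippet runs without a browser. In real code the factory would presumably create a Chrome driver from `webdriver_options` instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeDriver:
    """Hypothetical stand-in for a Chrome driver; it only records calls."""

    def __init__(self):
        self.visited = []
        self.closed = False

    def get(self, url):
        self.visited.append(url)

    def quit(self):
        self.closed = True


@contextmanager
def managed_driver(factory):
    """Create a driver on entry and always quit it on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()


def process_chunk(chunk, lock, log):
    """Process one slice of pages with a driver owned by this thread only."""
    with managed_driver(FakeDriver) as driver:
        results = []
        for url in chunk:
            driver.get(url)
            results.append(url.upper())  # placeholder for the real page work
        with lock:  # keep this thread's log entries together
            log.extend(results)
        return results


def run_all(pages, workers=4):
    """Split pages into contiguous chunks and process each chunk in a thread."""
    size = max(1, -(-len(pages) // workers))  # ceiling division, at least 1
    chunks = [pages[i:i + size] for i in range(0, len(pages), size)]
    lock, log = threading.Lock(), []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(process_chunk, c, lock, log) for c in chunks]
        return [f.result() for f in futures], log
```

The point of the sketch is that no driver object is ever shared between threads; only the lock and the log list are shared.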
The problem
For the first few seconds it works like a charm, but after some time warnings start to appear in the log and Selenium seems to stop communicating with the Chrome drivers.
- The same behavior appears with any number of threads greater than one; it only works when run on a single thread.
- The same behavior occurs on both Windows and Ubuntu.
If needed, I can also provide debug logs, but I'm not sure they contain anything relevant.
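One way such debug logs could be captured is sketched below, using only the standard `logging` module. The helper name `enable_urllib3_debug` and the exact format string are assumptions for illustration; the logger name `urllib3` is taken from the `urllib3.connectionpool` prefix in the warnings, and the thread name is added so that lines from different worker threads can be told apart.

```python
import logging


def enable_urllib3_debug(stream=None):
    """Send DEBUG-level urllib3 log records to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s - %(name)s - %(threadName)s - %(message)s")
    )
    logger = logging.getLogger("urllib3")  # parent of urllib3.connectionpool
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger
```

Calling `enable_urllib3_debug()` once before starting the thread pool should be enough, since child loggers such as `urllib3.connectionpool` inherit the level and propagate to this handler.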
The warnings in the logger:
...
# With these first warnings, Selenium stops communicating with some of the Chrome drivers: nothing happens in them anymore.
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
WARNING - urllib3.connectionpool - Connection pool is full, discarding connection: 127.0.0.1
...
# These warnings come a bit later
WARNING - urllib3.connectionpool - Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018343AB24A8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348854E10>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
WARNING - urllib3.connectionpool - Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000018348869710>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it')': /session/9c9fc148f278aaa360a26d95eac0966e/url
...
Tested workarounds
I've tried the patches from https://mcmap.net/q/273284/-change-the-connection-pool-size-for-python-39-s-quot-requests-quot-module-when-in-threading to set a higher maxsize (HTTPConnectionPool, HTTPSConnectionPool). This didn't fix the problem, although the patches were definitely executed.
Next, I tried setting a higher num_pools in the PoolManager class, together with a higher maxsize in HTTPConnectionPool and HTTPSConnectionPool; I changed these directly in the library sources. This solved one issue, as the warnings no longer appeared in the log, BUT Selenium's communication with the driver still froze.
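The two pool-size settings mentioned above can be sketched as follows. This is an approximation of the kind of patch described, not necessarily the exact code that was applied: `patch_http_pool_maxsize` and `LargerHTTPConnectionPool` are hypothetical names, the values 16 and 8 are arbitrary, and the URL is only a placeholder for a local driver address. No connection is opened by this snippet.

```python
import urllib3
from urllib3 import connectionpool, poolmanager


def patch_http_pool_maxsize(maxsize):
    """Make every plain-HTTP pool created through a PoolManager use `maxsize`."""

    class LargerHTTPConnectionPool(connectionpool.HTTPConnectionPool):
        def __init__(self, *args, **kwargs):
            kwargs["maxsize"] = maxsize  # override whatever the caller passed
            super().__init__(*args, **kwargs)

    # PoolManager looks up the pool class for each scheme in this mapping
    poolmanager.pool_classes_by_scheme["http"] = LargerHTTPConnectionPool


patch_http_pool_maxsize(16)

# num_pools limits how many per-host pools the manager caches;
# maxsize limits how many connections each pool keeps for reuse.
manager = urllib3.PoolManager(num_pools=8)
pool = manager.connection_from_url("http://127.0.0.1:9515")  # placeholder address
```

Note that the "Connection pool is full, discarding connection" warning is logged when a connection is returned to a non-blocking pool that has no free slot, so a larger maxsize mainly silences that warning. That would be consistent with the warnings disappearing while the freeze remains.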