seleniumbase (undetected Chrome driver): how to set request header?

I am using seleniumbase with Driver(uc=True), which works well for my specific scraping use case (and appears to be the only driver that consistently remains undetected for me).

It is fine for everything that doesn't need specific header settings.

For one particular type of scrape I need to set the Request Header (Accept -> application/json).

This works consistently when I do it manually in Chrome via the Requestly extension, but I cannot work out how to put it in place for seleniumbase undetected Chrome.

I tried using execute_cdp_cmd with Network.setExtraHTTPHeaders (with Network.enable first): this ran without error but the request appeared to ignore it. (I was, tbh, unconvinced that the uc=True support was handling this functionality properly, since it doesn't appear to have full Chromium driver capabilities.)
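
For reference, the CDP attempt described above looks roughly like the sketch below. The helper name set_accept_header is hypothetical, and the only assumption about the driver is that it exposes Selenium's execute_cdp_cmd. It reproduces what was tried; it is not a fix for the header being ignored under uc=True.

```python
def set_accept_header(driver, accept="application/json"):
    """Ask Chrome (via CDP) to add an Accept header to every request.

    Hypothetical helper: `driver` is any Selenium Chrome driver that
    exposes execute_cdp_cmd.
    """
    # The Network domain is enabled before the extra headers are set.
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd(
        "Network.setExtraHTTPHeaders",
        {"headers": {"Accept": accept}},
    )
```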

Requestly has a selenium Python mechanism, but this has its own driver and I cannot see how it would integrate with seleniumbase undetected Chrome.

The built-in seleniumbase wire=True support won't coexist with uc=True, as far as I can see.

selenium-requests has an option to piggyback on an existing driver, but this is (to be honest) beyond my embryonic Python skills (though it does feel like this might be the answer if I knew how to put it in place).

My scraping requires initial login, so I can't really swap from one driver to another in the course of the scraping session.

Queenhood answered 16/11, 2023 at 11:48 Comment(0)

Here are the code fragments from a second solution that also worked, derived from a now-deleted bountied answer (the .v2 import path was the piece I had not seen before, and I think it is what made this work):

...
from seleniumwire import webdriver
from selenium.webdriver.chrome.options import Options
from seleniumwire.undetected_chromedriver.v2 import Chrome, ChromeOptions
...
chrome_options = ChromeOptions()
driver = Chrome(seleniumwire_options={'options': chrome_options})
driver.header_overrides = {
    'Accept': 'application/json',
}
...
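
selenium-wire also documents request interceptors as a way to change headers, so if header_overrides does not seem to take effect on a given version, a sketch along these lines may be worth trying. The names force_json_accept and make_driver are made up for illustration, the import path is the one from the fragment above, and the whole thing should be treated as untested against any particular selenium-wire release.

```python
def force_json_accept(request):
    """selenium-wire request interceptor: replace the Accept header."""
    # selenium-wire allows duplicate headers, so the old value is
    # deleted before the new one is set.
    del request.headers['Accept']
    request.headers['Accept'] = 'application/json'


def make_driver():
    """Hypothetical factory: build the driver and attach the interceptor."""
    # Imported lazily so the interceptor above can be exercised without
    # selenium-wire installed. Import path taken from the fragment above.
    from seleniumwire.undetected_chromedriver.v2 import Chrome, ChromeOptions

    driver = Chrome(options=ChromeOptions(), seleniumwire_options={})
    driver.request_interceptor = force_json_accept
    return driver
```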
Queenhood answered 14/1 at 19:31 Comment(0)

I have finally found a simple and extraordinarily effective solution that works correctly with uc=True: issue the request from JavaScript inside the browser, as described here: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/871.

Code fragments:

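import io

from seleniumbase import Driver

driver = Driver(uc=True)
login()  # site-specific login, defined elsewhere

# fetch_json is just an illustrative wrapper name for the fragment
def fetch_json(url, outfile):
    # fetch() runs inside the logged-in page; the URL is passed as an
    # argument rather than concatenated into the script text
    script = """
        var url = arguments[0];
        var callback = arguments[arguments.length - 1];
        fetch(url, {method: 'GET', headers: {'Accept': 'application/json'}})
            .then((response) => response.text().then(
                (text) => callback({'status': response.status, 'text': text})));
    """
    response = driver.execute_async_script(script, url)
    print(url + ':' + str(response['status']))
    if response['status'] == 200:
        with io.open(outfile, 'w', encoding='utf8', newline='\n') as f:
            f.write(response['text'])
    return response['status']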

This works very well for my specific use case, which just involves calling an API with GET and getting JSON content back (repeated over and over again).

This also allows me to get the response status, which has made the whole thing much more resilient.
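
As a rough illustration of how the returned status can be used, the sketch below retries until a 200 comes back. It is an assumption-laden example rather than the code I actually run: fetch_with_retry, FETCH_SCRIPT, the attempt count and the delay are all made-up names and values. It also adds a .catch() so that a failed fetch reports status 0 instead of leaving the async script to time out.

```python
import time

# Same fetch() idea as the fragment above, with the URL passed as an
# argument and a .catch() that reports network failures as status 0.
FETCH_SCRIPT = """
var url = arguments[0];
var callback = arguments[arguments.length - 1];
fetch(url, {method: 'GET', headers: {'Accept': 'application/json'}})
  .then((response) => response.text().then(
      (text) => callback({'status': response.status, 'text': text})))
  .catch((error) => callback({'status': 0, 'text': String(error)}));
"""


def fetch_with_retry(driver, url, attempts=3, delay=1.0):
    """Return the last response dict, retrying while the status is not 200."""
    response = {'status': 0, 'text': ''}
    for attempt in range(attempts):
        response = driver.execute_async_script(FETCH_SCRIPT, url)
        if response['status'] == 200:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before the next try
    return response
```

The caller can then decide what to do with a non-200 result, for example logging response['text'] and moving on to the next URL.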

Finally, the performance is fantastic - not surprisingly, I guess, given the much shorter code path.

Queenhood answered 20/11, 2023 at 16:39 Comment(0)
