Selenium headless: How to bypass Cloudflare detection using Selenium

Hoping an expert can help me with a Selenium/Cloudflare mystery. I can get a website to load in normal (non-headless) Selenium, but no matter what I try, I can't get it to load in headless.

I have followed the suggestions from the StackOverflow posts like Is there a version of Selenium WebDriver that is not detectable?. I've also looked at all the properties of window and window.navigator objects and fixed all the diffs between headless and non-headless, but somehow headless is still being detected. At this point I am extremely curious how Cloudflare could possibly figure out the difference. Thank you for the time!

List of the things I have tried:

  • User-agent
  • Replace cdc_ with another string in chromedriver
  • options.add_experimental_option("excludeSwitches", ["enable-automation"])
  • options.add_experimental_option('useAutomationExtension', False)
  • options.add_argument('--disable-blink-features=AutomationControlled') (this was necessary to get website to load in non-headless)
  • Set navigator.webdriver = undefined
  • Set navigator.plugins, navigator.languages, and navigator.mimeTypes
  • Set window.screenY, window.screenTop, window.outerWidth, window.outerHeight to be nonzero
  • Set window.chrome and window.navigator.chrome
  • Set width and height of images to be nonzero
  • Set WebGL parameters
  • Fix Modernizr
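
For the window/navigator comparison mentioned above, one way to spot leftover differences is to dump the properties in both modes (for example with `driver.execute_script` returning a plain object) and diff the two dumps in Python. A minimal sketch, assuming the hypothetical helper name `diff_properties` and made-up sample values that are not measurements from a real browser:

```python
MISSING = "<missing>"  # marker for a key that exists on one side only


def diff_properties(normal: dict, headless: dict) -> dict:
    """Map each differing key to a (normal_value, headless_value) pair."""
    diffs = {}
    for key in sorted(set(normal) | set(headless)):
        normal_value = normal.get(key, MISSING)
        headless_value = headless.get(key, MISSING)
        if normal_value != headless_value:
            diffs[key] = (normal_value, headless_value)
    return diffs


# Illustrative values only, not measurements from a real browser.
normal_dump = {"languages": ["en-US", "en"], "pluginCount": 3, "outerHeight": 825}
headless_dump = {"languages": ["en-US", "en"], "pluginCount": 0, "outerHeight": 0}

for key, (normal_value, headless_value) in diff_properties(normal_dump, headless_dump).items():
    print(f"{key}: normal={normal_value!r} headless={headless_value!r}")
```

A diff like this only covers what the dump includes, so an empty result does not prove the two modes are indistinguishable.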

Replicating the experiment

In order to get the website to load in normal (non-headless) Selenium, you have to follow a _blank link from another website (so that the target website opens in another tab). To replicate the experiment, first create an html file with the content <a href="https://poocoin.app" target="_blank">link</a>, and then paste the path to this html file in the following code.
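
The setup step can also be scripted. A minimal sketch, assuming a hypothetical helper name `write_link_page` and the file name `url_page.html`; it writes the one-link page and returns the `file://` URL to use for `FULL_PATH_TO_HTML_FILE`:

```python
from pathlib import Path


# Hypothetical helper (not part of the original post): writes the one-link
# page described above and returns its file:// URL.
def write_link_page(directory, target_url="https://poocoin.app", name="url_page.html"):
    page = Path(directory).resolve() / name
    page.write_text(
        f'<a href="{target_url}" target="_blank">link</a>\n', encoding="utf-8"
    )
    # as_uri() needs an absolute path, which resolve() guarantees.
    return page.as_uri()
```

For example, `write_link_page(".")` creates the page in the current directory and returns its URL.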

The version below (non-headless) runs fine and loads the website, but if you set options.headless = True, it will get stuck on Cloudflare.

from selenium import webdriver
import time

# Replace this with the path to your html file
FULL_PATH_TO_HTML_FILE = 'file:///Users/simplepineapple/html/url_page.html'

def visit_website(browser):
    browser.get(FULL_PATH_TO_HTML_FILE)
    time.sleep(3)

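    # find_elements_by_xpath was removed in newer Selenium releases;
    # find_elements("xpath", ...) works in both old and new versions.
    links = browser.find_elements("xpath", "//a[@href]")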
    links[0].click()
    time.sleep(10)

    # Switch webdriver focus to new tab so that we can extract html
    tab_names = browser.window_handles
    if len(tab_names) > 1:
        browser.switch_to.window(tab_names[1])

    time.sleep(1)
    html = browser.page_source
    print(html)
    print()
    print()

    if 'Charts' in html:
        print('Success')
    else:
        print('Fail')

    time.sleep(10)


options = webdriver.ChromeOptions()
# If options.headless = True, the website will not load
options.headless = False
options.add_argument("--window-size=1920,1080")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36')

browser = webdriver.Chrome(options = options)

browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    "source": '''
    Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
    });
    Object.defineProperty(navigator, 'plugins', {
        get: function() { return {"0":{"0":{}},"1":{"0":{}},"2":{"0":{},"1":{}}}; }
    });
    Object.defineProperty(navigator, 'languages', {
        get: () => ["en-US", "en"]
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: function() { return {"0":{},"1":{},"2":{},"3":{}}; }
    });

    window.screenY=23;
    window.screenTop=23;
    window.outerWidth=1337;
    window.outerHeight=825;
    window.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    window.navigator.chrome =
    {
      app: {
        isInstalled: false,
      },
      webstore: {
        onInstallStageChanged: {},
        onDownloadProgress: {},
      },
      runtime: {
        PlatformOs: {
          MAC: 'mac',
          WIN: 'win',
          ANDROID: 'android',
          CROS: 'cros',
          LINUX: 'linux',
          OPENBSD: 'openbsd',
        },
        PlatformArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        PlatformNaclArch: {
          ARM: 'arm',
          X86_32: 'x86-32',
          X86_64: 'x86-64',
        },
        RequestUpdateCheckStatus: {
          THROTTLED: 'throttled',
          NO_UPDATE: 'no_update',
          UPDATE_AVAILABLE: 'update_available',
        },
        OnInstalledReason: {
          INSTALL: 'install',
          UPDATE: 'update',
          CHROME_UPDATE: 'chrome_update',
          SHARED_MODULE_UPDATE: 'shared_module_update',
        },
        OnRestartRequiredReason: {
          APP_UPDATE: 'app_update',
          OS_UPDATE: 'os_update',
          PERIODIC: 'periodic',
        },
      },
    };
    ['height', 'width'].forEach(property => {
        const imageDescriptor = Object.getOwnPropertyDescriptor(HTMLImageElement.prototype, property);

        // redefine the property with a patched descriptor
        Object.defineProperty(HTMLImageElement.prototype, property, {
            ...imageDescriptor,
            get: function() {
                // return an arbitrary non-zero dimension if the image failed to load
                if (this.complete && this.naturalHeight == 0) {
                    return 20;
                }
                return imageDescriptor.get.apply(this);
            },
        });
    });

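    // getParameter lives on the prototype, and must be called with the original `this`
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
        // 37445 = UNMASKED_VENDOR_WEBGL
        if (parameter === 37445) {
            return 'Intel Open Source Technology Center';
        }
        // 37446 = UNMASKED_RENDERER_WEBGL
        if (parameter === 37446) {
            return 'Mesa DRI Intel(R) Ivybridge Mobile ';
        }

        return getParameter.call(this, parameter);
    };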

    const elementDescriptor = Object.getOwnPropertyDescriptor(HTMLElement.prototype, 'offsetHeight');

    Object.defineProperty(HTMLDivElement.prototype, 'offsetHeight', {
        ...elementDescriptor,
        get: function() {
            if (this.id === 'modernizr') {
                return 1;
            }
            return elementDescriptor.get.apply(this);
        },
    });
    '''
})

visit_website(browser)

browser.quit()
Marela answered 7/7, 2021 at 16:6 Comment(2)
Are you talking about "I'm under attack mode"? That will run some JS tests that you won't be able to spoof (timing drawing things on canvas maybe?).Annalisaannalise
Thank you for the detailed description of how to make things work in non-headless mode. I have reproduced your experiment and get exactly the same behaviour. I don't have an answer to your question, but perhaps you, like me, can use a virtual framebuffer device to simulate non-headless mode. For me Xvnc worked; I used it because I want to be able to observe the process anyway. Perhaps you can get away with the more lightweight Xvfb.Orlena
R
35

With Google Chrome v96.0, if you retrieve the user agent:

  • For the regular (non-headless) browser the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36
    
  • Whereas for the headless browser the following is in use:

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36
    

In the majority of cases, the presence of the additional Headless string in the user agent is interpreted as a bot, and access to the website is blocked.
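
To illustrate the difference, here is a small sketch that checks for the `HeadlessChrome` product token and rewrites it to the regular `Chrome` token. The function names are hypothetical, and as the question notes, changing the user agent alone may not be enough to get past Cloudflare:

```python
HEADLESS_TOKEN = "HeadlessChrome/"


def is_headless_user_agent(user_agent: str) -> bool:
    """Return True if the UA string carries the HeadlessChrome product token."""
    return HEADLESS_TOKEN in user_agent


def strip_headless_token(user_agent: str) -> str:
    """Rewrite 'HeadlessChrome/<version>' to 'Chrome/<version>'."""
    return user_agent.replace(HEADLESS_TOKEN, "Chrome/")


headless_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/96.0.4664.110 Safari/537.36"
)
cleaned_ua = strip_headless_token(headless_ua)
# The cleaned string could then be passed to Chrome with:
#   options.add_argument(f"user-agent={cleaned_ua}")
```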


Solution

There are different approaches to evade Cloudflare detection even when using Chrome in headless mode. Some of the efficient approaches are as follows:

  • An efficient solution would be to use the undetected-chromedriver to initialize the Chrome Browsing Context. undetected-chromedriver is an optimized Selenium Chromedriver patch which does not trigger anti-bot services like Distill Network / Imperva / DataDome / Botprotect.io. It automatically downloads the driver binary and patches it.

    • Code Block:

      import undetected_chromedriver as uc
      from selenium import webdriver
      
      options = webdriver.ChromeOptions() 
      options.headless = True
      options.add_argument("start-maximized")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = uc.Chrome(options=options)
      driver.get('https://bet365.com')
      

  • The most efficient solution would be to use Selenium Stealth to initialize the Chrome Browsing Context. selenium-stealth is a python package to prevent detection. This programme tries to make python selenium more stealthy.

    • Code Block:

      from selenium import webdriver
      from selenium_stealth import stealth
      
      options = webdriver.ChromeOptions()
      options.add_argument("start-maximized")
      options.add_argument("--headless")
      options.add_experimental_option("excludeSwitches", ["enable-automation"])
      options.add_experimental_option('useAutomationExtension', False)
      driver = webdriver.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
      
      stealth(driver,
              languages=["en-US", "en"],
              vendor="Google Inc.",
              platform="Win32",
              webgl_vendor="Intel Inc.",
              renderer="Intel Iris OpenGL Engine",
              fix_hairline=True,
              )
      
      driver.get("https://bot.sannysoft.com/")
      

Rintoul answered 27/12, 2021 at 21:6 Comment(4)
Thank you, seems Cloudflare was detecting headless chrome and flagging the site in my case, have since changed the user-agent, though would have preferred to use the default oneIcebox
For me, undetected_chromedriver did the trick. I did not need any of the options you mention like 'excludeSwitches' and 'useAutomationExtension'. selenium_stealth did not work for meCapillaceous
The solution doesn't seem to be working nowIndispose
selenium_stealth works for meRhizobium
S
2

You need to read the source code of the latest Chromium. It removes a large amount of functionality in headless mode. What are the Cloudflare developers doing? They find the places where headless mode is used and try to distinguish the behaviour of headless and non-headless objects. There are many workarounds in Chromium today that make detecting its internal headless mode an easy task.

Meanwhile, I can't understand why people use Chromium's internal headless mode. You can just run the regular browser on a headless Wayland or headless X11 display and forget about this issue. That lets you concentrate on more important things.
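
For the headless X11 route, a rough sketch using Xvfb could look like the following. It assumes a Linux host with Xvfb installed; the helper names and the display number `:99` are my own choices, not something from this answer:

```python
import os
import shutil
import subprocess


def xvfb_command(display=":99", width=1920, height=1080, depth=24):
    """Build the Xvfb command line for one virtual screen."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x{depth}"]


def start_virtual_display(display=":99"):
    """Start Xvfb and point DISPLAY at it; returns the Popen handle.

    Assumes Xvfb is on PATH; raises RuntimeError otherwise.
    """
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb not found on PATH")
    process = subprocess.Popen(xvfb_command(display))
    # Programs started afterwards (including Chrome) inherit this variable.
    os.environ["DISPLAY"] = display
    return process
```

After `start_virtual_display()`, Chrome can be launched through Selenium without any headless option, and the returned process should be terminated once the browser has quit. A real script would probably also wait briefly for Xvfb to come up before starting the browser.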

Snowberry answered 18/9, 2023 at 21:38 Comment(2)
This seems like the best way forward, but I do not even know where to start. Do you have any resources I could follow? I tried googling for headless wayland but only super geeky, hard-to-follow resources are coming up.Norty
@tomitrescak, I've prepared a docker image recently docker-ncalayer. It launches Kazakhstan keys service in gui mode inside container. Please read entrypoint.sh, it includes launcher for sway.Snowberry
A
1

@undetected Selenium's answer works perfectly with https://github.com/diprajpatra/selenium-stealth

If you are using the latest version of Selenium, you will need to replace the executable_path parameter, as it is deprecated. Example code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
)

driver.get("https://bot.sannysoft.com/")

print(driver.find_element(By.XPATH, "/html/body").text)

driver.close()
Aristotelianism answered 20/9, 2022 at 13:38 Comment(0)
F
1

I have mixed both the libraries undetected-chromedriver and selenium-stealth, which has solved my problem. It is no longer detectable by Cloudflare Challenge.

Following is a function that I am using to generate a driver for me:

import undetected_chromedriver as uc
from selenium_stealth import stealth

def gen_driver(self):
    try:
        user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.6167.140 Safari/537.36"
        chrome_options = uc.ChromeOptions()
        chrome_options.add_argument('--headless=new')
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_argument("user-agent={}".format(user_agent))
        driver = uc.Chrome(options=chrome_options)
        stealth(driver,
                languages=["en-US", "en"],
                vendor="Google Inc.",
                platform="Win32",
                webgl_vendor="Intel Inc.",
                renderer="Intel Iris OpenGL Engine",
                fix_hairline=True
        )
        return driver
    except Exception as e:
        print("Error in Driver: ",e)

In the selenium-stealth documentation it was recommended to add the following options too:

chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

However, these options do not work with undetected-chromedriver so I removed them. Everything else is the same.

Not sure why this works but my guess is that selenium-stealth adds some render information that bypasses Cloudflare.

Flagler answered 3/2 at 19:50 Comment(9)
Hello! do you think if Firefox has something like that? Thank you!Finegrained
It seems this one does not open a new browser. However, I need to click on the "ok" button to get to the main page. What could I do to get on the page?Parkin
it is not working on Iryo for example :(Finegrained
@M.Mariscal I think you can use firefox driver with this.Flagler
@JimmyZhao I also noticed that, but in my case it remains there for a few seconds and then vanishes. You can also click it; there is a way in Selenium. Please check it out here: guru99.com/alert-popup-handling-selenium.htmlFlagler
@M.Mariscal My solution only works for basic Cloudflare Challenges.Flagler
@MuhammadMobeen Thanks, is there any reference for Python usage?Parkin
@JimmyZhao Check this out: #71538164Flagler
Worked for me, thanks. I needed to run the driver in headless mode on the Heroku platform and I only managed to do so using this solution.Entreat
M
0

The only thing I can suggest in addition is to improve your navigator plugins and MIME types, because detection scripts sometimes check that navigator.plugins really is a PluginArray (and likewise for the other types):

Object.defineProperty(navigator, 'plugins', {
    get: () => {
        var ChromiumPDFPlugin = {};
        var plugin = {
            ChromiumPDFPlugin,
            description: 'Portable Document Format',
            filename: 'internal-pdf-viewer',
            length: 1,
            name: 'Chromium PDF Plugin',

        };
        plugin.__proto__ = Plugin.prototype;

        var plugins = {
            0: plugin,
            length: 1
        };
        plugins.__proto__ = PluginArray.prototype;
        return plugins;
    },
});

Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        var mimeType = {
            type: 'application/pdf',
            suffixes: 'pdf',
            description: 'Portable Document Format',
            enabledPlugin: Plugin

        };
        mimeType.__proto__ = MimeType.prototype;

        var mimeTypes = {
            0: mimeType,
            length: 1
        };
        mimeTypes.__proto__ = MimeTypeArray.prototype;
        return mimeTypes;
    },
});

A good website for checking what is going wrong in headless mode is https://bot.sannysoft.com/

You can run in headless mode and take a page snapshot to check whether everything passed.
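
A minimal sketch of that snapshot step, assuming an already-created Selenium `driver`; the helper name and the output file name are hypothetical:

```python
def snapshot_bot_checks(driver, path="sannysoft.png", url="https://bot.sannysoft.com/"):
    """Load the check page and save a screenshot; returns the driver's result."""
    driver.get(url)
    # save_screenshot returns True on success and False on an I/O error.
    return driver.save_screenshot(path)
```

The page runs its checks in JavaScript after load, so a real script may need a short wait before taking the screenshot.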

P.S. Also, sometimes, even if navigator.webdriver is set to undefined, navigator still contains the webdriver property. You can simply remove it using the code below:

const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;
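
To run a snippet like this before any page script, it can be registered through the same `Page.addScriptToEvaluateOnNewDocument` CDP command the question uses. A sketch, assuming a Chromium-based Selenium driver; the helper name `install_init_script` is hypothetical:

```python
# JavaScript from the snippet above, kept as a Python string.
REMOVE_WEBDRIVER_JS = """
const newProto = navigator.__proto__;
delete newProto.webdriver;
navigator.__proto__ = newProto;
"""


def install_init_script(driver, source=REMOVE_WEBDRIVER_JS):
    """Register `source` to be evaluated on every new document.

    `driver` is expected to be a Chromium-based Selenium driver, since
    execute_cdp_cmd is only available there.
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": source}
    )
```

It needs to be called before `driver.get(...)`, because the script only applies to documents created after registration.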
Makhachkala answered 25/4, 2023 at 14:49 Comment(0)
B
0

Based on @undetected-selenium's answer: for the best stealth, use both libraries together.

    import undetected_chromedriver as uc
    from selenium import webdriver
    from selenium_stealth import stealth
    
    options = webdriver.ChromeOptions() 
    options.headless = True
    options.add_argument("start-maximized")
    
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = uc.Chrome(options=options, executable_path=r"C:\path\to\chromedriver.exe")
    
    stealth(driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )
    
    driver.get("https://logical.com/")
Backandforth answered 3/4 at 10:6 Comment(0)
D
-1

Cloudflare's IUAM ("I'm Under Attack Mode") protection is used primarily to mitigate DDoS attacks, and as a consequence it also protects sites from automated bots. So no matter what you use on the client side, the Cloudflare server is fingerprinting you. After that, it sends the client the cf_clearance cookie, which allows you to connect for the next 15 minutes.
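
To see whether a session currently holds that cookie and how long it has left, the list returned by Selenium's `driver.get_cookies()` can be inspected. A small sketch; the function name is hypothetical and the 15-minute lifetime is simply the figure stated above:

```python
import time


def clearance_seconds_left(cookies, now=None):
    """Return seconds until the cf_clearance cookie expires.

    `cookies` is a list of dicts shaped like driver.get_cookies() output,
    where 'expiry' (when present) is a Unix timestamp in seconds.
    Returns None if the cookie is absent or carries no expiry.
    """
    now = time.time() if now is None else now
    for cookie in cookies:
        if cookie.get("name") == "cf_clearance":
            expiry = cookie.get("expiry")
            return None if expiry is None else max(0, expiry - now)
    return None
```

As the comment below points out, holding a valid cf_clearance cookie does not by itself guarantee that a WebDriver session is accepted.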

Decedent answered 31/12, 2021 at 14:40 Comment(1)
I noticed the cf_clearance cookie is used to bypass the CAPTCHA once validated but even if I reuse this cookie in my WebDriver script, it is still asking me to complete the CAPTCHA while it is still a valid cookie in Firefox without WebDriver. The user agent is the same, so they are checking something else, maybe navigator.webdriver JavaScript variable?Swaim
J
-2

pip install undetected-chromedriver

You can use this module

Jenny answered 15/5, 2023 at 9:5 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Endothecium
The top-voted answer from a year and a half before this answer already suggests installing undetected-chromedriver. Please don't repeat answers.Scruggs
