How to use Selenium in Databricks, access and move downloaded files to mounted storage, and keep Chrome and ChromeDriver versions in sync?

I've seen a couple of posts on using Selenium in Databricks that use %sh to install ChromeDriver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the Databricks filesystem. Even if I changed the download path to a mounted folder on Azure Blob Storage when instantiating Chrome, the file would not be placed there after downloading. There is also the problem of keeping the Chrome browser and ChromeDriver versions in sync automatically, without manually changing version numbers.

The following links show people with the same problem but no clear answer:

https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html

https://forums.databricks.com/questions/45388/selenium-in-databricks-with-add-experimental-optio.html

Is there a way to identify where the file gets downloaded in Azure Databricks when I do web automation using Selenium Python?

And some struggling with getting Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html

A "not in PATH" error: https://webcache.googleusercontent.com/search?q=cache:NrvVKo4LLdIJ:https://mcmap.net/q/1632127/-cannot-get-selenium-webdriver-to-work-in-azure-databricks+&cd=5&hl=en&ct=clnk&gl=us

Is there a clear guide to use Selenium on Databricks and manage downloaded files? And how can I keep the Chrome browser and ChromeDriver versions in sync automatically?

Parrish answered 4/6, 2021 at 0:26 Comment(0)

Here is a guide to installing Selenium, Chrome, and ChromeDriver. It also moves a file downloaded via Selenium to your mounted storage. Run each numbered step in its own notebook cell.

  1. Install Selenium
%pip install selenium
  2. Do your imports.
import pickle as pkl
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
  3. Download the latest ChromeDriver to /tmp/ on the driver node's local file system. The curl command gets the latest ChromeDriver version and stores it in the version variable. Use ${version} without a backslash here: an escaped \$ stops the shell from expanding the variable.
%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

  4. Unzip the file to a new folder under /tmp/ on the driver's local file system. I tried using a non-root path and it did not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
  5. Get the latest Chrome download and install it.
%sh
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable

** Steps 3 - 5 can be combined into one command. You can also use the following cell to create a shell script and configure it as an init script for your clusters. This is especially useful for job clusters, which are transient, because an init script runs on every node of the cluster (the workers as well as the driver) each time the cluster starts. The script also installs Selenium, so you can skip step 1. Paste it into one cell of a new notebook, run it, then point your cluster's init script setting to dbfs:/init/init_selenium.sh. From then on, every time the cluster spins up, Chrome, ChromeDriver, and Selenium are installed on all nodes before your job begins to run.

%sh
# Writes dbfs:/init/init_selenium.sh. The quoted 'EOF' keeps the backticks and
# ${version} literal, so the version is looked up each time the script runs.
mkdir -p /dbfs/init
cat > /dbfs/init/init_selenium.sh <<'EOF'
#!/bin/sh
echo Install Chrome and Chrome driver
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh
  6. Configure your storage account. The example is for Azure Blob Storage using ADLS Gen2.
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}
  7. Configure your mounting location and mount.
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

# Compare against each mount's mountPoint, ignoring a trailing slash.
if not any(m.mountPoint.rstrip("/") == mount_point.rstrip("/") for m in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = storage,
    mount_point = mount_point,
    extra_configs = configs)
  print(mount_point + " has been mounted.")
else:
  print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")
  8. Create a function for instantiating the Chrome browser. I need to load a cookies file that I placed in my utils folder, which points to mnt/container-data/utils/selenium. Make sure the arguments are the same (no-sandbox, headless, disable-dev-shm-usage).
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the ChromeDriver executable binary.
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.

    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    
    options = Options()
    prefs = {'download.default_directory' : download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('window-size=2560,1440')
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Loading cookies...")
    with open(cookies_path, "rb") as f:  # close the file once the cookies are read
        cookies = pkl.load(f)
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    print(f"{datetime.now()}    Cookies loaded.")
    print(f"{datetime.now()}    Browser ready to use.")
    return browser
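The function above assumes a cookies .pkl file already exists. Here is a minimal sketch of how such a file might be written and read back, assuming the cookies are the list of dicts that Selenium's browser.get_cookies() returns after you have logged in from a browser session you control. The helper names save_cookies and load_cookies are made up for illustration and are not part of the original answer.

```python
import pickle as pkl


def save_cookies(cookies, cookies_path):
    """Pickle a list of cookie dicts (the shape returned by browser.get_cookies())."""
    with open(cookies_path, "wb") as f:
        pkl.dump(cookies, f)


def load_cookies(cookies_path):
    """Read the pickled list back; each dict can be passed to browser.add_cookie()."""
    with open(cookies_path, "rb") as f:
        return pkl.load(f)


# Hypothetical usage with a logged-in browser session:
# save_cookies(browser.get_cookies(), "/dbfs/mnt/container-data/utils/selenium/cookies.pkl")
```

Note that add_cookie only accepts cookies for the domain currently loaded, which is why the function above calls browser.get(url) before adding them.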
  9. Instantiate the browser. Set the download location to /tmp/downloads on the driver's local file system. Make sure the cookies path has /dbfs in front, so the full cookies path looks like /dbfs/mnt/...
browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
    url="YOUR_URL"
)
  10. Do your navigating and any downloads you need.

  11. OPTIONAL: Examine your download location. In this example, I downloaded a CSV file, so I walk through /tmp until I find a folder containing a file of that format.

import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
  12. Copy the file from the driver's local /tmp to your mounted storage (/mnt/container-data/raw/). You can rename it during this operation as well. With dbutils, you can only access the driver's local file system by using the file: prefix.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
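One thing the steps above do not cover is that the copy can run before Chrome has finished writing the file. Here is a rough sketch of a wait loop, assuming Chrome keeps a partial download under a .crdownload name until it completes. The names wait_for_download and to_local_uri are hypothetical helpers, not part of Selenium or dbutils.

```python
import os
import time


def wait_for_download(download_dir, suffix=".csv", timeout=60.0, poll=0.5):
    """Return paths of finished files ending in `suffix` once no partial
    Chrome downloads (*.crdownload) remain; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(download_dir) if os.path.isdir(download_dir) else []
        partial = [n for n in names if n.endswith(".crdownload")]
        done = sorted(n for n in names if n.endswith(suffix))
        if done and not partial:
            return [os.path.join(download_dir, n) for n in done]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No finished {suffix} file in {download_dir}")
        time.sleep(poll)


def to_local_uri(path):
    """Prefix a driver-local path with file: so dbutils.fs reads the local disk."""
    return "file:" + path


# Hypothetical usage in a Databricks notebook, after triggering the download:
# for path in wait_for_download("/tmp/downloads"):
#     dbutils.fs.cp(to_local_uri(path), raw_folder + os.path.basename(path))
```

The timeout and polling interval are arbitrary defaults; adjust them to the size of the files you expect.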
Parrish answered 4/6, 2021 at 0:26 Comment(4)
You can use webdriver-manager to skip the whole downloading and installing part: %pip install webdriver-manager, then from webdriver_manager.chrome import ChromeDriverManager and browser = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options) - Stygian
@MichaelH. This may work in non-Databricks notebooks, but I have not had success with it in a Databricks notebook. Plus, my solution reduces external library dependencies. - Parrish
@kindofhungry: I am receiving "Launching Chrome... NameError: name 'Service' is not defined". Where should I put/define Service? - Neckerchief
@AliSaberi Hmm... Ensure that you are using the latest version of Selenium. Try importing it with from selenium.webdriver.chrome.service import Service. Let me know if that works. I may have forgotten that import when I updated my answer. - Parrish

I've been using the guide in the first answer to install Selenium in Databricks. The third step didn't work for me because of the '\' symbol before '${version}'. Without it, the step works:

%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

Also, after changing the body of the init_chrome_browser function as follows, using ChromeDriverManager did not give me an error.

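# Imports this snippet needs (webdriver-manager must be installed first).
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()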
prefs = {'download.default_directory' : download_path}
options.add_experimental_option('prefs', prefs)
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--start-maximized')
options.add_argument('window-size=2560,1440')
print(f"{datetime.now()}    Launching Chrome...")
browser = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)
Swizzle answered 27/10, 2023 at 8:25 Comment(0)

I used the answer provided by @kindofhungry as well as this resource to finally get Selenium working in my Databricks notebook. I had to tweak a couple of things to get it working, so hopefully someone else finds this helpful.

Make sure to import Service:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

On Step 3 I had to use this code:

%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip

For Step 4 I had to do this instead:

%sh
sudo mkdir -p /tmp/chromedriver
sudo unzip -o /tmp/chromedriver_linux64.zip -d /tmp/chromedriver

For Step 5 I had to manually downgrade Chrome to a version that matches the driver:

%sh
sudo apt-get update
wget https://dl.google.com/linux/chrome/deb/pool/main/g/google-chrome-stable/google-chrome-stable_114.0.5735.198-1_amd64.deb
sudo apt-get -y update
sudo apt -y install ./google-chrome*.deb
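To confirm that the pinned Chrome really matches the driver before launching Selenium, something like the following could be run in a Python cell. It is only a sketch: it assumes both binaries print a four-part version number when called with --version, and the function names are made up for illustration. The default paths are the ones used in this thread.

```python
import re
import subprocess


def major_version(version_output):
    """Extract the major version from text such as 'Google Chrome 114.0.5735.198'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"No version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """True when Chrome and ChromeDriver report the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


def installed_versions_match(chrome_bin="google-chrome",
                             driver_bin="/tmp/chromedriver/chromedriver"):
    """Run both binaries with --version and compare their major versions."""
    chrome = subprocess.run([chrome_bin, "--version"],
                            capture_output=True, text=True, check=True).stdout
    driver = subprocess.run([driver_bin, "--version"],
                            capture_output=True, text=True, check=True).stdout
    return versions_match(chrome, driver)
```

If installed_versions_match() returns False, ChromeDriver will typically refuse to start the session, which is the mismatch the downgrade above works around.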
Jesicajeske answered 18/1 at 22:27 Comment(0)
