Here is the guide to installing Selenium, Chrome, and ChromeDriver. This will also move a file after downloading via Selenium to your mounted storage. Each number should be in its own cell.
- Install Selenium
%pip install selenium
- Do your imports
import pickle as pkl
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
- Download the latest ChromeDriver to the DBFS root storage
/tmp/
. The curl command will get the latest Chrome version and store in the version
variable. Note the escape \
before the $
.
%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
- Unzip the file to a new folder in the DBFS root
/tmp/
. I tried to use non-root path and it does not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
- Get the latest Chrome download and install it.
%sh
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
** Steps 3 - 5 can be combined into one command. You can also use the following to create a shell script and use it as an init file to configure for your clusters and is especially useful when using job clusters which use transient clusters because init scripts apply to all worker nodes rather than just the driver node. This also installs Selenium, allowing you to skip step 1. Just paste in one cell in a new notebook, run, then point your init script to dbfs:/init/init_selenium.sh
. Now every time the cluster or transient cluster spins up, this will install Chrome, ChromeDriver, and Selenium on all worker nodes before your job begins to run.
%sh
# dbfs:/init/init_selenium.sh
cat > /dbfs/init/init_selenium.sh <<EOF
#!/bin/sh
echo Install Chrome and Chrome driver
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/\${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
sudo curl -sS -o - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add
sudo echo "deb https://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh
- Configure your storage account. Example is for Azure Blob Storage using ADLSGen2.
service_principal_id = "YOUR_SP_ID"
service_principle_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
"fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
"fs.azure.account.oauth2.client.id": service_principal_id,
"fs.azure.account.oauth2.client.secret": service_principle_key,
"fs.azure.account.oauth2.client.endpoint": directory,
"fs.azure.createRemoteFileSystemDuringInitialization": "true"}
- Configure your mounting location and mount.
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"
if not any(mount_point in mount_info for mount_info in dbutils.fs.mounts()):
dbutils.fs.mount(
source = storage,
mount_point = mount_point,
extra_configs = configs)
print(mount_point + " has been mounted.")
else:
print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")
- Create method for instantiating Chrome browser. I need to load in a cookies file that I have placed in my
utils
folder which points to mnt/container-data/utils/selenium
. Make sure the arguments are the same (no sandbox, headless, disable-dev-shm-usage)
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
"""
Instatiates a Chrome browser.
Parameters
----------
download_path : str
The download path to place files downloaded from this browser session.
chrome_driver_path : str
The path of the chrome driver executable binary (.exe file).
cookies_path : str
The path of the cookie file to load in (.pkl file).
url : str
The URL address of the page to initially load.
Returns
-------
Browser
Returns the instantiated browser object.
"""
options = Options()
prefs = {'download.default_directory' : download_path}
options.add_experimental_option('prefs', prefs)
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--start-maximized')
options.add_argument('window-size=2560,1440')
print(f"{datetime.now()} Launching Chrome...")
browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
print(f"{datetime.now()} Chrome launched.")
browser.get(url)
print(f"{datetime.now()} Loading cookies...")
cookies = pkl.load(open(cookies_path, "rb"))
for cookie in cookies:
browser.add_cookie(cookie)
browser.get(url)
print(f"{datetime.now()} Cookies loaded.")
print(f"{datetime.now()} Browser ready to use.")
return browser
- Instatiate browser. Set the downloads location to the DBFS root file system
/tmp/downloads
. Make sure the cookies path has /dbfs
in front so the full cookies path is like /dbfs/mnt/...
browser = init_chrome_browser(
download_path="/tmp/downloads",
chrome_driver_path="/tmp/chromedriver/chromedriver",
cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
url="YOUR_URL"
)
Do your navigating and any downloads you need.
OPTIONAL: Examine your download location. In this example, I downloaded a CSV file and will search through the downloaded folder until I find that file format.
import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
print(root)
if any(".csv" in s for s in filenames):
print(filenames)
break
- Copy the file from DBFS root tmp to your mounted storage (
/mnt/container-data/raw/
). You can rename during this operation as well. You can only access root file system using file:
prefix when using dbutils.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv')
%pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
– Stygian