I have a number of scripts that scrape the web, download files, and then read them with pandas. This procedure must be deployed under a new architecture in which saving files to disk is not acceptable; instead, the file should be held in memory and read with pandas from there.
The website doesn't provide a direct link to the file; instead, it provides a button that downloads the file through a form submission.
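To make the "read from memory" part concrete, here is a minimal sketch of what I mean, assuming the CSV bytes are already available in a Python variable. The helper name `read_csv_from_memory` and the sample payload are hypothetical and only stand in for whatever the site returns.

```python
import io

import pandas as pd


def read_csv_from_memory(raw: bytes, **read_csv_kwargs) -> pd.DataFrame:
    """Parse CSV bytes held in memory without writing them to disk."""
    buffer = io.BytesIO(raw)  # file-like object backed by RAM, not a file on disk
    return pd.read_csv(buffer, **read_csv_kwargs)


# Hypothetical payload standing in for the bytes returned by the website.
sample = b"id,name\n1,alpha\n2,beta\n"
df = read_csv_from_memory(sample)
print(df.shape)
```

The open question is how to get those bytes out of the browser-driven download in the first place.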
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
chrome_options = webdriver.ChromeOptions()
prefs = {'download': {'default_directory': ...}}  # placeholder: should be a link to memory
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options, service=Service(ChromeDriverManager().install()))
driver = login(driver)  # my own helper that logs in and returns the driver
driver.find_element(By.CSS_SELECTOR, '#CSVButton').click()  # This button downloads the file.
download_wait()  # a function to check whether the download has finished
download_wait is just a function that checks the download directory for any remaining .crdownload files.
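import os
import time

def download_wait():
    """Wait (up to 200 s) until no .crdownload files remain; return the seconds waited."""
    path_to_downloads = OUTPUT_FOLDER
    seconds = 0
    dl_wait = True
    while dl_wait and seconds < 200:
        time.sleep(1)  # poll once per second
        # A .crdownload file means Chrome is still writing a download.
        dl_wait = any(fname.endswith('.crdownload') for fname in os.listdir(path_to_downloads))
        seconds += 1
    return seconds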
The input tag that downloads the file is as follows.
<input name="CSVButton" type="button" id="CSVButton" onclick="javascript: this.form.OutputType.value = 'CSV'; this.form.submit(); this.form.OutputType.value = 'HTML'; " value="CSV">
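Since the onclick handler only sets OutputType to 'CSV' and submits the form, one idea is to rebuild that submission outside the browser. Below is a rough sketch, not a tested solution for this site: it collects the form's input fields from the page source and overrides OutputType the way the button does. The class `FormFieldCollector`, the function `build_csv_payload`, and the sample form (including its action URL and the ReportId field) are hypothetical; a real page may also need `<select>` or `<textarea>` values, which this sketch ignores.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements a browser would submit."""

    # Input types that are not sent when the form is submitted via form.submit().
    SKIPPED_TYPES = {"button", "submit", "reset", "image", "file"}

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        input_type = (attr.get("type") or "text").lower()
        if not name or input_type in self.SKIPPED_TYPES:
            return
        # Unchecked checkboxes and radio buttons are not submitted.
        if input_type in {"checkbox", "radio"} and "checked" not in attr:
            return
        self.fields[name] = attr.get("value") or ""


def build_csv_payload(page_source: str) -> dict:
    """Return the form data with OutputType forced to 'CSV', as the button does."""
    collector = FormFieldCollector()
    collector.feed(page_source)
    payload = dict(collector.fields)
    payload["OutputType"] = "CSV"  # mirrors this.form.OutputType.value = 'CSV'
    return payload


# Hypothetical page source; in practice this would be driver.page_source.
sample_page = """
<form method="post" action="/report">
  <input type="hidden" name="OutputType" value="HTML">
  <input type="hidden" name="ReportId" value="42">
  <input name="CSVButton" type="button" id="CSVButton" value="CSV">
</form>
"""
payload = build_csv_payload(sample_page)
encoded = urlencode(payload)
print(encoded)
```

If this works for the real form, the payload could presumably be POSTed to the form's action URL with an HTTP client that carries the cookies from `driver.get_cookies()`, and the response bytes wrapped in `io.BytesIO` for `pandas.read_csv`, so nothing touches the disk.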
Does driver.find_element(By.CSS_SELECTOR, '#CSVButton').get_attribute("href") have a value? – Gettysburg