Python Selenium to Download File to Memory

I have a number of scripts that scrape the web, grab files, then read them using pandas. This procedure must be deployed under a new architecture in which saving downloaded files to disk is not acceptable; instead, the file should be held in memory and read with pandas from there.

The website doesn't provide a direct link to the file; instead, it provides a button that uses form submission to download it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


chrome_options = webdriver.ChromeOptions()
prefs = {'download': {'default_directory': OUTPUT_FOLDER}}  # want this to be a link to memory, not a folder on disk
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options,service=Service(ChromeDriverManager().install()))

driver.get("https://www.speedchex.com/")
driver = login(driver) 

WebDriverWait(driver,15).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"#CSVButton")))
driver.find_element(By.CSS_SELECTOR,'#CSVButton').click()   #This button Downloads the file.
download_wait()  # a function to check if download is finished or not
driver.quit()

download_wait is just a function that checks the download directory for any .crdownload files.

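import glob
import os
import time

def download_wait():
    path_to_downloads = OUTPUT_FOLDER
    seconds = 0
    dl_wait = True
    # Poll once per second, giving up after 200 seconds
    while dl_wait and seconds < 200:
        time.sleep(1)
        # Chrome keeps a .crdownload file while a download is still in progress
        dl_wait = bool(glob.glob(os.path.join(path_to_downloads, '*.crdownload')))
        seconds += 1
    return seconds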

The input tag that downloads the file is as follows.

<input name="CSVButton" type="button" id="CSVButton"  onclick="javascript: this.form.OutputType.value = 'CSV'; this.form.submit(); this.form.OutputType.value = 'HTML'; " value="CSV">
Fenny answered 11/2, 2022 at 13:34 Comment(13)
You asked me to look at this question. Can you share the code for the download_wait() function? - Gettysburg
The function just checks for pending download files (it works on a normal file system). This would only work if we have some kind of virtual file system. - Fenny
Is there an href link tied to #CSVButton? - Gettysburg
Also, what is the final file format after the download is completed? - Gettysburg
No, there is no href. The final format is CSV. - Fenny
In addition, I have added the HTML tag that downloads the file. - Fenny
So when you click the button, it downloads a CSV file to a file system on a physical disk? - Gettysburg
Also, how are you handling the CSV file after it has been downloaded? - Gettysburg
Yes, that's exactly what it's doing now. I can change the download path too, but that's not what I need. I will be deploying the script on a server that doesn't support a filesystem, so what I need is for Selenium to download the file to memory, not to a physical disk, so that I can read it directly with pandas. Thanks for reading. - Fenny
OK, thanks. I will put some code together later today for you. You will have to test it on the server. - Gettysburg
Thanks, mate! - Fenny
So far the auto download isn't working to the default directory. Does driver.find_element(By.CSS_SELECTOR,'#CSVButton').get_attribute("href") have a value? - Gettysburg
Have you found a solution, @TigerStrom? I'm having the exact same issue. Fiddling with the OS may not be an option. - Sippet

Selenium just passes commands down to the browser, which runs in a different process from your Python program, so the usual approach of creating an object that emulates a file (io.BytesIO) can't work in this case.

Your only approach is to create an in-memory filesystem and set the browser's download directory to point at it. How to create an in-memory filesystem and where it is located will vary with your operating system, but on Linux it is as easy as sudo mount -t tmpfs -o size=1024m myramdisk <mountpoint> (use subprocess or plain os.system to issue that command). You can even use "/home/user/Downloads" as the mountpoint, and then you won't need to worry about changing any config in the browser.
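As a rough sketch, issuing that mount command from Python with subprocess could look like the following. The helper names build_mount_command and mount_ramdisk are made up for illustration, and it assumes a Linux host where the script is allowed to run sudo mount without a password prompt.

```python
import subprocess


def build_mount_command(mountpoint, size_mb=1024):
    """Return the tmpfs mount command from above as an argument list."""
    return ["sudo", "mount", "-t", "tmpfs", "-o", f"size={size_mb}m",
            "myramdisk", mountpoint]


def mount_ramdisk(mountpoint, size_mb=1024):
    """Mount a RAM-backed filesystem at `mountpoint` (Linux only).

    Raises subprocess.CalledProcessError if the mount command fails.
    """
    subprocess.run(build_mount_command(mountpoint, size_mb), check=True)
```

Passing the command as a list avoids shell quoting problems if the mount point contains spaces.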

It will work as a normal filesystem for both your browser and your Selenium script, so the normal file operation calls will work on it. You just have to arrange to unmount the filesystem when the program exits.
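Since ordinary file calls work on the mounted directory, picking up the downloaded file is plain Python. Here is a small sketch, assuming the download has already finished and the directory contains at least one .csv file; newest_csv_as_buffer is a hypothetical helper name, not part of Selenium or pandas.

```python
import glob
import io
import os


def newest_csv_as_buffer(directory):
    """Return the most recently modified .csv in `directory` as a BytesIO.

    Returns None if the directory holds no .csv files.
    """
    candidates = glob.glob(os.path.join(directory, "*.csv"))
    if not candidates:
        return None
    newest = max(candidates, key=os.path.getmtime)
    with open(newest, "rb") as fh:
        return io.BytesIO(fh.read())
```

The returned buffer can be handed to pandas.read_csv, which accepts file-like objects; passing the file path straight to pandas.read_csv should work just as well.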

For that, Python's atexit handler can be useful: https://docs.python.org/3/library/atexit.html
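A possible way to wire up the cleanup with atexit is sketched below. unmount_ramdisk and register_cleanup are illustrative names, and the sketch assumes the same Linux and sudo setup as the mount command above.

```python
import atexit
import subprocess


def unmount_ramdisk(mountpoint, run=subprocess.run):
    """Best-effort unmount; `run` is injectable so it can be faked in tests."""
    # check=False: during interpreter shutdown we do not want to raise
    run(["sudo", "umount", mountpoint], check=False)


def register_cleanup(mountpoint):
    """Arrange for the ramdisk to be unmounted on normal interpreter exit."""
    atexit.register(unmount_ramdisk, mountpoint)
```

Note that atexit handlers run only on normal interpreter termination; they are skipped if the process is killed by an unhandled signal or exits through os._exit().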

Mussorgsky answered 11/2, 2022 at 13:50 Comment(0)
