Download large file in python with requests

Requests is a really nice library. I'd like to use it for downloading big files (>1 GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is where the following code goes wrong:

import requests

def DownloadFile(url)
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

For some reason it doesn't work this way; it still loads the response into memory before it is saved to a file.

Sat answered 22/5, 2013 at 14:47 Comment(1)
The requests library is nice, but not intended for this purpose. I would suggest using a different library such as urllib3. #17285964Wound
978

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:

import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                # If the response is chunk-encoded, uncomment the if below
                # and set the chunk_size parameter to None.
                #if chunk: 
                f.write(chunk)
    return local_filename

Note that the number of bytes returned by iter_content is not exactly chunk_size; it is effectively a random number that is often far bigger, and it is expected to differ on every iteration.

See body-content-workflow and Response.iter_content for further reference.
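
If you also want a simple progress readout while streaming, here is a minimal sketch (not part of the original answer; it assumes the server sends a Content-Length header, and the chunk size is an arbitrary choice):

import requests

def download_file_with_progress(url, local_filename, chunk_size=8192):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('Content-Length', 0))  # 0 if the server doesn't say
        done = 0
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                done += len(chunk)
                if total:
                    print(f'\r{done * 100 // total}%', end='', flush=True)
    print()
    return local_filename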

Sat answered 22/5, 2013 at 15:52 Comment(41)
@Shuman This code successfully downloads files which are bigger than 1.5 GB. Can you download the file via any browser successfully?Sat
Yes, in Firefox, if I download manually, it successfully saves out a 1.5 GB .zip file.Orpah
@Orpah As I see, you resolved the issue when you switched from http:// to https:// (github.com/kennethreitz/requests/issues/2043). Can you please update or delete your comments? People may think that there are issues with the code for files bigger than 1024 MB.Sat
The chunk_size is crucial. By default it's 1 (1 byte). That means that for 1 MB it'll make 1 million iterations. docs.python-requests.org/en/latest/api/…Hipolitohipp
Is it possible to parallelize the iter_content() part somehow to speed up the download? Thanks!Phytophagous
@RomanPodlinov, do you mind telling me why you are using the flush?Balbriggan
you could use url.rsplit('/', 1)[1] as well, which will not split the whole url but only the last part of it.Petree
@RovinBhandari: to parallelize, find out whether there is support for bytes range http header in requestsPorcia
url.split('/')[-1] might be too simplistic e.g., see url2filename()Porcia
f.flush() seems unnecessary. What are you trying to accomplish using it? (your memory usage won't be 1.5gb if you drop it). f.write(b'') (if iter_content() may return an empty string) should be harmless and therefore if chunk could be dropped too.Porcia
@J.F.Sebastian Agreed, url2filename is better. About flush: the idea is to flush the data into the physical file on the drive. If you see that the code works well without flush(), just remove it.Sat
@RomanPodlinov: f.flush() doesn't flush data to the physical disk. It transfers the data to the OS. Usually, that is enough unless there is a power failure. f.flush() makes the code slower here for no reason. The flush happens when the corresponding file buffer (inside the app) is full. If you need more frequent writes, pass a buffer size (the buffering parameter) to open().Porcia
@J.F.Sebastian Thank you, I commented out the flush line in the code.Sat
if chunk: # filter out keep-alive new chunks – it is redundant, isn't it? Since iter_content() always yields a string and never yields None, it looks like premature optimization. I also doubt it can ever yield an empty string (I cannot imagine any reason for this).Highjack
In case you use Dropbox links, it will save your file with a name like "Banner_apus_1.23.zip?dl=1"Scrivener
@paus Double-check what you provide as a link. If Dropbox adds something to the URL (or redirects to another URL), you can easily remove it. Just change how you set the local_filename variable.Sat
@Highjack Please pay attention to the comment "filter out keep-alive new chunks" on this line. If you download a file that is several GB in size, it makes total sense.Sat
@RomanPodlinov I'm not familiar with the term "keep-alive new chunks". Can you explain it a bit further? There are keep-alive (persistent) connections (when several HTTP requests are contained in a single TCP connection) and chunked responses (when there is no Content-Length header and the content is divided into chunks, the last one being zero-length). AFAIK, these two features are independent; they have nothing in common.Highjack
@RomanPodlinov Another point: iter_content() always yields a string. There is nothing wrong with writing an empty string to a file, right? So why should we check the length?Highjack
@RomanPodlinov And one more point, sorry :) After reading iter_content() sources I've concluded that it cannot ever yield an empty string: there are emptiness checks everywhere. The main logic here: requests/packages/urllib3/response.py.Highjack
But why not shutil.copyfileobj?Expectorant
@Expectorant Because response and response.iter_content are not file-like objects?Peterec
@Expectorant An example with shutil.copyfileobj using Response.raw is below.Peterec
@Highjack "I'm not familiar with the term "keep-alive new chunks"." On the one hand I don't know who added this comment into thew code, on the other hand you change my words. This line of code removes empty chunks which appears from time to time probably because of keep-alive requests during downloadSat
@RomanPodlinov in regards to the "keep-alive chunks" check that you and y0prst were discussing; was the conclusion that it is unnecessary because requests never returns an empty string thanks to internal checks?Curtis
@RomanPodlinov this line seems to suggest so at least in the case of 'file-like objects': github.com/kennethreitz/requests/blob/…Curtis
For a 5 GB file the above code is taking forever. What would be the ideal chunk size to use in this case? Is there anything we can do to improve the download speed?Riella
@Riella In your case I recommend using my small lib github.com/keepitsimple/pyFTPclient; it can reconnect and use multiple simultaneous connections for downloading. I used this small lib for downloading files of 1-10 GB.Sat
@RomanPodlinov - I couldn't adapt pyFTPclient to download from a link, let's say https://hostname.company.com/ui/containers/9888577. How would the following lines change to download from a link? obj = PyFTPclient('192.168.0.59', 2121, 'test', 'testftp') obj.DownloadFile('USAHD-8974-20131013-0300-0330.ts')Riella
@Riella pyFTPclient was implemented for the FTP protocol.Sat
And remember to flush after writing to a file with stream=True if you're trying to get the hash / size of the file right after the download - you may be missing a few (hundred) bytes if you don't.Trinitrocresol
How do you know this is not occupying lots of memory? Looking at the process monitor? When I run: import sys print(sys.getsizeof(r.text)) I get the same size outputted whether I use your stream code above or notUnreconstructed
@newbie I don't know what OS you use. I use htop under Linux or Process Monitor from SysInternals.com under Windows.Sat
To @0xcaff's "Don't forget to close the connection with r.close()" - No, that's wrong. with will close the connection automatically.Sat
@RomanPodlinov It was not using with when I made this comment.Estaestablish
I can't download a zip file of 221 MB. The downloaded file size maxes out at 219 KB every time I tried this code.Bayard
What if I use f = requests.get(url, stream=True) and then for chunk in f.iter_content(chunk_size=8192) without using with - would it work?Pusillanimous
@RomanPodlinov Is it right to write directly to disk? That is, 8192 or 512*1024 bytes is not that big, so for a 100 MB file this will do a lot of "write" operations. Could this be an issue? How can I handle it?Herb
How can I write the same for the POST method?Blessington
I would suggest os.path.basename(url) to get the filename.Pisarik
So: if 'transfer-encoding' in r.headers.keys(): if 'chunked' in r.headers['transfer-encoding']: chunk_size = NoneHarrus
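
Several comments above ask about parallelizing the download or resuming it. The building block for both is an HTTP Range request; here is a minimal sketch (not from the answer above; it assumes the server honours Range requests, i.e. advertises Accept-Ranges: bytes and replies 206 Partial Content, and the helper name is my own):

import requests

def fetch_range(url, start, end):
    # Inclusive byte range; a compliant server replies 206 Partial Content.
    r = requests.get(url, headers={'Range': f'bytes={start}-{end}'})
    r.raise_for_status()
    if r.status_code != 206:
        raise RuntimeError('Server ignored the Range header')
    return r.content

# e.g. the first MiB of the file:
# first_chunk = fetch_range(url, 0, 1024 * 1024 - 1)
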
545

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

Note: According to the documentation, Response.raw will not decode gzip and deflate transfer-encodings, so you will need to do this manually.
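
If you do need decoded bytes, one workaround mentioned in the comments below is to ask urllib3 to decode the body while copying; a sketch under that assumption:

import functools
import shutil
import requests

def download_file_decoded(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Force urllib3 to decode gzip/deflate content as it is read.
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename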

Absorptance answered 30/8, 2016 at 2:13 Comment(25)
Note that you may need to adjust when streaming gzipped responses per issue 2155.Carpology
Have you tested this code for downloads of big files, >1 GB?Sat
Yes I did. Most of the files were > 1 GB. The code was downloading a bunch of video files on a daily basis.Sat
THIS should be the correct answer! The accepted answer gets you up to 2-3MB/s. Using copyfileobj gets you to ~40MB/s. Curl downloads (same machines, same url, etc) with ~50-55 MB/s.Investigator
@Investigator how did you check the download speeds?Evonneevonymus
@Evonneevonymus From Python, dividing the file size by the download time. I used fairly large files (100M-2G) over a gigabit connection. The server was more or less in the same network/datacenter.Investigator
A small caveat for using .raw is that it does not handle decoding. Mentioned in the docs here: docs.python-requests.org/en/master/user/quickstart/…Kennykeno
Is it possible to stream to stdout through the print?Laryssa
@VitalyZdanevich: Try shutil.copyfileobj(r.raw, sys.stdout).Absorptance
@Investigator I was able to match the download speeds between raw and iter_content after I increased chunk_size from 1024 to 10*1024 (debian ISO, regular connection)Drunkometer
The issue with the accepted answer is the chunk size. If you have a sufficiently fast connection, 1KiB is too small, you spend too much time on overhead compared to transferring data. shutil.copyfileobj defaults to 16KiB chunks. Increasing the chunk size from 1KiB will almost certainly increase download rate, but don't increase too much. I am using 1MiB chunks and it works well, it approaches full bandwidth usage. You could try to monitor connection rate and adjust chunk size based on it, but beware premature optimization.Plumlee
@EricCousineau You can patch up this behaviour replacing the read method: response.raw.read = functools.partial(response.raw.read, decode_content=True)Determinate
Is there any way to limit the streaming read here to a max value, say 128 KiB?Widen
Meanwhile it's 2019. I took the liberty of editing the missing with requests.get(url, stream=True) as r: into the answer. There's no reason not to use it.Trivial
@vog, the source code (at least, in the latest requests) already includes the with statement with sessions.Session() as session: return session.request(method=method, url=url, **kwargs)Rim
Adding length param got me better download speeds shutil.copyfileobj(r.raw, f, length=16*1024*1024)Bedspring
For me, I got back an appropriate sized object, but my machine told me the file was corrupt. I am working with pdf files and no application can open what I just downloaded.Christinachristine
just gonna bump this because this was so FAST and simple to download multiple 1GB+ files compared to othersPlayful
for me this results in an invalid tarball: gzip: stdin: not in gzip format but if I download it via browser the tar format is gzip.Snavely
Updated link to github issue 2155 about streaming gzipped responses (the link in ChrisP's answer no longer works).Oligopoly
it seems to me that shutil.copyfileobj is returning before the download is complete. Is there a way of blocking until the file has completely downloaded?Coffeepot
@Coffeepot Are you perhaps seeing some delay in your filesystem? shutil.copyfileobj doesn't exactly return before completion, but your filesystem may have some delay before readers observe the file being completely written.Absorptance
@JohnZwinck yes that could be it. I couldn't figure out an elegant way to check that the full file had been written, but I haven't seen any issues since I added a simple sleep.Coffeepot
I like this approach better, but how would I implement tqdm with this one?Negotiate
@SourceMatters: if a progress bar is important to you, this solution won't be the most straightforward.Absorptance
114

Not exactly what OP was asking, but... it's ridiculously easy to do that with urllib:

from urllib.request import urlretrieve

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:

from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:

watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?
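
As an aside, if you want progress feedback, urlretrieve accepts a reporthook callback. A minimal sketch reusing url and dst from above (the percentage formatting is my own choice, not part of the original answer):

from urllib.request import urlretrieve

def report(block_num, block_size, total_size):
    # Called by urlretrieve after each block is transferred.
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print(f'\r{done * 100 // total_size}%', end='', flush=True)

urlretrieve(url, dst, reporthook=report)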

Eggert answered 5/6, 2017 at 22:13 Comment(2)
For Python 2.x, use from urllib import urlretrieveHarv
This function "might become deprecated at some point in the future." cf. docs.python.org/3/library/urllib.request.html#legacy-interfaceCommeasure
46

Your chunk size could be too large; have you tried dropping that - maybe 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?

It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:

    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
Phoney answered 22/5, 2013 at 15:2 Comment(6)
I use System Monitor in Kubuntu. It shows me that the Python process memory increases (up to 1.5 GB from 25 KB).Sat
That memory bloat sucks; maybe f.flush(); os.fsync() might force a write and a memory free.Phoney
it's os.fsync(f.fileno())Paternalism
You need to use stream=True in the requests.get() call. That's what's causing the memory bloat.Flannel
Minor typo: you're missing a colon (':') after def DownloadFile(url)Slain
What if I don't want to save it as a file but in a BytesIO?Priestly
12

Use the wget module of Python instead. Here is a snippet:

import wget
wget.download(url)
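
If you need to control the destination path, I believe the module also accepts an out argument, but treat that as an assumption and verify it against the version you install (the package is unmaintained, as noted in the comments):

import wget

# 'out' sets the destination path; check that this parameter exists in your wget version.
filename = wget.download(url, out='/tmp/large_file.bin')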
Eiland answered 19/10, 2020 at 4:9 Comment(2)
This is a very old and unmaintained module.Fournier
The OP is specifically asking how to do it in python with requests. Jumping out of python space is not usually an option.Brawn
10

Based on Roman's most upvoted answer above, here is my implementation, including a "download as" option and a retry mechanism:

import logging
import os
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)


def download(url: str, file_path='', attempts=2):
    """Downloads a URL content into a file (with large file support by streaming)

    :param url: URL to download
    :param file_path: Local file name to contain the data downloaded
    :param attempts: Number of attempts
    :return: New file path. Empty string if the download failed
    """
    if not file_path:
        file_path = os.path.realpath(os.path.basename(url))
    logger.info(f'Downloading {url} content to {file_path}')
    url_sections = urlparse(url)
    if not url_sections.scheme:
        logger.debug('The given url is missing a scheme. Adding http scheme')
        url = f'http://{url}'
        logger.debug(f'New url: {url}')
    for attempt in range(1, attempts+1):
        try:
            if attempt > 1:
                time.sleep(10)  # 10 seconds wait time between downloads
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                with open(file_path, 'wb') as out_file:
                    for chunk in response.iter_content(chunk_size=1024*1024):  # 1MB chunks
                        out_file.write(chunk)
                logger.info('Download finished successfully')
                return file_path
        except Exception as ex:
            logger.error(f'Attempt #{attempt} failed with error: {ex}')
    return ''
Insolent answered 5/7, 2020 at 17:15 Comment(0)
3

Here is an additional approach for the use case of an async chunked download, without reading all the file content into memory.
It means that both the read from the URL and the write to the file are implemented with asyncio libraries (aiohttp to read from the URL and aiofiles to write the file).

The following code should work on Python 3.7 and later.
Just edit the SRC_URL and DEST_FILE variables before copying and pasting.

import aiofiles
import aiohttp
import asyncio

async def async_http_download(src_url, dest_file, chunk_size=65536):
    async with aiofiles.open(dest_file, 'wb') as fd:
        async with aiohttp.ClientSession() as session:
            async with session.get(src_url) as resp:
                async for chunk in resp.content.iter_chunked(chunk_size):
                    await fd.write(chunk)

SRC_URL = "/path/to/url"
DEST_FILE = "/path/to/file/on/local/machine"

asyncio.run(async_http_download(SRC_URL, DEST_FILE))
Ultramodern answered 1/8, 2022 at 13:53 Comment(0)
2

requests is good, but how about a socket solution?

def stream_(host):
    import socket
    import ssl
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
        with context.wrap_socket(sock, server_hostname=host) as wrapped_socket:
            wrapped_socket.connect((socket.gethostbyname(host), 443))
            wrapped_socket.send(
                f"GET / HTTP/1.1\r\nHost: {host}\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n\r\n".encode())

            resp = b""
            while resp[-4:-1] != b"\r\n\r":
                resp += wrapped_socket.recv(1)
            else:
                resp = resp.decode()
                content_length = int("".join([tag.split(" ")[1] for tag in resp.split("\r\n") if "content-length" in tag.lower()]))
                image = b""
                while content_length > 0:
                    data = wrapped_socket.recv(2048)
                    if not data:
                        print("EOF")
                        break
                    image += data
                    content_length -= len(data)
                with open("image.jpeg", "wb") as file:
                    file.write(image)

Wash answered 2/10, 2021 at 19:19 Comment(4)
I'm curious what's the advantange of using this instead of a higher level (and well tested) method from libs like requests?Airtoair
Libs like requests are full of abstraction above the native sockets. That's not the best algorithm, but it could be faster because of no abstraction at all.Wash
It appears this loads the whole content into memory in the "image" variable, then writes it to a file. How does this work for large files with local memory constraints?Shannon
Yeah, you can just modify this if you want. For example, change the last part to write each chunk to the file directly instead of accumulating it in the image variable.Wash
2

Yet another option for downloading large files. This will let you stop the download (press the Enter key) and resume it later, and it will also pick up from where you left off if your connection gets dropped.

import datetime
import os
import requests
import threading as th

keep_going = True
def key_capture_thread():
    global keep_going
    input()
    keep_going = False
th.Thread(target=key_capture_thread, args=(), name='key_capture_process', daemon=True).start()

def download_file(url, local_filepath):
    #assumptions:
    #  headers contain Content-Length:
    #  headers contain Accept-Ranges: bytes
    #  stream is not encoded (otherwise start bytes are not known, unless this is stored separately)
    
    chunk_size = 1048576 #1MB
    # chunk_size = 8096 #8KB
    # chunk_size = 1024 #1KB
    decoded_bytes_downloaded_this_session = 0
    start_time = datetime.datetime.now()
    if os.path.exists(local_filepath):
        decoded_bytes_downloaded = os.path.getsize(local_filepath)
    else:
        decoded_bytes_downloaded = 0
    with requests.Session() as s:
        with s.get(url, stream=True) as r:
            #check for required headers:
            if 'Content-Length' not in r.headers:
                print('STOP: request headers do not contain Content-Length')
                return
            if ('Accept-Ranges','bytes') not in r.headers.items():
                print('STOP: request headers do not contain Accept-Ranges: bytes')
                with s.get(url) as r:
                    print(str(r.content, encoding='iso-8859-1'))
                return
        content_length = int(r.headers['Content-Length'])
        if decoded_bytes_downloaded >= content_length:
            print('STOP: file already downloaded. decoded_bytes_downloaded>=r.headers[Content-Length]; {}>={}'.format(decoded_bytes_downloaded, r.headers['Content-Length']))
            return
        if decoded_bytes_downloaded>0:
            s.headers['Range'] = 'bytes={}-{}'.format(decoded_bytes_downloaded, content_length-1) #range is inclusive
            print('Retrieving byte range (inclusive) {}-{}'.format(decoded_bytes_downloaded, content_length-1))
        with s.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_filepath, mode='ab') as fwrite:
                for chunk in r.iter_content(chunk_size=chunk_size):
                    decoded_bytes_downloaded+=len(chunk)
                    decoded_bytes_downloaded_this_session+=len(chunk)
                    time_taken:datetime.timedelta = (datetime.datetime.now() - start_time)
                    seconds_per_byte = time_taken.total_seconds()/decoded_bytes_downloaded_this_session
                    remaining_bytes = content_length-decoded_bytes_downloaded
                    remaining_seconds = seconds_per_byte * remaining_bytes
                    remaining_time = datetime.timedelta(seconds=remaining_seconds)
                    #print updated statistics here
                    fwrite.write(chunk)
                    if not keep_going:
                        break

output_folder = '/mnt/HDD1TB/DownloadsBIG'

# url = 'https://file-examples.com/storage/fea508993d645be1b98bfcf/2017/10/file_example_JPG_100kB.jpg'
# url = 'https://file-examples.com/storage/fe563fce08645a90397f28d/2017/10/file_example_JPG_2500kB.jpg'
url = 'https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz'

local_filepath = os.path.join(output_folder, os.path.split(url)[-1])

download_file(url, local_filepath)
Shannon answered 10/5, 2023 at 19:16 Comment(0)
