Download large file in python with requests

Requests is a really nice library. I'd like to use it for downloading big files (>1 GB). The problem is that it's not possible to keep the whole file in memory; I need to read it in chunks. And this is where the following code goes wrong:

import requests

def DownloadFile(url)
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    f = open(local_filename, 'wb')
    for chunk in r.iter_content(chunk_size=512 * 1024): 
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
    f.close()
    return 

For some reason it doesn't work this way; it still loads the response into memory before it is saved to a file.

Sat answered 22/5, 2013 at 14:47 Comment(1)
The requests library is nice, but not intended for this purpose. I would suggest using a different library such as urllib3. #17285964Wound
978

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:

import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192): 
                # If the response is chunk-encoded, uncomment the if below
                # and set the chunk_size parameter to None.
                #if chunk: 
                f.write(chunk)
    return local_filename

Note that the number of bytes returned by iter_content is not exactly chunk_size; it is effectively a random number that is often far bigger, and it is expected to differ on every iteration.

See body-content-workflow and Response.iter_content for further reference.
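
If you also want a simple progress readout while streaming, here is a minimal sketch (not part of the original answer; it assumes the server sends a Content-Length header, and the chunk size is an arbitrary choice):

import requests

def download_file_with_progress(url, local_filename, chunk_size=8192):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total = int(r.headers.get('Content-Length', 0))  # 0 if the server doesn't say
        done = 0
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                done += len(chunk)
                if total:
                    print(f'\r{done * 100 // total}%', end='', flush=True)
    print()
    return local_filename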

Sat answered 22/5, 2013 at 15:52 Comment(41)
@Shuman This code successfully downloads files which are bigger than 1.5 GB. Can you download the file via any browser successfully?Sat
Yes, in Firefox, if I download manually, it successfully saves out a 1.5 GB .zip file.Orpah
@Orpah As I see, you resolved the issue when you switched from http:// to https:// (github.com/kennethreitz/requests/issues/2043). Can you please update or delete your comments? People may think that there are issues with the code for files bigger than 1024 MB.Sat
The chunk_size is crucial. By default it's 1 (1 byte). That means that for 1 MB it'll make 1 million iterations. docs.python-requests.org/en/latest/api/…Hipolitohipp
Is it possible to parallelize the iter_content() part somehow to speed up the download? Thanks!Phytophagous
@RomanPodlinov, do you mind telling me why you are using the flush?Balbriggan
you could use url.rsplit('/', 1)[1] as well, which will not split the whole url but only the last part of it.Petree
@RovinBhandari: to parallelize, find out whether there is support for bytes range http header in requestsPorcia
url.split('/')[-1] might be too simplistic e.g., see url2filename()Porcia
f.flush() seems unnecessary. What are you trying to accomplish using it? (your memory usage won't be 1.5gb if you drop it). f.write(b'') (if iter_content() may return an empty string) should be harmless and therefore if chunk could be dropped too.Porcia
@J.F.Sebastian Agreed, url2filename is better. About flush: the idea is to flush the data into the physical file on the drive. If you see that the code works well without flush(), just remove it.Sat
@RomanPodlinov: f.flush() doesn't flush data to the physical disk. It transfers the data to the OS. Usually, that is enough unless there is a power failure. f.flush() makes the code slower here for no reason. The flush happens when the corresponding file buffer (inside the app) is full. If you need more frequent writes, pass a buffer size (the buffering parameter) to open().Porcia
@J.F.Sebastian Thank you, I commented out the flush line in the code.Sat
if chunk: # filter out keep-alive new chunks – it is redundant, isn't it? Since iter_content() always yields a string and never yields None, it looks like premature optimization. I also doubt it can ever yield an empty string (I cannot imagine any reason for this).Highjack
In case you use Dropbox links, it will save your file with a name like "Banner_apus_1.23.zip?dl=1"Scrivener
@paus Double-check what you provide as a link. If Dropbox adds something to the URL (or redirects to another URL), you can easily remove it. Just change how you set the local_filename variable.Sat
@Highjack Please pay attention to the comment "filter out keep-alive new chunks" on this line. If you download a file that is several GB in size, it makes total sense.Sat
@RomanPodlinov I'm not familiar with the term "keep-alive new chunks". Can you explain it a bit further? There are keep-alive (persistent) connections (when several HTTP requests are contained in a single TCP connection) and chunked responses (when there is no Content-Length header and the content is divided into chunks, the last one being zero-length). AFAIK, these two features are independent; they have nothing in common.Highjack
@RomanPodlinov Another point: iter_content() always yields a string. There is nothing wrong with writing an empty string to a file, right? So why should we check the length?Highjack
@RomanPodlinov And one more point, sorry :) After reading iter_content() sources I've concluded that it cannot ever yield an empty string: there are emptiness checks everywhere. The main logic here: requests/packages/urllib3/response.py.Highjack
But why not shutil.copyfileobj?Expectorant
@Expectorant Because response and response.iter_content are not file-like objects?Peterec
@Expectorant An example with shutil.copyfileobj using Response.raw is below.Peterec
@Highjack "I'm not familiar with the term "keep-alive new chunks"." On the one hand I don't know who added this comment into thew code, on the other hand you change my words. This line of code removes empty chunks which appears from time to time probably because of keep-alive requests during downloadSat
@RomanPodlinov in regards to the "keep-alive chunks" check that you and y0prst were discussing; was the conclusion that it is unnecessary because requests never returns an empty string thanks to internal checks?Curtis
@RomanPodlinov this line seems to suggest so at least in the case of 'file-like objects': github.com/kennethreitz/requests/blob/…Curtis
For a 5 GB file the above code is taking forever. What would be the ideal chunk size to use in this case? Is there anything we can do to improve the download speed?Riella
@Riella In your case I recommend using my small lib github.com/keepitsimple/pyFTPclient; it can reconnect and use multiple simultaneous connections for downloading. I used this small lib for downloading files of 1-10 GB.Sat
@RomanPodlinov - I couldn't adapt pyFTPclient to download from a link, let's say https://hostname.company.com/ui/containers/9888577. How would the following lines change to download from a link? obj = PyFTPclient('192.168.0.59', 2121, 'test', 'testftp') obj.DownloadFile('USAHD-8974-20131013-0300-0330.ts')Riella
@Riella pyFTPclient was implemented for the FTP protocol.Sat
And remember to flush after writing to a file with stream=True if you're trying to get the hash / size of the file right after the download - you may be missing a few (hundred) bytes if you don't.Trinitrocresol
How do you know this is not occupying lots of memory? Looking at the process monitor? When I run: import sys print(sys.getsizeof(r.text)) I get the same size outputted whether I use your stream code above or notUnreconstructed
@newbie I don't know what OS you use. I use htop under Linux or Process Monitor from SysInternals.com under Windows.Sat
To @0xcaff's "Don't forget to close the connection with r.close()" - No, that's wrong. with will close the connection automatically.Sat
@RomanPodlinov It was not using with when I made this comment.Estaestablish
I can't download a zip file of 221 MB. The downloaded file size maxes out at 219 KB every time I tried this code.Bayard
What if I use f = requests.get(url, stream=True) and then for chunk in f.iter_content(chunk_size=8192) without using with - would it work?Pusillanimous
@RomanPodlinov Is it right to write directly to disk? That is, 8192 or 512*1024 bytes is not that big, so for a 100 MB file this will do a lot of "write" operations. Could this be an issue? How can I handle it?Herb
How can I write the same for the POST method?Blessington
I would suggest os.path.basename(url) to get the filename.Pisarik
So: if 'transfer-encoding' in r.headers.keys(): if 'chunked' in r.headers['transfer-encoding']: chunk_size = NoneHarrus
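
Several comments above ask about parallelizing the download or resuming it. The building block for both is an HTTP Range request; here is a minimal sketch (not from the answer above; it assumes the server honours Range requests, i.e. advertises Accept-Ranges: bytes and replies 206 Partial Content, and the helper name is my own):

import requests

def fetch_range(url, start, end):
    # Inclusive byte range; a compliant server replies 206 Partial Content.
    r = requests.get(url, headers={'Range': f'bytes={start}-{end}'})
    r.raise_for_status()
    if r.status_code != 206:
        raise RuntimeError('Server ignored the Range header')
    return r.content

# e.g. the first MiB of the file:
# first_chunk = fetch_range(url, 0, 1024 * 1024 - 1)
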
545

It's much easier if you use Response.raw and shutil.copyfileobj():

import requests
import shutil

def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

This streams the file to disk without using excessive memory, and the code is simple.

Note: According to the documentation, Response.raw will not decode gzip and deflate transfer-encodings, so you will need to do this manually.
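
If you do need decoded bytes, one workaround mentioned in the comments below is to ask urllib3 to decode the body while copying; a sketch under that assumption:

import functools
import shutil
import requests

def download_file_decoded(url, local_filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Force urllib3 to decode gzip/deflate content as it is read.
        r.raw.read = functools.partial(r.raw.read, decode_content=True)
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return local_filename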

Absorptance answered 30/8, 2016 at 2:13 Comment(25)
Note that you may need to adjust when streaming gzipped responses per issue 2155.Carpology
Have you tested this code for downloads of big files, >1 GB?Sat
Yes I did. Most of the files were > 1 GB. The code was downloading a bunch of video files on a daily basis.Sat
THIS should be the correct answer! The accepted answer gets you up to 2-3MB/s. Using copyfileobj gets you to ~40MB/s. Curl downloads (same machines, same url, etc) with ~50-55 MB/s.Investigator
@Investigator how did you check the download speeds?Evonneevonymus
@Evonneevonymus From Python, dividing the file size by the download time. I used fairly large files (100M-2G) over a gigabit connection. The server was more or less in the same network/datacenter.Investigator
A small caveat for using .raw is that it does not handle decoding. Mentioned in the docs here: docs.python-requests.org/en/master/user/quickstart/…Kennykeno
Is it possible to stream to stdout through the print?Laryssa
@VitalyZdanevich: Try shutil.copyfileobj(r.raw, sys.stdout).Absorptance
@Investigator I was able to match the download speeds between raw and iter_content after I increased chunk_size from 1024 to 10*1024 (debian ISO, regular connection)Drunkometer
The issue with the accepted answer is the chunk size. If you have a sufficiently fast connection, 1KiB is too small, you spend too much time on overhead compared to transferring data. shutil.copyfileobj defaults to 16KiB chunks. Increasing the chunk size from 1KiB will almost certainly increase download rate, but don't increase too much. I am using 1MiB chunks and it works well, it approaches full bandwidth usage. You could try to monitor connection rate and adjust chunk size based on it, but beware premature optimization.Plumlee
@EricCousineau You can patch up this behaviour replacing the read method: response.raw.read = functools.partial(response.raw.read, decode_content=True)Determinate
Is there any way to limit the streaming read here to a max value, say 128 KiB?Widen
Meanwhile it's 2019. I took the liberty of editing the missing with requests.get(url, stream=True) as r: into the answer. There's no reason not to use it.Trivial
@vog, the source code (at least, in the latest requests) already includes the with statement with sessions.Session() as session: return session.request(method=method, url=url, **kwargs)Rim
Adding length param got me better download speeds shutil.copyfileobj(r.raw, f, length=16*1024*1024)Bedspring
For me, I got back an appropriate sized object, but my machine told me the file was corrupt. I am working with pdf files and no application can open what I just downloaded.Christinachristine
just gonna bump this because this was so FAST and simple to download multiple 1GB+ files compared to othersPlayful
for me this results in an invalid tarball: gzip: stdin: not in gzip format but if I download it via browser the tar format is gzip.Snavely
Updated link to github issue 2155 about streaming gzipped responses (the link in ChrisP's answer no longer works).Oligopoly
it seems to me that shutil.copyfileobj is returning before the download is complete. Is there a way of blocking until the file has completely downloaded?Coffeepot
@Coffeepot Are you perhaps seeing some delay in your filesystem? shutil.copyfileobj doesn't exactly return before completion, but your filesystem may have some delay before readers observe the file being completely written.Absorptance
@JohnZwinck yes that could be it. I couldn't figure out an elegant way to check that the full file had been written, but I haven't seen any issues since I added a simple sleep.Coffeepot
I like this approach better, but how would I implement tqdm with this one?Negotiate
@SourceMatters: if a progress bar is important to you, this solution won't be the most straightforward.Absorptance
114

Not exactly what OP was asking, but... it's ridiculously easy to do that with urllib:

from urllib.request import urlretrieve

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
dst = 'ubuntu-16.04.2-desktop-amd64.iso'
urlretrieve(url, dst)

Or this way, if you want to save it to a temporary file:

from urllib.request import urlopen
from shutil import copyfileobj
from tempfile import NamedTemporaryFile

url = 'http://mirror.pnl.gov/releases/16.04.2/ubuntu-16.04.2-desktop-amd64.iso'
with urlopen(url) as fsrc, NamedTemporaryFile(delete=False) as fdst:
    copyfileobj(fsrc, fdst)

I watched the process:

watch 'ps -p 18647 -o pid,ppid,pmem,rsz,vsz,comm,args; ls -al *.iso'

And I saw the file growing, but memory usage stayed at 17 MB. Am I missing something?
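
As an aside, if you want progress feedback, urlretrieve accepts a reporthook callback. A minimal sketch reusing url and dst from above (the percentage formatting is my own choice, not part of the original answer):

from urllib.request import urlretrieve

def report(block_num, block_size, total_size):
    # Called by urlretrieve after each block is transferred.
    if total_size > 0:
        done = min(block_num * block_size, total_size)
        print(f'\r{done * 100 // total_size}%', end='', flush=True)

urlretrieve(url, dst, reporthook=report)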

Eggert answered 5/6, 2017 at 22:13 Comment(2)
For Python 2.x, use from urllib import urlretrieveHarv
This function "might become deprecated at some point in the future." cf. docs.python.org/3/library/urllib.request.html#legacy-interfaceCommeasure
46

Your chunk size could be too large; have you tried dropping that - maybe 1024 bytes at a time? (Also, you could use with to tidy up the syntax.)

def DownloadFile(url):
    local_filename = url.split('/')[-1]
    r = requests.get(url)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return 

Incidentally, how are you deducing that the response has been loaded into memory?

It sounds as if Python isn't flushing the data to the file. Based on other SO questions, you could try f.flush() and os.fsync() to force the file write and free memory:

    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())
Phoney answered 22/5, 2013 at 15:2 Comment(6)
I use System Monitor in Kubuntu. It shows me that the Python process memory increases (up to 1.5 GB from 25 KB).Sat
That memory bloat sucks; maybe f.flush(); os.fsync() might force a write and a memory free.Phoney
it's os.fsync(f.fileno())Paternalism
You need to use stream=True in the requests.get() call. That's what's causing the memory bloat.Flannel
Minor typo: you're missing a colon (':') after def DownloadFile(url)Slain
What if I don't want to save it as a file but in a BytesIO?Priestly
12

Use the wget module of Python instead. Here is a snippet:

import wget
wget.download(url)
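
If you need to control the destination path, I believe the module also accepts an out argument, but treat that as an assumption and verify it against the version you install (the package is unmaintained, as noted in the comments):

import wget

# 'out' sets the destination path; check that this parameter exists in your wget version.
filename = wget.download(url, out='/tmp/large_file.bin')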
Eiland answered 19/10, 2020 at 4:9 Comment(2)
This is a very old and unmaintained module.Fournier
The OP is specifically asking how to do it in python with requests. Jumping out of python space is not usually an option.Brawn
10

Based on Roman's most upvoted answer above, here is my implementation, including a "download as" option and a retry mechanism:

import logging
import os
import time
from urllib.parse import urlparse

import requests

logger = logging.getLogger(__name__)


def download(url: str, file_path='', attempts=2):
    """Downloads a URL content into a file (with large file support by streaming)

    :param url: URL to download
    :param file_path: Local file name to contain the data downloaded
    :param attempts: Number of attempts
    :return: New file path. Empty string if the download failed
    """
    if not file_path:
        file_path = os.path.realpath(os.path.basename(url))
    logger.info(f'Downloading {url} content to {file_path}')
    url_sections = urlparse(url)
    if not url_sections.scheme:
        logger.debug('The given url is missing a scheme. Adding http scheme')
        url = f'http://{url}'
        logger.debug(f'New url: {url}')
    for attempt in range(1, attempts+1):
        try:
            if attempt > 1:
                time.sleep(10)  # 10 seconds wait time between downloads
            with requests.get(url, stream=True) as response:
                response.raise_for_status()
                with open(file_path, 'wb') as out_file:
                    for chunk in response.iter_content(chunk_size=1024*1024):  # 1MB chunks
                        out_file.write(chunk)
                logger.info('Download finished successfully')
                return file_path
        except Exception as ex:
            logger.error(f'Attempt #{attempt} failed with error: {ex}')
    return ''
Insolent answered 5/7, 2020 at 17:15 Comment(0)
3

Here is an additional approach for the use case of an async chunked download, without reading all the file content into memory.
It means that both the read from the URL and the write to the file are implemented with asyncio libraries (aiohttp to read from the URL and aiofiles to write the file).

The following code should work on Python 3.7 and later.
Just edit the SRC_URL and DEST_FILE variables before copying and pasting.

import aiofiles
import aiohttp
import asyncio

async def async_http_download(src_url, dest_file, chunk_size=65536):
    async with aiofiles.open(dest_file, 'wb') as fd:
        async with aiohttp.ClientSession() as session:
            async with session.get(src_url) as resp:
                async for chunk in resp.content.iter_chunked(chunk_size):
                    await fd.write(chunk)

SRC_URL = "/path/to/url"
DEST_FILE = "/path/to/file/on/local/machine"

asyncio.run(async_http_download(SRC_URL, DEST_FILE))
Ultramodern answered 1/8, 2022 at 13:53 Comment(0)
2

requests is good, but how about a socket solution?

def stream_(host):
    import socket
    import ssl
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
        with context.wrap_socket(sock, server_hostname=host) as wrapped_socket:
            wrapped_socket.connect((socket.gethostbyname(host), 443))
            wrapped_socket.send(
                f"GET / HTTP/1.1\r\nHost: {host}\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9\r\n\r\n".encode())

            resp = b""
            while resp[-4:-1] != b"\r\n\r":
                resp += wrapped_socket.recv(1)
            else:
                resp = resp.decode()
                content_length = int("".join([tag.split(" ")[1] for tag in resp.split("\r\n") if "content-length" in tag.lower()]))
                image = b""
                while content_length > 0:
                    data = wrapped_socket.recv(2048)
                    if not data:
                        print("EOF")
                        break
                    image += data
                    content_length -= len(data)
                with open("image.jpeg", "wb") as file:
                    file.write(image)

Wash answered 2/10, 2021 at 19:19 Comment(4)
I'm curious what's the advantange of using this instead of a higher level (and well tested) method from libs like requests?Airtoair
Libs like requests are full of abstraction above the native sockets. That's not the best algorithm, but it could be faster because of no abstraction at all.Wash
It appears this loads the whole content into memory in the "image" variable, then writes it to a file. How does this work for large files with local memory constraints?Shannon
Yeah, you can just modify this if you want. For example, change the last part to write each chunk to the file directly instead of accumulating it in the image variable.Wash
2

Yet another option for downloading large files. This will let you stop the download (press the Enter key) and resume it later, and it will also pick up from where you left off if your connection gets dropped.

import datetime
import os
import requests
import threading as th

keep_going = True
def key_capture_thread():
    global keep_going
    input()
    keep_going = False
th.Thread(target=key_capture_thread, args=(), name='key_capture_process', daemon=True).start()

def download_file(url, local_filepath):
    #assumptions:
    #  headers contain Content-Length:
    #  headers contain Accept-Ranges: bytes
    #  stream is not encoded (otherwise start bytes are not known, unless this is stored separately)
    
    chunk_size = 1048576 #1MB
    # chunk_size = 8096 #8KB
    # chunk_size = 1024 #1KB
    decoded_bytes_downloaded_this_session = 0
    start_time = datetime.datetime.now()
    if os.path.exists(local_filepath):
        decoded_bytes_downloaded = os.path.getsize(local_filepath)
    else:
        decoded_bytes_downloaded = 0
    with requests.Session() as s:
        with s.get(url, stream=True) as r:
            #check for required headers:
            if 'Content-Length' not in r.headers:
                print('STOP: request headers do not contain Content-Length')
                return
            if ('Accept-Ranges','bytes') not in r.headers.items():
                print('STOP: request headers do not contain Accept-Ranges: bytes')
                with s.get(url) as r:
                    print(str(r.content, encoding='iso-8859-1'))
                return
        content_length = int(r.headers['Content-Length'])
        if decoded_bytes_downloaded >= content_length:
            print('STOP: file already downloaded. decoded_bytes_downloaded>=r.headers[Content-Length]; {}>={}'.format(decoded_bytes_downloaded, r.headers['Content-Length']))
            return
        if decoded_bytes_downloaded>0:
            s.headers['Range'] = 'bytes={}-{}'.format(decoded_bytes_downloaded, content_length-1) #range is inclusive
            print('Retrieving byte range (inclusive) {}-{}'.format(decoded_bytes_downloaded, content_length-1))
        with s.get(url, stream=True) as r:
            r.raise_for_status()
            with open(local_filepath, mode='ab') as fwrite:
                for chunk in r.iter_content(chunk_size=chunk_size):
                    decoded_bytes_downloaded+=len(chunk)
                    decoded_bytes_downloaded_this_session+=len(chunk)
                    time_taken:datetime.timedelta = (datetime.datetime.now() - start_time)
                    seconds_per_byte = time_taken.total_seconds()/decoded_bytes_downloaded_this_session
                    remaining_bytes = content_length-decoded_bytes_downloaded
                    remaining_seconds = seconds_per_byte * remaining_bytes
                    remaining_time = datetime.timedelta(seconds=remaining_seconds)
                    #print updated statistics here
                    fwrite.write(chunk)
                    if not keep_going:
                        break

output_folder = '/mnt/HDD1TB/DownloadsBIG'

# url = 'https://file-examples.com/storage/fea508993d645be1b98bfcf/2017/10/file_example_JPG_100kB.jpg'
# url = 'https://file-examples.com/storage/fe563fce08645a90397f28d/2017/10/file_example_JPG_2500kB.jpg'
url = 'https://ftp.ncbi.nlm.nih.gov/blast/db/nr.00.tar.gz'

local_filepath = os.path.join(output_folder, os.path.split(url)[-1])

download_file(url, local_filepath)
Shannon answered 10/5, 2023 at 19:16 Comment(0)
