How to Download only the first x bytes of data Python
Situation: The file to be downloaded is large (>100 MB), so it takes quite some time, especially over a slow internet connection.

Problem: However, I just need the file header (the first 512 bytes), which will decide if the whole file needs to be downloaded or not.

Question: Is there a way to download only the first 512 bytes of a file?

Additional information: Currently the download is done using urllib.urlretrieve in Python 2.7.

Danonorwegian answered 15/1, 2018 at 6:34 Comment(5)
I would take wget apart and modify it so it stops before the end.Coppice
Are you able to use the HTTP HEAD method? That returns only the headers.Rootless
@user2896976 Those are the HTTP headers, I believe? I need the file headers, which are in the first 512 bytes of the file.Danonorwegian
@Jean-FrançoisFabre Would love to do that too but with my skills I think I will get murdered by my teacher before I am done HAHAHA. But thanks though - didn't think of thatDanonorwegian
Range request? https://mcmap.net/q/690424/-only-download-a-part-of-the-document-using-python-requestsBerth
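The range-request idea from the last comment can be sketched in Python 3 with nothing but the standard library. This is a hedged example: the function name is made up, the URL is a placeholder, and it only saves bandwidth when the server actually supports range requests.

```python
# Sketch of an HTTP Range request: ask the server for bytes 0-511 only.
import urllib.request

def fetch_first_bytes(url, n=512):
    req = urllib.request.Request(url, headers={"Range": "bytes=0-%d" % (n - 1)})
    with urllib.request.urlopen(req) as resp:
        # A server that honors the range answers 206 Partial Content with
        # just those bytes; one that ignores it answers 200 with the full
        # body, so cap the read at n bytes either way.
        data = b""
        while len(data) < n:
            chunk = resp.read(n - len(data))
            if not chunk:
                break
            data += chunk
        return data
```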
I think curl and head would work better than a Python solution here:

curl https://my.website.com/file.txt | head -c 512 > header.txt

EDIT: Also, if you absolutely must have it in a Python script, you can use subprocess to run the curl-piped-to-head command.

EDIT 2: For a fully Python solution: The urlopen function (urllib2.urlopen in Python 2, and urllib.request.urlopen in Python 3) returns a file-like stream that you can use the read function on, which allows you to specify a number of bytes. For example, urllib2.urlopen(my_url).read(512) will return the first 512 bytes of my_url
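A minimal Python 3 sketch of that read(512) approach (the function name is mine, and the URL would be whatever file you are probing):

```python
# Open the URL as a file-like stream and pull only the first n bytes.
import urllib.request

def read_header(url, n=512):
    with urllib.request.urlopen(url) as resp:
        data = resp.read(n)
        # read(n) may return fewer bytes on a short network read,
        # so top up until we have n bytes or the stream ends.
        while len(data) < n:
            chunk = resp.read(n - len(data))
            if not chunk:
                break
            data += chunk
        return data
```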

Zaremski answered 15/1, 2018 at 6:40 Comment(4)
Ah yes. The edit was what I needed. But no Python modules can do this?Danonorwegian
The urlopen function (urllib2.urlopen in Python 2, and urllib.request.urlopen in Python 3) returns a file-like stream that you can use the read function on, which allows you to specify a number of bytes. For example, urllib2.urlopen(my_url).read(512) will return the first 512 bytes of my_url. However, I'm not certain this will only download 512 bytes, or if it will try to download the entire file behind-the-scenes and just return the first 512Zaremski
the one in the comment works. do you want to replace it and let me accept as answer?Danonorwegian
Might I add that urllib also has the same urlopen function, if you want to cut down the number of libraries you import. (I had imported urllib and was actually hesitant to also import urllib2)Danonorwegian
If the URL you are trying to read responds with a Content-Length header, then you can get the file size with urllib2 in Python 2.

import urllib2

def get_file_size(url):
    # issue a HEAD request so only the headers come back, not the body
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'
    response = urllib2.urlopen(request)
    length = response.headers.getheader("Content-Length")
    return int(length)

The function can be called to get the length and compared with some threshold value to decide whether to download or not.

if get_file_size("http://stackoverflow.com") < 1000000:
    # Download

Note that the Python 3 implementation differs slightly:

from urllib import request

def get_file_size(url):
    r = request.Request(url)
    r.get_method = lambda: 'HEAD'
    response = request.urlopen(r)
    length = response.getheader("Content-Length")
    return int(length)
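One way to sanity-check the Python 3 version is against a throwaway local HTTP server; the handler below is my own test scaffolding, not part of the answer:

```python
# Exercise the HEAD-based size check against a local server that
# advertises a known Content-Length.
import threading
import http.server
from urllib import request

def get_file_size(url):
    r = request.Request(url)
    r.get_method = lambda: 'HEAD'
    response = request.urlopen(r)
    length = response.getheader("Content-Length")
    return int(length)

class Handler(http.server.BaseHTTPRequestHandler):
    BODY = b"x" * 2048
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(self.BODY)))
        self.end_headers()
    def log_message(self, *args):
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=srv.serve_forever, daemon=True).start()
size = get_file_size("http://127.0.0.1:%d/" % srv.server_address[1])
srv.shutdown()
print(size)  # 2048
```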
Runstadler answered 15/1, 2018 at 6:54 Comment(1)
Love the idea, but I need to compare the hash value stored in the file header. The file size can be the same while the contents differ, so the hash is a more reliable check than the file size.Danonorwegian
