Python download without supplying a filename
How do I download a file with a progress report using Python, but without supplying a filename?

I have tried urllib.urlretrieve, but it seems I have to supply a filename for the downloaded file to be saved as.

So for example:

I don't want to supply this:

urllib.urlretrieve("http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US", "/tmp/firefox.exe")

just this:

urllib.urlretrieve("http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US", "/tmp/")

but if I do I get this error:

IOError: [Errno 21] Is a directory: '/tmp'

I am also unable to get the filename from some URLs. For example:

http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US

Louanne answered 8/5, 2010 at 19:28 Comment(3)
How can you download something if you don't know what to download? You need some identifier. Please clarify your question. - Dieter
Sorry, I mean a filename for the download to be saved as. I know the URL. I hope that makes sense. - Louanne
Not sure if I understand your question: do you want to extract a filename from a given URL and then use it as the filename under a user-defined directory? - Cullender
I
35

Here is a complete way to do it with Python 3 when no filename is specified in the URL:

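from urllib.request import urlopen, urlretrieve
import cgi  # note: deprecated in Python 3.11 and removed in Python 3.13

url = "http://cloud.ine.ru/s/JDbPr6W4QXnXKgo/download"
# Open the URL once to read the response headers.
remotefile = urlopen(url)
contentdisposition = remotefile.info()['Content-Disposition']
remotefile.close()
# Split 'attachment; filename="..."' into its parameters.
_, params = cgi.parse_header(contentdisposition)
filename = params["filename"]
# Download again, saving under the name the server supplied.
urlretrieve(url, filename)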

As a result, you should get a file named cargo_live_animals_parrot.jpg.
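
Since the cgi module is no longer available in recent Python versions, here is a minimal sketch of the same header parsing using only email.message from the standard library. The function name filename_from_content_disposition is made up for this example. On a real response, response.headers.get_filename() should give the same result, because the headers object is an email.message.Message subclass.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper for illustration; not part of urllib.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() looks up the "filename" parameter of Content-Disposition
    # and returns None when the parameter is absent.
    return msg.get_filename()


print(filename_from_content_disposition(
    'attachment; filename="cargo_live_animals_parrot.jpg"'))
```

It returns None when the header is missing or carries no filename, so the caller can fall back on the URL.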

Improvement answered 9/4, 2018 at 12:58 Comment(4)
This is the best answer as it takes into account the fact that the server can choose a filename that is completely different from the URL. - Louvre
Would there be any option available if we are stuck on Python 2.7? - Aguila
I think urllib exists for Python 2.7 (maybe in pip); not sure about cgi, but very possible. - Differential
Something to keep in mind: the Content-Disposition header isn't always there. For example, OpenSSL's openssl.org/source/old/1.1.1/openssl-1.1.1q.tar.gz doesn't give the header. I guess you have to fall back on parsing your URL string if the server doesn't give you the header. - Habitual
C
12

edited after the question was clarified...

urlparse.urlsplit will take the URL that you are opening and split it into its component parts; you can then take the path portion and use the last /-delimited chunk as the filename.

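# Python 2 (in Python 3 these functions live in urllib.parse and urllib.request)
import urllib, urlparse

# url is the address being downloaded
split = urlparse.urlsplit(url)
# The last "/"-delimited chunk of the path becomes the filename.
filename = "/tmp/" + split.path.split("/")[-1]
urllib.urlretrieve(url, filename)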
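
For Python 3, a rough equivalent of the same idea could look like the sketch below. The helper name filename_from_url is an assumption for this example, and the percent-decoding step is an addition that the snippet above does not do.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded.

    Hypothetical helper; returns '' when the path ends in '/'.
    """
    path = urlsplit(url).path
    # Decode %20 and friends first, then keep only the final component.
    return posixpath.basename(unquote(path))


print(filename_from_url(
    "http://mozilla.mirrors.evolva.ro//firefox/releases/3.6.3/win32/en-US/"
    "Firefox%20Setup%203.6.3.exe"))
```

For a URL whose path is just "/", this returns an empty string, so the caller still needs some other source for the name.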
Chamorro answered 8/5, 2010 at 19:52 Comment(2)
The problem is that this URL, mozilla.com/products/…, doesn't contain a filename. Thanks for your reply! - Louanne
So how did you solve it when the image URL doesn't contain an extension? - Boxthorn
G
2

There is urlopen, which creates a file-like object that can be used to read the data without saving it to a local file:

from urllib2 import urlopen  # Python 2; in Python 3 import it from urllib.request

f = urlopen("http://example.com/")
for line in f:
    print(len(line))
f.close()

(I'm not really sure if this is what you're asking for.)

Girdler answered 8/5, 2010 at 19:44 Comment(1)
Not quite. I have just edited my question with an example; I hope this helps. Thanks for the reply. - Louanne
G
2

The URL you're specifying doesn't refer to a file at all. It's a redirect to a web page that runs some JavaScript, which in turn causes your web browser to download the file. The actual address my browser was directed to (a mirror) from the URL in question is:

http://mozilla.mirrors.evolva.ro//firefox/releases/3.6.3/win32/en-US/Firefox%20Setup%203.6.3.exe

I believe that there are two ways that web servers specify the name of the file for downloads:

  1. The final segment of the URL path
  2. The Content-Disposition header, which can specify some other filename to use

For the file you want to download, I think you only need the last path segment of the URL (but using the actual URL of the file, not the web page that chooses which mirrored file to use). For some downloads, though, you'd need to get the filename from the Content-Disposition header.
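
The two sources could be combined roughly like this in Python 3. This is only a sketch: filename_for_response and the "download" default are names invented for the example, and it assumes a response object as returned by urllib.request.urlopen, which exposes headers and the final url.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_for_response(response, default="download"):
    """Pick a local filename for an open urllib response (hypothetical helper)."""
    # 1. Prefer the name the server suggests in Content-Disposition.
    name = response.headers.get_filename()
    if not name:
        # 2. Fall back on the path of the final (post-redirect) URL.
        name = unquote(urlsplit(response.url).path)
    # Keep only the last component so the server cannot choose the directory.
    name = posixpath.basename(name.replace("\\", "/"))
    return name or default


# Usage sketch (needs network access):
# from urllib.request import urlopen
# with urlopen(url) as response:
#     print(filename_for_response(response))
```

The default only kicks in when neither the header nor the URL path yields a usable name.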

Guide answered 8/5, 2010 at 20:47 Comment(0)
H
1

I ended up with

import os
# Quote the URL so the shell does not treat "&" as a command separator.
os.system('wget -P /tmp "http://www.mozilla.com/products/download.html?'
          'product=firefox-3.6.3&os=win&lang=en-US"')
Hetero answered 15/5, 2016 at 20:32 Comment(1)
You should probably add the --trust-server-names switch so wget names the file after the redirect target, or --content-disposition so it uses the name provided in Content-Disposition. - Outlet
D
1
import os
import shutil
import urllib.parse
import urllib.request

urls = {
    'just_filename': 'https://github.com/bits4waves/100daysofpractice-dataset/raw/master/requirements.txt',
    'filename_with_params': 'https://github.com/bits4waves/resonometer/blob/master/sound/violin-A-pluck.wav?raw=true',
    'no_filename': 'https://download.mozilla.org/?product=firefox-latest-ssl&os=linux64&lang=en-US',
}

for url in urls.values():
    with urllib.request.urlopen(url) as response:
        # response.url is the final URL after any redirects have been followed.
        parsed_url_path = urllib.parse.urlparse(response.url).path
        # Use the last path segment as the local filename.
        filename = os.path.basename(parsed_url_path)
        with open(filename, 'w+b') as f:
            shutil.copyfileobj(response, f)
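
The question also asks for a progress report, which shutil.copyfileobj does not give. One possible sketch is to copy in chunks by hand. The function copy_with_progress and its parameters are invented for this example, and the total is assumed to come from the Content-Length header, which a server may not send.

```python
def copy_with_progress(src, dst, total=None, chunk_size=64 * 1024, report=print):
    """Copy src to dst in chunks, calling report() after each chunk.

    Hypothetical helper; src and dst are binary file-like objects.
    Returns the number of bytes copied.
    """
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        done += len(chunk)
        if total:
            report("%d / %d bytes (%.0f%%)" % (done, total, 100.0 * done / total))
        else:
            # Without Content-Length only the running byte count is known.
            report("%d bytes" % done)
    return done


# Usage sketch (needs network access):
# with urllib.request.urlopen(url) as response, open(filename, "wb") as f:
#     length = response.headers.get("Content-Length")
#     copy_with_progress(response, f, int(length) if length else None)
```

Passing a different report callable lets you feed a progress bar instead of printing.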
Depression answered 14/5, 2021 at 17:44 Comment(0)
T
0

A quick look at the JavaScript on the Firefox page reveals:

// 2. Build download.mozilla.org URL out of those vars.
download_url = "http://download.mozilla.org/?product=";
download_url += product + '&os=' + os + '&lang=' + lang;

So just change your URL from:

http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US

to

http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-US

So now I will check the headers to see what we really get...

$ curl -I "http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-US"
HTTP/1.1 302 Found
Server: Apache
X-Backend-Server: pp-app-dist09
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0, private
Content-Type: text/html; charset=UTF-8
Date: Sat, 08 May 2010 21:02:50 GMT
Location: http://mozilla.mirror.ac.za/firefox/releases/3.6.3/win32/en-US/Firefox Setup 3.6.3.exe
Pragma: no-cache
Transfer-Encoding: chunked
Connection: Keep-Alive
Set-Cookie: dmo=10.8.84.200.1273352570769772; path=/; expires=Sun, 08-May-11 21:02:50 GMT
X-Powered-By: PHP/5.1.6

So this is actually a 302 redirect. Use the value of the Location header as your new URL to get the actual file. You'll need to figure out how to make a request and read the headers on your own (sorry, I don't have much time). After you parse the Location header, you can strip out everything but the last segment with a regex to get the filename to save the file as:

>>> import re
>>> location = 'http://mozilla.mirror.ac.za/firefox/releases/3.6.3/win32/en-US/Firefox Setup 3.6.3.exe'
>>> re.match('^.*/(.*?)$', location).groups()[0]
'Firefox Setup 3.6.3.exe'

So to get the actual filename you will need to follow the 302 yourself. The code necessary for this I will leave up to you, but hopefully this will point you in the right direction.
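
As a starting point, reading the Location header without following the redirect might look like the Python 3 sketch below. It uses http.client, which does not follow redirects on its own; the function name redirect_location is made up for the example. Note that urllib.request.urlopen normally follows redirects by itself and exposes the final address as response.url, so this is only needed if you want to inspect each hop.

```python
import http.client
from urllib.parse import urlsplit


def redirect_location(url, timeout=30):
    """Send a HEAD request and return the Location header of a redirect.

    Hypothetical helper; returns None if the response is not a redirect.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target: path plus query string, if any.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("HEAD", target)
        response = conn.getresponse()
        if response.status in (301, 302, 303, 307, 308):
            return response.getheader("Location")
        return None
    finally:
        conn.close()


# Usage sketch (needs network access):
# print(redirect_location(
#     "http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-US"))
```

The returned value can then be passed to the regex above, or requested again if it is itself a redirect.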

Tincture answered 8/5, 2010 at 21:15 Comment(0)
A
0

urlgrabber.urlgrab() will use the basename of the URL passed to it as the filename. Note that it will ignore the Content-Disposition header.

Averell answered 8/5, 2010 at 21:18 Comment(0)
