How to determine the filename of content downloaded with HTTP in Python?
I download a file using the get function of the Python requests library. To store the file, I'd like to determine the filename the way a web browser would for its 'save' or 'save as ...' dialog.

Easy, right? I can just get it from the Content-Disposition HTTP header, accessible on the response object:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)

But looking more closely at this topic, it isn't that easy:

According to RFC 6266 section 4.3 and the grammar in section 4.1, the value can be an unquoted token (e.g. the_report.pdf) or a quoted string that can also contain whitespace (e.g. "the report.pdf") and escape sequences. Further,

when both "filename" and "filename*" are present in a single header field value, [we] SHOULD pick "filename*" and ignore "filename".

The value of filename*, though, is somewhat more complicated than that of filename.

Also, the RFC seems to allow for additional whitespace around the =.

Thus, for the examples listed in the RFC, I'd want the following results:

  •   Content-Disposition: Attachment; filename=example.html
    
    filename: example.html
  •   Content-Disposition: INLINE; FILENAME= "an example.html"
    
    filename: an example.html
  •   Content-Disposition: attachment;
                           filename*= UTF-8''%e2%82%ac%20rates
    
    filename: € rates
  •   Content-Disposition: attachment;
                           filename="EURO rates";
                           filename*=utf-8''%e2%82%ac%20rates
    
    filename: € rates here, too (not EURO rates, as filename* takes precedence)

Now, I could easily adapt the regular expression to account for variable whitespace around the =, but having it handle all the other variations, too, would get rather unwieldy. (With the quoting and escaping, I'm not even sure regular expressions can cover all the cases. Maybe they can, as there is no brace nesting involved.)
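To make the variations above concrete, here is a minimal sketch of what a hand-rolled approach for just the four RFC examples might look like. The function name filename_from_disposition and the pattern _PARAM are made up for illustration; this is not a complete RFC 6266 parser, and it assumes a well-formed filename* value (a missing charset prefix or an unknown charset would raise an exception).

```python
import re
from urllib.parse import unquote

# One parameter: name, optional whitespace around "=", then either a
# quoted-string (with backslash escapes) or an unquoted token.
_PARAM = re.compile(
    r';\s*(?P<name>[^\s=;]+)\s*=\s*'
    r'(?:"(?P<quoted>(?:[^"\\]|\\.)*)"|(?P<token>[^\s;]+))'
)

def filename_from_disposition(header):
    """Return the filename from a Content-Disposition value, or None."""
    params = {}
    for m in _PARAM.finditer(header):
        name = m.group('name').lower()  # parameter names are case-insensitive
        if m.group('quoted') is not None:
            # Undo backslash escapes inside the quoted-string
            value = re.sub(r'\\(.)', r'\1', m.group('quoted'))
        else:
            value = m.group('token')
        params.setdefault(name, value)  # keep the first occurrence of a name
    if 'filename*' in params:
        # Extended value: charset'language'percent-encoded-bytes
        charset, _lang, encoded = params['filename*'].split("'", 2)
        return unquote(encoded, encoding=charset)
    return params.get('filename')
```

Even this limited version needs a tokenizing pattern plus separate unescaping and percent-decoding steps, which is why a ready-made library would be preferable.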

So do I have to implement a full-blown parser, or can I determine the filename according to RFC 6266 with a few calls to an HTTP library (maybe requests itself)? As RFC 6266 is part of the HTTP standard, I could imagine that some libraries specializing in HTTP already cover this. (So I've also asked on Software Recommendations SE.)

Bret answered 5/5, 2016 at 21:11 Comment(0)

The rfc6266 library appears to do exactly what you need. It can parse raw headers, requests responses, and urllib2 responses. It's on PyPI.

Some examples:

>>> import rfc6266, requests
>>> rfc6266.parse_headers('''Attachment; filename=example.html''').filename_unsafe
'example.html'
>>> rfc6266.parse_headers('''INLINE; FILENAME= "an example.html"''').filename_unsafe
'an example.html'
>>> rfc6266.parse_headers(
    '''attachment; '''
    '''filename*= UTF-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> rfc6266.parse_headers(
    '''attachment; '''
    '''filename="EURO rates"; '''
    '''filename*=utf-8''%e2%82%ac%20rates''').filename_unsafe
'€ rates'
>>> r = requests.get('http://example.com/€ rates')
>>> rfc6266.parse_requests_response(r).filename_unsafe
'€ rates'

As a note, though: this library does not like nonstandard whitespace in the header.
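The _unsafe suffix on filename_unsafe suggests the value is handed back as the server sent it, so it seems wise to reduce it to a bare file name before writing to disk (RFC 6266 section 4.3 also advises recipients to strip path components). A minimal sketch, using only the standard library; safe_basename and its default argument are hypothetical names, not part of rfc6266:

```python
import os

def safe_basename(filename_unsafe, default="download"):
    """Reduce a server-supplied filename to a bare, non-empty file name."""
    # Treat backslashes as separators too, then keep only the last component
    name = os.path.basename(filename_unsafe.replace("\\", "/"))
    # Reject empty names and the special directory entries
    if name in ("", ".", ".."):
        return default
    return name
```

This only removes directory components; it does not deal with reserved device names or characters that a particular filesystem forbids.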

Fornication answered 5/5, 2016 at 21:40 Comment(2)
What do you mean by "nonstandard whitespace"? Whitespace in places where the standard doesn't allow whitespace? Or Unicode whitespace that isn't part of 7-bit ASCII? - Bret
@Bret Haven't investigated enough to tell you for sure. Turns out parse_headers has a relaxed option that helps with this. Check out the code here. - Fornication

In 2022, it seems like the Python module rfc6266 recommended in the original answer has been abandoned and doesn't really work with the newer versions of Python.

The good news is that there is a replacement module (one of several, but this one actually works!) called pyrfc6266.

It can be installed with the following:

pip install pyrfc6266

and used in a similar way:

import pyrfc6266
pyrfc6266.parse_filename('attachment; filename="foo.html"')

or

import requests
import pyrfc6266
response = requests.get('http://httpbin.org/response-headers?Content-Disposition=attachment;%20filename%3d%22foo.html%22')
pyrfc6266.requests_response_to_filename(response)
Duchess answered 19/8, 2022 at 15:22 Comment(2)
Also notable that this is MIT licensed, while the abandoned lib I provided above is LGPL. - Fornication
To also get the attachment string: value, params = pyrfc6266.parse(content_disposition); assert value == "attachment"; content_filename = next(map(lambda p: p.value, filter(lambda p: p.name == "filename", params))) - Dwarfish

If you don't really need the result decoded from UTF-8:

import re

def getFilename(s):
    fname = re.findall(r"filename\*?=([^;]+)", s, flags=re.IGNORECASE)
    print(fname[0].strip().strip('"'))

But if UTF-8 is a must:

import re
from urllib.parse import unquote

def getFilename(s):
    # Prefer the extended filename* parameter, fall back to plain filename
    fname = re.findall(r"filename\*=([^;]+)", s, flags=re.IGNORECASE)
    if not fname:
        fname = re.findall(r"filename=([^;]+)", s, flags=re.IGNORECASE)
    if "utf-8''" in fname[0].lower():
        # Drop the charset prefix and percent-decode the rest as UTF-8
        fname = re.sub("utf-8''", '', fname[0], flags=re.IGNORECASE)
        fname = unquote(fname, encoding='utf-8')
    else:
        fname = fname[0]
    # clean space and double quotes
    print(fname.strip().strip('"'))

# example
getFilename('Attachment; filename=example.html')
getFilename('INLINE; FILENAME= "an example.html"')

getFilename("attachment;filename*= UTF-8''%e2%82%ac%20rates")
getFilename("attachment; filename=\"EURO rates\";filename*=utf-8''%e2%82%ac%20rates")

getFilename("attachment;filename=\"_____ _____ ___ __ ____ _____ Hekayt Bent.2017.mp3\";filename*=UTF-8''%D8%A7%D8%BA%D9%86%D9%8A%D9%87%20%D8%AD%D9%83%D8%A7%D9%8A%D8%A9%20%D8%A8%D9%86%D8%AA%20%D9%84%D9%80%20%D9%85%D8%AD%D9%85%D8%AF%20%D8%B4%D8%AD%D8%A7%D8%AA%D8%A9%20Hekayt%20Bent.2017.mp3")

Result:

example.html
an example.html
€ rates
€ rates
اغنيه حكاية بنت لـ محمد شحاتة Hekayt Bent.2017.mp3
Vilify answered 28/7, 2018 at 10:27 Comment(1)
If the string utf-8 is not at the beginning, should it be treated differently? For example, if the header is "attachment;filename*= UTF-8''%e2%82%ac%20rates UTF-8'' here" or "attachment;filename*= @UTF-8''%e2%82%ac%20rates @UTF-8'' here"? - Chipboard
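On the question in the comment above: one way to avoid matching utf-8'' in the middle of a value is to validate the whole filename* value against the charset'language'value shape from RFC 5987 before decoding. The following is only a sketch under that assumption; decode_ext_value and _EXT_VALUE are hypothetical names, and the character classes are my reading of the RFC 5987 grammar rather than a tested implementation of it.

```python
import re
from urllib.parse import unquote

# charset ' [language] ' value, where value is made of percent-escapes
# and a small set of literal characters (no spaces, no quotes)
_EXT_VALUE = re.compile(
    r"(?P<charset>[A-Za-z0-9!#$%&+\-^_`{}~]+)"
    r"'(?P<lang>[A-Za-z0-9\-]*)'"
    r"(?P<value>(?:%[0-9A-Fa-f]{2}|[A-Za-z0-9!#$&+\-.^_`|~])*)"
)

def decode_ext_value(raw):
    """Decode a filename* value, or return None if it is malformed."""
    m = _EXT_VALUE.fullmatch(raw.strip())
    if m is None:
        return None
    try:
        return unquote(m.group('value'), encoding=m.group('charset'),
                       errors='strict')
    except (LookupError, UnicodeDecodeError):
        # Unknown charset, or bytes that are invalid in that charset
        return None
```

With this check, both headers from the comment are rejected (the first contains spaces and quotes after the encoded part, the second starts with @), while the plain UTF-8''%e2%82%ac%20rates value decodes as before. A caller could then fall back to the plain filename parameter when None is returned.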
