How to get filename from Content-Disposition in headers

4

35

I am downloading a file with Mechanize and in response headers there is a string:

Content-Disposition: attachment; filename=myfilename.txt

Is there a quick standard way to get that filename value? What I have in mind now is this:

filename = f[1]['Content-Disposition'].split('; ')[1].replace('filename=', '')

But it looks like a quick'n'dirty solution.

Toodleoo answered 7/11, 2011 at 11:34 Comment(2)
Just as a warning, the filename can be quoted (like most message headers) and have escape sequences. So quick string hacks might lead to problems.Sottish
Check https://mcmap.net/q/450034/-how-to-determine-the-filename-of-content-downloaded-with-http-in-python/1136400Greensboro
49

First get the value of the header by using mechanize, then parse the header using the builtin cgi module.

To demonstrate:

>>> import mechanize
>>> browser = mechanize.Browser()
>>> response = browser.open('http://example.com/your/url')
>>> info = response.info()
>>> header = info.getheader('Content-Disposition')
>>> header
'attachment; filename=myfilename.txt'

The header value can then be parsed:

>>> import cgi               
>>> value, params = cgi.parse_header(header)
>>> value
'attachment'
>>> params
{'filename': 'myfilename.txt'}

params is a simple dict so params['filename'] is what you need. It doesn't matter whether the filename is wrapped in quotes or not.

Ontologism answered 8/1, 2015 at 12:42 Comment(2)
Note that this doesn't work if your file name is encoded, in which case params would contain 'filename*' instead of 'filename', and you would need to unquote and decode the filename into a unicode string.Hf
filename*=utf-8''file.txt is nonstandard, see also support encoded filename in Content-Disposition and pyrfc6266 via how to determine the filename of content downloaded with HTTP in Python?Norward
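For the encoded case raised in the comments above, here is a minimal Python 3 sketch of the "unquote and decode" step. It assumes the parameter value has the charset'language'percent-encoded-text shape shown in the comment; the helper name decode_ext_value is hypothetical and not part of any library.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value):
    """Decode a value shaped like  charset'language'percent-encoded-text.

    Hypothetical helper for illustration only. Raises ValueError if the
    value does not contain the two apostrophe separators.
    """
    charset, _language, encoded = ext_value.split("'", 2)
    # Fall back to UTF-8 if the charset part is empty (an assumption).
    return unquote(encoded, encoding=charset or "utf-8", errors="strict")


plain = decode_ext_value("utf-8''file.txt")                  # 'file.txt'
euro = decode_ext_value("UTF-8'en'%E2%82%AC%20rates.txt")    # '€ rates.txt'
```

The language part between the apostrophes is discarded here, which matches how the other answers on this page treat it.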
9

These regular expressions are based on the grammar from RFC 6266, but modified to accept headers without disposition-type, e.g. Content-Disposition: filename=example.html

i.e. [ disposition-type ";" ] disposition-parm ( ";" disposition-parm )* / disposition-type

It will handle filename parameters with and without quotes, and unquote quoted pairs from values in quotes, e.g. filename="foo\"bar" -> foo"bar

It will handle filename* extended parameters and prefer a filename* extended parameter over a filename parameter regardless of the order they appear in the header

It strips folder name information, e.g. /etc/passwd -> passwd, and it defaults to the basename from the URL path in the absence of a filename parameter (or header, or if the parameter value is empty string)

The token and qdtext regular expressions are based on the grammar from RFC 2616, the mimeCharset and valueChars regular expressions are based on the grammar from RFC 5987, and the language regular expression is based on the grammar from RFC 5646

import re, urllib
from os import path
from urlparse import urlparse

# content-disposition = "Content-Disposition" ":"
#                        disposition-type *( ";" disposition-parm )
# disposition-type    = "inline" | "attachment" | disp-ext-type
#                     ; case-insensitive
# disp-ext-type       = token
# disposition-parm    = filename-parm | disp-ext-parm
# filename-parm       = "filename" "=" value
#                     | "filename*" "=" ext-value
# disp-ext-parm       = token "=" value
#                     | ext-token "=" ext-value
# ext-token           = <the characters in token, followed by "*">

token = '[-!#-\'*+.\dA-Z^-z|~]+'
qdtext='[]-~\t !#-[]'
mimeCharset='[-!#-&+\dA-Z^-z]+'
language='(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}(?:-[A-Za-z]{3}){,2})?|[A-Za-z]{4,8})(?:-[A-Za-z]{4})?(?:-(?:[A-Za-z]{2}|\d{3}))(?:-(?:[\dA-Za-z]{5,8}|\d[\dA-Za-z]{3}))*(?:-[\dA-WY-Za-wy-z](?:-[\dA-Za-z]{2,8})+)*(?:-[Xx](?:-[\dA-Za-z]{1,8})+)?|[Xx](?:-[\dA-Za-z]{1,8})+|[Ee][Nn]-[Gg][Bb]-[Oo][Ee][Dd]|[Ii]-[Aa][Mm][Ii]|[Ii]-[Bb][Nn][Nn]|[Ii]-[Dd][Ee][Ff][Aa][Uu][Ll][Tt]|[Ii]-[Ee][Nn][Oo][Cc][Hh][Ii][Aa][Nn]|[Ii]-[Hh][Aa][Kk]|[Ii]-[Kk][Ll][Ii][Nn][Gg][Oo][Nn]|[Ii]-[Ll][Uu][Xx]|[Ii]-[Mm][Ii][Nn][Gg][Oo]|[Ii]-[Nn][Aa][Vv][Aa][Jj][Oo]|[Ii]-[Pp][Ww][Nn]|[Ii]-[Tt][Aa][Oo]|[Ii]-[Tt][Aa][Yy]|[Ii]-[Tt][Ss][Uu]|[Ss][Gg][Nn]-[Bb][Ee]-[Ff][Rr]|[Ss][Gg][Nn]-[Bb][Ee]-[Nn][Ll]|[Ss][Gg][Nn]-[Cc][Hh]-[Dd][Ee]'
valueChars = '(?:%[\dA-F][\dA-F]|[-!#$&+.\dA-Z^-z|~])*'
dispositionParm = '[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\s*=\s*(?:({token})|"((?:{qdtext}|\\\\[\t !-~])*)")|[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\*\s*=\s*({mimeCharset})\'(?:{language})?\'({valueChars})|{token}\s*=\s*(?:{token}|"(?:{qdtext}|\\\\[\t !-~])*")|{token}\*\s*=\s*{mimeCharset}\'(?:{language})?\'{valueChars}'.format(**locals())

try:
  m = re.match('(?:{token}\s*;\s*)?(?:{dispositionParm})(?:\s*;\s*(?:{dispositionParm}))*|{token}'.format(**locals()), result.headers['Content-Disposition'])

except KeyError:
  name = path.basename(urllib.unquote(urlparse(url).path))

else:
  if not m:
    name = path.basename(urllib.unquote(urlparse(url).path))

  # Many user agent implementations predating this specification do not
  # understand the "filename*" parameter.  Therefore, when both "filename"
  # and "filename*" are present in a single header field value, recipients
  # SHOULD pick "filename*" and ignore "filename"

  elif m.group(8) is not None:
    name = urllib.unquote(m.group(8)).decode(m.group(7))

  elif m.group(4) is not None:
    name = urllib.unquote(m.group(4)).decode(m.group(3))

  elif m.group(6) is not None:
    name = re.sub('\\\\(.)', r'\1', m.group(6))

  elif m.group(5) is not None:
    name = m.group(5)

  elif m.group(2) is not None:
    name = re.sub('\\\\(.)', r'\1', m.group(2))

  else:
    name = m.group(1)

  # Recipients MUST NOT be able to write into any location other than one to
  # which they are specifically entitled

  if name:
    name = path.basename(name)

  else:
    name = path.basename(urllib.unquote(urlparse(url).path))
Harbison answered 18/8, 2012 at 9:47 Comment(1)
Alternatively, the regular expressions can be simplified by not validating the language tag, especially since it is ignored. The language tag is optional and can contain only hyphens, digits, and letters, in any number, so just accept [-\dA-Za-z]*:

dispositionParm = '[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\s*=\s*(?:({token})|"((?:{qdtext}|\\\\[\t !-~])*)")|[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]\*\s*=\s*({mimeCharset})\'[-\dA-Za-z]*\'({valueChars})|{token}\s*=\s*(?:{token}|"(?:{qdtext}|\\\\[\t !-~])*")|{token}\*\s*=\s*{mimeCharset}\'[-\dA-Za-z]*\'{valueChars}'.format(**locals())
Harbison
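The regex answer above is written for Python 2 (urllib.unquote, urlparse). As a rough Python 3 sketch of just its last two steps, stripping folder information and falling back to the URL path, something like the following could be used. The function name safe_name and its arguments are hypothetical, and this does not parse the header itself.

```python
from os import path
from urllib.parse import unquote, urlparse


def safe_name(candidate, url):
    """Return a bare file name, never a path.

    candidate: the filename taken from the header, or None/'' if absent.
    url: the URL that was downloaded, used as the fallback source.
    """
    # Drop any directory part so the name cannot point outside the
    # download folder, e.g. '/etc/passwd' becomes 'passwd'.
    name = path.basename(candidate) if candidate else ''
    if not name:
        # No usable name in the header: use the last URL path segment.
        name = path.basename(unquote(urlparse(url).path))
    return name


from_header = safe_name('/etc/passwd', 'http://example.com/a/b.txt')
from_url = safe_name(None, 'http://example.com/files/my%20file.txt')
```

Note that the result can still be an empty string when the URL path ends in a slash, so a caller would probably want a default name for that case.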
2

The cgi module recommended in the top answer was deprecated in Python 3.11 and removed in Python 3.13. See PEP 594.

The official recommendation (see referenced PEP) for a standard library alternative to cgi.parse_header is email.message.Message.

And here is an example of how to get the filename value from a content-disposition header:

>>> from email.message import Message
>>> content_disposition_header = 'attachment; filename=myfilename.txt'
>>> msg = Message()
>>> msg['content-disposition'] = content_disposition_header
>>> msg.get_filename()
'myfilename.txt'

Note: The above should work across all versions of Python >= 2.2. The email package was introduced in Python 2.2, and the __setitem__ and get_filename methods of the Message class have kept the same API since then.

Demerit answered 28/2 at 10:1 Comment(0)
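To illustrate how the email.message approach above behaves with the trickier header forms mentioned elsewhere on this page, here is a small Python 3 sketch. The wrapper name filename_from_header is hypothetical; only Message, item assignment, and get_filename() come from the standard library.

```python
from email.message import Message


def filename_from_header(header_value):
    """Parse a Content-Disposition value and return its filename, or None."""
    msg = Message()
    msg['content-disposition'] = header_value
    return msg.get_filename()


# Quoted value containing a space: the quotes are removed.
quoted = filename_from_header('attachment; filename="my file.txt"')
# quoted == 'my file.txt'

# Encoded filename* parameter: percent-decoded using the named charset.
encoded = filename_from_header(
    "attachment; filename*=UTF-8''r%C3%A9sum%C3%A9.txt")
# encoded == 'résumé.txt'

# No filename parameter at all: get_filename() returns None.
missing = filename_from_header('inline')
```

The returned name is not sanitized, so it is still worth passing it through os.path.basename before writing to disk.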
1

I would try something like:

import re
filename = re.findall(r"filename=(\S+)", f[1]['Content-Disposition'])

This handles quotes and URL escaping on the filenames.

Contradance answered 7/11, 2011 at 18:18 Comment(1)
But this returns a list, not a string, so you probably want filename[0] or something. Also it returns the quotes as part of the filename. So not really a working example.Redblooded
