How to get pdf filename with Python requests?

Asked 4/8, 2015 at 8:54 Answered 26/3, 2024 at 14:38

Solved python pdf python-requests filenames

I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download it already has a filename defined to save the pdf. How do I get this filename?

For example:

import requests
r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but I couldn't find a filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

Bonnibelle answered 4/8, 2015 at 8:54 Comment(2)

Interesting – I was going to say, "well obviously 0c9605301e48beda0f000000.pdf" (as that is in the request) but fortunately I decided to test it first. And Firefox wants to save it as "Mater Sci Eng B47 (1997) 33.pdf". – Osuna 4/8, 2015 at 9:4

How are you checking the headers? The filename is there, content-disposition : inline; filename="Mater Sci Eng B47 (1997) 33.pdf". FWIW, many PDFs have a Title embedded in them, but not all, and it may not be easy to access if the PDF is in binary form. – Hooey 4/8, 2015 at 9:18

100

It is specified in the Content-Disposition HTTP header. So to extract the name you would do:

import re
d = r.headers['content-disposition']
fname = re.findall("filename=(.+)", d)[0]

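The name is extracted from the header string with a regular expression (the re module). Note that this pattern captures everything after filename=, so any surrounding quotes or trailing parameters end up in the result.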
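As a rough sketch of how the regex approach could be extended to cope with quoted names and the `filename*=UTF-8''...` form mentioned in the comments, something like the following might work. The function name `filename_from_content_disposition` is made up for illustration, and the patterns are a simplification under the assumption of reasonably well-formed headers, not a full RFC 6266 parser.

```python
import re
from urllib.parse import unquote


def filename_from_content_disposition(header):
    """Return the filename found in a Content-Disposition value, or None.

    Hypothetical helper: handles filename*=charset''value, quoted names
    and bare names, but not every edge case the RFCs allow.
    """
    # Prefer the extended form: filename*=UTF-8''percent-encoded-name
    match = re.search(r"filename\*\s*=\s*([^']*)'[^']*'([^;]+)", header, flags=re.IGNORECASE)
    if match:
        charset = match.group(1) or "utf-8"
        return unquote(match.group(2).strip(), encoding=charset)

    # Otherwise fall back to the plain form, either "quoted" or bare
    match = re.search(r'filename\s*=\s*(?:"([^"]*)"|([^;]+))', header, flags=re.IGNORECASE)
    if match:
        name = match.group(1) if match.group(1) is not None else match.group(2)
        return name.strip()

    return None
```

With requests, this would be called as `filename_from_content_disposition(r.headers.get("content-disposition", ""))`, which returns None when the header is absent.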

Oman answered 4/8, 2015 at 9:25 Comment(8)

This wouldn't work if the file name is encoded as utf8. Any suggestion there? – Same 21/2, 2017 at 4:14

findall returns a list of matches. You would need an index like this fname = re.findall("filename=(.+)", d)[0]. – Cattima 14/11, 2018 at 11:45

This one is incomplete, a filename can be enclosed in quotes. – Chasten 18/5, 2020 at 23:16

@Chasten try using "filename=\"(.+)\"" to remove quotes – Happygolucky 15/10, 2020 at 1:42

Just a side case that sometimes expected filenames are not provided within headers, especially with social media CDN links. In that case, you can formulate your own base name (maybe parse the url for the root filename that you would like to use), and then ascertain the correct extension to use as a suffix with something like resp.headers['Content-Type'].split('/')[-1]. – Sprightly 17/6, 2021 at 17:17

In my case, the regex did not work because my 'content-disposition' also contains filename*=UTF-8: 'Content-Disposition': "attachment; filename=NameOfTheFile.zip; filename*=UTF-8''NameOfTheFile.zip" – Barkeeper 3/10, 2021 at 10:17

You can use cgi.parse_header and email.header.decode_header to parse the file name properly – Footloose 16/3, 2023 at 7:18

@Barkeeper @tony-abou-assaleh, I use unquote(header.split("filename*=")[1].replace('UTF-8\'\'',"")) for Unicode – Mustache 29/6, 2023 at 8:35

Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL:

import re
import requests
from requests.exceptions import RequestException


url = 'http://www.example.com/downloads/sample.pdf'

try:
    with requests.get(url) as r:

        fname = ''
        if "Content-Disposition" in r.headers.keys():
            fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
        else:
            fname = url.split("/")[-1]

        print(fname)
except RequestException as e:
    print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
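For what it's worth, the URL fallback can also be done with the standard library alone, which avoids picking up query strings or percent-encoded characters such as %20. The sketch below is only one possible way to do it; `filename_from_url` is a name invented here.

```python
from posixpath import basename
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded.

    Hypothetical helper: returns an empty string when the path ends in '/'.
    """
    path = urlparse(url).path        # drops the query string and fragment
    return unquote(basename(path))   # 'my%20file.pdf' -> 'my file.pdf'
```

In the snippet above, this could replace `url.split("/")[-1]` in the else branch.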

Cattima answered 14/11, 2018 at 11:55 Comment(1)

I suggest calling urllib.parse.unquote in the else clause so you don't get %20s in the filename. – Bozeman 24/6, 2021 at 0:4

Apparently, for this particular resource it is in:

r.headers['content-disposition']

Don't know if it is always the case, though.

Drool answered 4/8, 2015 at 9:16 Comment(1)

Not all responses contain the 'content-disposition' header, but as per one of the comments, it seems they are available in this case. – Sweater 23/6, 2018 at 22:20

Easy Python 3 implementation to get the filename from Content-Disposition:

import requests
response = requests.get("<your-url>")  # replace the placeholder with the real URL
print(response.headers.get("Content-Disposition").split("filename=")[1])
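The one-liner raises an exception when the header is missing and keeps any surrounding quotes. A slightly more defensive variant might look like the sketch below. The name `filename_or_default` and the default `"output.bin"` are assumptions for illustration, and a plain dict stands in for `response.headers` (the real object is case-insensitive, a plain dict is not).

```python
def filename_or_default(headers, default="output.bin"):
    """Return the filename from Content-Disposition, or a default.

    Hypothetical helper: `headers` is any mapping with a .get() method,
    such as response.headers from requests.
    """
    value = headers.get("Content-Disposition")
    if not value or "filename=" not in value:
        return default
    # Take the text after filename= up to the next parameter, if any
    name = value.split("filename=", 1)[1].split(";", 1)[0]
    # Drop whitespace and optional surrounding double quotes
    return name.strip().strip('"') or default
```

It would be used as `print(filename_or_default(response.headers))`.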

Emilieemiline answered 5/10, 2020 at 22:20 Comment(3)

Be careful in case there is no "Content-Disposition" header! – Cant 21/6, 2021 at 10:12

could use something like response.headers.get("Content-Disposition","filename=output.bin") to cover the missing header. – Tabshey 11/3, 2022 at 20:46

You have to also remove " because it is filename="xxxx.xxx" – Chamorro 18/6, 2024 at 10:56

You can use werkzeug to parse headers that carry options: https://werkzeug.palletsprojects.com/en/0.15.x/http/#werkzeug.http.parse_options_header

>>> import werkzeug.http
>>> werkzeug.http.parse_options_header('text/html; charset=utf8')
('text/html', {'charset': 'utf8'})

Thorlay answered 21/8, 2019 at 13:58 Comment(1)

This is the most robust option as it removes optional quotes. – Residual 17/1, 2022 at 16:57

Use urllib.request instead of requests because then you can do urllib.request.urlopen(...).headers.get_filename(), which is safer than some of the other answers for the following reason:

If the [Content-Disposition] header does not have a filename parameter, this method falls back to looking for the name parameter on the Content-Type header.

After that, even safer would be to additionally fall back to the filename in the URL, as another answer does.
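The same get_filename() logic can probably be reused without switching away from requests, since the headers object returned by urlopen is built on email.message.Message. The sketch below copies the two relevant headers into a Message and asks it for the filename. `filename_from_headers` is a made-up name, and a plain dict is used here in place of a real response's headers.

```python
from email.message import Message


def filename_from_headers(headers):
    """Return the filename advertised by the response headers, or None.

    Hypothetical helper: `headers` is any mapping with .get(), e.g.
    response.headers from requests.
    """
    msg = Message()
    # Copy only the headers that Message.get_filename() looks at
    for key in ("Content-Disposition", "Content-Type"):
        value = headers.get(key)
        if value:
            msg[key] = value
    # Checks Content-Disposition's filename parameter first, then falls
    # back to the name parameter on Content-Type
    return msg.get_filename()
```

When this returns None, falling back to the filename in the URL is still an option.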

Trossachs answered 1/11, 2022 at 14:17 Comment(0)

According to the documentation, neither Content-Disposition nor its filename attribute is required. Also, I checked dozens of links on the internet and found no responses with the Content-Disposition header. So, in most cases, I wouldn't rely on it much and would just retrieve this information from the request URL (note: I'm taking it from req.url because there could be a redirect and we want the real filename). I used werkzeug because it looks more robust and handles both quoted and unquoted filenames. Eventually, I came up with this solution (it needs Python 3.8 or later):

from urllib.parse import urlparse

import requests
import werkzeug


def get_filename(url: str):
    try:
        with requests.get(url) as req:
            if content_disposition := req.headers.get("Content-Disposition"):
                param, options = werkzeug.http.parse_options_header(content_disposition)
                if param == 'attachment' and (filename := options.get('filename')):
                    return filename

            path = urlparse(req.url).path
            name = path[path.rfind('/') + 1:]
            return name
    except requests.exceptions.RequestException as e:
        raise e

I wrote some tests using pytest and requests_mock:

import pytest
import requests
import requests_mock

from main import get_filename

TEST_URL = 'https://pwrk.us/report.pdf'


@pytest.mark.parametrize(
    'headers,expected_filename',
    [
        (
                {'Content-Disposition': 'attachment; filename="filename.pdf"'},
                "filename.pdf"
        ),
        (
                # The string following filename should always be put into quotes;
                # but, for compatibility reasons, many browsers try to parse unquoted names that contain spaces.
                # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#directives
                {'Content-Disposition': 'attachment; filename=filename with spaces.pdf'},
                "filename with spaces.pdf"
        ),
        (
                {'Content-Disposition': 'attachment;'},
                "report.pdf"
        ),
        (
                {'Content-Disposition': 'inline;'},
                "report.pdf"
        ),
        (
                {},
                "report.pdf"
        )
    ]
)
def test_get_filename(headers, expected_filename):
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, text='resp', headers=headers)
        assert get_filename(TEST_URL) == expected_filename


def test_get_filename_exception():
    with requests_mock.Mocker() as m:
        m.get(TEST_URL, exc=requests.exceptions.RequestException)
        with pytest.raises(requests.exceptions.RequestException):
            get_filename(TEST_URL)

Ineffable answered 21/8, 2022 at 15:5 Comment(0)

Using Python's standard library:

from email.message import EmailMessage

msg = EmailMessage()
msg["Content-Disposition"] = response.headers.get("Content-Disposition")
filename = msg.get_filename()

Like others said, the file name is in the "Content-Disposition" header.

The cgi standard library module used to be the way to parse it, but it has been deprecated since Python 3.11 and was removed in Python 3.13.

The currently recommended way of parsing is using the email module, which is also part of the standard library.

Slipper answered 14/11, 2023 at 15:58 Comment(1)

This is nice, but EmailMessage from Python 3.12 does not implement RFC 5987 correctly. Setting msg["Content-Disposition"] to "attachment; filename* = utf-8''example.csv" (valid according to the RFC) strips everything after the ;. – Hurlyburly 12/4, 2024 at 11:38

For me (with requests 2.31), it worked with the lowercase header name and the code below:

import requests

response = requests.get(file_url)
content_disposition = response.headers.get("content-disposition")
file_name = content_disposition.split("=")[1]

Huoh answered 26/3, 2024 at 14:38 Comment(0)
