How to download pdf files using Python?

I was looking for a way to download pdf files in python, and I saw answers on other questions recommending the urllib module. I tried to download a pdf file using it, but when I try to open the downloaded file, a message shows up saying that the file cannot be opened.

This is the code I used-

import urllib
urllib.urlretrieve("http://papers.gceguide.com/A%20Levels/Mathematics%20(9709)/9709_s11_qp_42.pdf", "9709_s11_qp_42.pdf")

What am I doing wrong? Also, the file automatically saves to the directory my python file is in. How do I change the location to which it gets saved?

Edit- I tried again with the link to a sample pdf, http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

The code is working with this link, so why won't it work for the other one?

Patrol answered 10/5, 2017 at 12:8 Comment(9)
You can use requests for this task: #34503912 - Mohenjodaro
@DavidZemens I won't call it a duplicate. The OP is concerned about his solution not working rather than finding a different one. - Mohenjodaro
When I go to that URL I first get a captcha (by Cloudflare) to prove that I'm not a robot, and only then can I access the PDF. Cloudflare sites also often restrict access based on user agent. If you open the file in a text editor you'll probably find HTML there instead of a PDF. - Ortego
You didn't actually download a PDF from that URL - you downloaded the CAPTCHA form needed to access the PDF. - Tintinnabulation
So is there any way I can download files like that?? - Patrol
You'd probably need to complete the captcha in a browser, take the cookies that were set and the user agent from the browser, and use those in your request. That may work for a while, but you may be presented with a new captcha after some time. - Ortego
@Ortego uhh how would you do that lmao - Patrol
If you use the above-mentioned requests module, sending cookies and a custom user agent should be easy. Where to find them depends on your browser. - Ortego
Try a crawler; you will need to start a session on the website. - Demur
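The cookie and user agent suggestion in the comments above could look roughly like the sketch below. It uses only the standard library's urllib.request (Python 3) instead of the requests module mentioned in the comments, so it runs without extra installs. The function names build_request and download are made up for illustration, and the user agent string and cookie values are placeholders you would have to copy from your own browser after solving the captcha. There is no guarantee the site keeps accepting them.

```python
import urllib.request


def build_request(url, user_agent, cookies):
    """Build a GET request that carries a browser-like User-Agent and cookies.

    `cookies` is a plain dict of cookie name -> value (hypothetical values,
    copied by hand from a browser session).
    """
    headers = {"User-Agent": user_agent}
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


def download(url, dest_path, user_agent, cookies):
    """Fetch `url` with the given headers and write the raw bytes to `dest_path`."""
    request = build_request(url, user_agent, cookies)
    with urllib.request.urlopen(request, timeout=30) as response:
        data = response.read()
    with open(dest_path, "wb") as f:
        f.write(data)
    return len(data)  # number of bytes written
```

To use it, call download(url, "9709_s11_qp_42.pdf", user_agent, cookies) with the user agent string and cookie dict taken from your browser. If the site answers with a new captcha page, the saved file will again contain HTML rather than a PDF.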

Try this. It works.

import requests

url = 'https://pdfs.semanticscholar.org/c029/baf196f33050ceea9ecbf90f054fd5654277.pdf'
r = requests.get(url, stream=True)

# Open the target file in binary mode; change the path to save it elsewhere.
with open('C:/Users/MICRO HARD/myfile.pdf', 'wb') as f:
    f.write(r.content)
Scintillate answered 14/8, 2017 at 8:40 Comment(2)
When I attempt to open the saved file, I get: "Adobe Acrobat Reader could not open 'D:/myfile.pdf' because it is either not a supported file type or because the file has been damaged..." - Sixpack
Turns out this code does work. The PDF at the URL in the code above happens to be corrupt. Pointing it to the PDF I wanted worked fine. - Sixpack
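One quick way to tell whether a download is a PDF at all, rather than an HTML captcha or error page, is to check that the bytes start with the PDF signature %PDF- before saving them. The sketch below does that and also writes the file into a directory of your choice, which covers the "how do I change the save location" part of the question. The names looks_like_pdf and save_pdf are made up for illustration. The check only looks at the first five bytes, so it will not catch a PDF that is damaged further in.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # every PDF file is expected to begin with these bytes


def looks_like_pdf(data):
    """Return True if the downloaded bytes start with the PDF signature."""
    return data.startswith(PDF_MAGIC)


def save_pdf(data, directory, filename):
    """Write `data` to directory/filename, refusing anything that is not a PDF."""
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF (maybe an HTML page?)")
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = target_dir / filename
    target.write_bytes(data)
    return target
```

With the requests code above, this would be called as save_pdf(r.content, 'C:/Users/MICRO HARD', 'myfile.pdf').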

You can also use wget to download pdfs via a link:

import wget

wget.download(link)

Here's a guide about how to search & download all pdf files from a webpage in one go: https://medium.com/the-innovation/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48

Publea answered 24/12, 2020 at 9:21 Comment(0)
  • You can't download the PDF content from the given URL using requests or urllib.
  • The URL first points to another web page, and the PDF loads only after that page.
  • If in doubt, save the response as HTML instead of PDF and open it.
  • You need a headless browser such as PhantomJS to download files from this kind of web page.
Eatage answered 10/5, 2017 at 13:52 Comment(1)
How would a headless browser be of any use in this case? You still need to complete the captcha, which you can't do in a headless browser. - Ortego