Downloading all files of a particular type from a website using wget stops at the starting URL

The following did not work.

wget -r -A .pdf home_page_url

It stops with the following message:

....
Removing site.com/index.html.tmp since it should be rejected.
FINISHED

I don't know why it stops at the starting URL and does not follow the links in it to search for the given file type.

Is there any other way to recursively download all PDF files from a website?

Advowson asked 16/8, 2013 at 13:33 Comment(1)
Possible duplicate of How to download all links to .zip files on a given web page using wget/curl? – Coacher

It may be blocked by robots.txt. Try adding -e robots=off.

Other possible problems are cookie-based authentication or the server rejecting wget's user agent. See these examples.

EDIT: The dot in ".pdf" is wrong according to sunsite.univie.ac.at
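
Putting these hints together, something like the following might work (just a sketch combining the suggestions above, not a tested fix: no dot in the accept list, robots.txt ignored, and a browser-like user agent; home_page_url is the question's placeholder):

wget -r -A pdf -e robots=off --user-agent="Mozilla/5.0" home_page_url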

Saintjust answered 16/8, 2013 at 13:39 Comment(6)
Tried it, but same result. It's not a cookie-based website for sure; I could download it recursively using Python's urllib. Maybe the log will help you: it basically downloads the home page, says Removing <homepage url> since it should be rejected, then hits a page which has no links and stops there. What about the other links in the home page? – Advowson
Tried what? Removing the dot? Ignoring the robots.txt? Or simulating a browser? Or all of them? – Saintjust
Tried removing the dot and ignoring robots.txt. – Advowson
Might try the browser trick: http://www.askapache.com/linux/wget-header-trick.html – Saintjust
This user had a similar problem and it seems he's solved it. – Saintjust
For anybody coming across this answer via a search engine, please note that using '-A .ext' and '-A ext' is exactly the same. The documentation OP points to shows an example of both cases, and does not explicitly state anything about the '.'. – Apodaca

The following command works for me; it downloads the PDFs and pictures of a site:

wget -A pdf,jpg,png -m -p -E -k -K -np http://site/path/
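
For PDFs alone, the accept list can be trimmed to a single extension; here is a sketch based on the same command (untested, and http://site/path/ is a placeholder):

wget -A pdf -m -E -k -np http://site/path/

-m mirrors the site recursively, -E adjusts the extensions of saved files, -k converts links for local browsing, and -np stops wget from ascending to the parent directory.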
Javierjavler answered 3/6, 2015 at 6:27 Comment(0)

This is certainly because the links in the HTML don't end with /.

Wget will not follow this, as it thinks it's a file (but one that doesn't match your filter):

<a href="link">page</a>

But will follow this:

<a href="link/">page</a>

You can use the --debug option to see if it's the actual problem.
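
For instance, to record the accept/reject decisions in a log you can inspect afterwards (just an illustration; home_page_url is the question's placeholder and wget-debug.log an arbitrary file name):

wget -r -A pdf --debug -o wget-debug.log home_page_url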

I don't know any good solution for this. In my opinion this is a bug.

Seismism answered 5/12, 2019 at 0:2 Comment(0)

In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.

Here's my script for scraping a domain for PDFs (or any other filetype):


# Crawl the site and stream the wget log to stdout, reading it line by line
wget --no-verbose --mirror --spider https://example.com -o - | while read line
do
  [[ $line == *'200 OK' ]] || continue   # keep only successful responses
  [[ $line == *'.pdf'* ]] || continue    # keep only lines containing a PDF URL
  # strip the leading timestamp and the trailing " 200 OK", then download the URL
  echo "$line" | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done

Explanation: Recursively crawl https://example.com and pipe the log output (containing all scraped URLs) to a while read block. When a line from the log output contains a PDF URL, strip the leading timestamp (25 characters) and the trailing request info (7 characters) and use wget to download the PDF.
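
If the fixed character offsets ever break (for example with a different log format), the same idea can be expressed by pulling the URL out with a regular expression; this is only a sketch along the same lines, assuming the PDF URLs contain no spaces and end in .pdf:

wget --no-verbose --mirror --spider https://example.com -o - \
  | grep '200 OK' \
  | grep -oE 'https?://[^ ]*\.pdf' \
  | xargs -n 1 wget --no-verbose -P scraped-files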

Enthusiastic answered 6/12, 2022 at 15:6 Comment(0)
