In my version of wget (GNU Wget 1.21.3), the -A/--accept and -r/--recursive flags don't play nicely with each other.
Here's my script for scraping a domain for PDFs (or any other filetype):
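wget --no-verbose --mirror --spider https://example.com -o - | while read -r line
do
    # Keep only successful responses whose log line mentions a PDF
    [[ $line == *'200 OK' ]] || continue
    [[ $line == *'.pdf'* ]] || continue
    # Trim the timestamp/"URL:" prefix and the trailing status, then download
    echo "$line" | cut -c25- | rev | cut -c7- | rev | xargs wget --no-verbose -P scraped-files
done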
Explanation: Recursively crawl https://example.com in spider mode (nothing is saved) and pipe the log output, which lists every URL visited, to a while read loop. When a log line ends in 200 OK and contains a PDF URL, strip the leading timestamp and URL: label (cut -c25-) and the trailing 200 OK status (rev | cut -c7- | rev), then use wget to download the PDF into scraped-files. To scrape another filetype, change the .pdf pattern.
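The fixed column offsets depend on the exact log layout, so here is a minimal sketch of the same extraction step using shell parameter expansion instead of cut and rev. It assumes spider log lines look like "TIMESTAMP URL: <url> 200 OK"; the helper name extract_pdf_url and the sample line are hypothetical, not part of the script above or of wget itself.

```sh
# Hypothetical helper: print the URL from one spider log line, or return 1
# if the line is not a successful (200 OK) response for a .pdf URL.
# Assumes the layout "TIMESTAMP URL: <url> 200 OK".
extract_pdf_url() {
    case $1 in
        *.pdf*'200 OK') ;;      # contains ".pdf" and ends in "200 OK"
        *) return 1 ;;
    esac
    url=${1#*URL: }             # drop everything up to and including "URL: "
    url=${url% 200 OK}          # drop the trailing " 200 OK"
    printf '%s\n' "$url"
}

# Sample line (format assumed, not captured from a real run)
extract_pdf_url '2023-01-01 12:00:00 URL: https://example.com/docs/report.pdf 200 OK'
# prints https://example.com/docs/report.pdf
```

Inside the loop this could replace the cut/rev pipeline, for example: url=$(extract_pdf_url "$line") && wget --no-verbose -P scraped-files "$url". Matching on the "URL: " label rather than a character count keeps working if the timestamp width ever differs.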