How to exclude specific file types from a wget download?

How do I make wget skip .jpg and .png files? I want to download only .html files.

I am trying:

wget  -R index.html,*tiff,*pdf,*jpg -m http://example.com/

but it's not working.

Jointer answered 14/7, 2013 at 10:25 Comment(1)
wget *.html -m http://web.123.org ? Offtopic question.Gerik
C
65

Use the

 --reject jpg,png  --accept html

options to exclude/include files with certain extensions, see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.

Put patterns with wildcard characters in quotes, otherwise your shell will expand them, see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
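
To see why the quotes matter, here is a minimal sketch that prints the command line instead of running it. The scratch directory and the file names a.jpg and b.jpg are made up for the demonstration; nothing is downloaded.

```shell
# Create a scratch directory containing two files that match *.jpg.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.jpg b.jpg

# echo shows the arguments wget would receive in each case.
unquoted=$(echo wget -R *.jpg http://example.com/)
quoted=$(echo wget -R "*.jpg" http://example.com/)

echo "$unquoted"   # wget -R a.jpg b.jpg http://example.com/
echo "$quoted"     # wget -R *.jpg http://example.com/

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

In the unquoted case the shell has already replaced the pattern with local file names, so wget never sees the wildcard. If no local file matches, most shells pass the pattern through unchanged, which is why the problem can go unnoticed.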

Cherriecherrita answered 14/7, 2013 at 10:32 Comment(1)
This question is partly a duplicate of superuser.com/questions/487198/…Cherriecherrita
C
17
# -r  : recursive retrieval
# -nH : do not create host-prefixed directories
# -nd : do not create directories; save all files to the current directory
# -np : never ascend to the parent directory when retrieving recursively
# -R  : reject (do not download) files matching these patterns
# -A  : accept only files matching these patterns (*.html in this case)

For instance:

wget -r -nH -nd -np -A "*.html" -R "*.gz, *.tar"  http://www1.ncdc.noaa.gov/pub/data/noaa/1990/
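
As a rough illustration of how -A and -R lists decide which names are kept, the sketch below models the rule described in the wget manual: an entry containing a wildcard character is treated as a pattern, anything else as a suffix. The function in_list and the sample file names are hypothetical, and this is a simplified model, not wget's actual code (for example, it does not skip blanks after commas).

```shell
# in_list NAME LIST: succeed if NAME matches any entry of the
# comma-separated LIST (wildcard entry = pattern, plain entry = suffix).
in_list() {
    name=$1
    list=$2
    while [ -n "$list" ]; do
        entry=${list%%,*}                 # take the first entry
        case $list in
            *,*) list=${list#*,} ;;       # drop it together with its comma
            *)   list= ;;                 # that was the last entry
        esac
        case $entry in
            *\**|*\?*|*\[*|*\]*) pattern=$entry ;;   # contains a wildcard
            *)                   pattern="*$entry" ;; # plain suffix
        esac
        case $name in
            $pattern) return 0 ;;
        esac
    done
    return 1
}

accept='*.html'
reject='*.gz,*.tar'
for f in index.html notes.txt data.tar; do
    # A name is kept when it is accepted and not rejected.
    if in_list "$f" "$accept" && ! in_list "$f" "$reject"; then
        echo "keep $f"
    else
        echo "skip $f"
    fi
done
```

With these sample names, only index.html is kept: notes.txt is not in the accept list, and data.tar matches the reject list.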
Cassiecassil answered 8/6, 2016 at 5:37 Comment(4)
Why the -1 on this answer? I didn't post it, but I don't see why it was downvoted.Harcourt
No idea. I had been looking for the -nd option, which is not listed under 'Recursive Retrieval Options' in the manual, so I upvoted.Gittle
I believe -A *.html is an error -- it will get expanded by the shell. It should be -A "*.html".Casemaker
Any unquoted * may be expanded by the shell if a local file matches the pattern, so every such pattern should be quoted.Reeta
Z
2

Worked example to download all files excluding archives:

wget -r -k -l 7 -E -nc \
 -R "*.gz, *.tar, *.tgz, *.zip, *.pdf, *.tif, *.bz, *.bz2, *.rar, *.7z" \
 -erobots=off \
 --user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" \
 http://misis.ru/
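
A long reject list like this can be easier to maintain if it is built from a plain list of extensions. The following is only a sketch: the variable names archive_exts and reject_list are made up, and the wget call is left as a comment so that running the snippet downloads nothing.

```shell
# Extensions to skip (taken from the -R list in the command above).
archive_exts="gz tar tgz zip pdf tif bz bz2 rar 7z"

reject_list=""
for ext in $archive_exts; do
    # Append "*.ext", inserting a comma only between entries.
    reject_list="${reject_list:+$reject_list,}*.$ext"
done

echo "$reject_list"
# prints: *.gz,*.tar,*.tgz,*.zip,*.pdf,*.tif,*.bz,*.bz2,*.rar,*.7z

# The list can then be passed as a single quoted argument, for example:
#   wget -r -k -l 7 -E -nc -R "$reject_list" http://misis.ru/
```

Keeping "$reject_list" in double quotes when passing it to wget prevents the shell from expanding the wildcards.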
Zuleika answered 19/7, 2018 at 8:12 Comment(0)
S
1

This is what I get from wget --help:

Recursive accept/reject:
  -A,  --accept=LIST               comma-separated list of accepted extensions.
  -R,  --reject=LIST               comma-separated list of rejected extensions.
       --accept-regex=REGEX        regex matching accepted URLs.
       --reject-regex=REGEX        regex matching rejected URLs.
       --regex-type=TYPE           regex type (posix|pcre).
  -D,  --domains=LIST              comma-separated list of accepted domains.
       --exclude-domains=LIST      comma-separated list of rejected domains.
       --follow-ftp                follow FTP links from HTML documents.
       --follow-tags=LIST          comma-separated list of followed HTML tags.
       --ignore-tags=LIST          comma-separated list of ignored HTML tags.
  -H,  --span-hosts                go to foreign hosts when recursive.
  -L,  --relative                  follow relative links only.
  -I,  --include-directories=LIST  list of allowed directories.
  --trust-server-names             use the name specified by the redirection
                                   url last component.
  -X,  --exclude-directories=LIST  list of excluded directories.
  -np, --no-parent                 don't ascend to the parent directory.

So you can use -R or --reject to reject extensions. The short option takes the list as a separate argument (with -R=..., the = sign would become part of the first pattern):

wget -R "index.html,*.tiff,*.pdf,*.jpg" http://example.com/

In my case, this is the final command I used to recursively download/update non-HTML files from an indexed website directory:

wget -N -r -np -nH --cut-dirs=3 -nv -R "*.htm*,*.html" http://example.com/1/2/3/
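
A note on attaching the list to the short option with an equals sign: under POSIX-style option parsing, everything after the option letter is the argument, so -R=LIST passes =LIST rather than LIST. The sketch below uses the shell's own getopts builtin as a stand-in to show this; that wget's parser follows the same convention is an assumption here, and parse_reject is a hypothetical helper.

```shell
# parse_reject ARGS...: print the argument that a POSIX-style parser
# would hand to the -R option.
parse_reject() {
    OPTIND=1
    while getopts "R:" opt "$@"; do
        case $opt in
            R) printf '%s\n' "$OPTARG" ;;
        esac
    done
}

with_equals=$(parse_reject -R="*.htm*,*.html")
with_space=$(parse_reject -R "*.htm*,*.html")

echo "$with_equals"   # =*.htm*,*.html
echo "$with_space"    # *.htm*,*.html
```

In the first form the leading pattern becomes =*.htm*, which no ordinary file name matches. Long options do take the equals sign, so --reject="*.htm*,*.html" is the other safe spelling.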
Snuffbox answered 15/3, 2022 at 15:4 Comment(0)
