How do I ignore .jpg and .png files in wget? I want to include only .html files.
I am trying:
wget -R index.html,*tiff,*pdf,*jpg -m http://example.com/
but it's not working.
Use the --reject jpg,png --accept html options to exclude or include files with certain extensions; see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.
Put patterns containing wildcard characters in quotes, otherwise your shell will expand them; see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
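To see why the quoting matters, here is a minimal sketch that needs no network access. It creates two throwaway .html files in a temporary directory (the file names are made up for the illustration) and counts the arguments a command would receive with and without quotes.

```shell
# Demonstration only: shows how the shell treats an unquoted wildcard.
# The directory and file names are invented for this illustration.
old_dir=$PWD
demo_dir=$(mktemp -d)
touch "$demo_dir/a.html" "$demo_dir/b.html"
cd "$demo_dir"

# Unquoted: the shell replaces *.html with the matching local file names
# before wget would ever see the pattern.
set -- -A *.html
unquoted_count=$#
unquoted_args="$*"

# Quoted: the pattern is passed through literally as a single argument.
set -- -A "*.html"
quoted_count=$#
quoted_args="$*"

echo "unquoted: $unquoted_count args: $unquoted_args"
echo "quoted:   $quoted_count args: $quoted_args"

# Clean up the temporary directory.
cd "$old_dir"
rm -rf "$demo_dir"
```

In the unquoted case wget would receive a.html and b.html instead of the pattern, which is why the problem only shows up when the current directory happens to contain matching files.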
# -r : recursive
# -nH : Disable generation of host-prefixed directories
# -nd : all files will get saved to the current directory
# -np : Do not ever ascend to the parent directory when retrieving recursively.
# -R : do not download files matching these patterns
# -A : download only files matching these patterns (*.html in this case)
For instance:
wget -r -nH -nd -np -A "*.html" -R "*.gz, *.tar" http://www1.ncdc.noaa.gov/pub/data/noaa/1990/
-A *.html is an error: it will get expanded by the shell. It should be -A "*.html". – Casemaker
An unquoted * may be intercepted by the shell if a local file matches the pattern, so it should be quoted. – Reeta
Worked example to download all files excluding archives:
wget -r -k -l 7 -E -nc \
-R "*.gz, *.tar, *.tgz, *.zip, *.pdf, *.tif, *.bz, *.bz2, *.rar, *.7z" \
-erobots=off \
--user-agent="Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36" \
http://misis.ru/
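A long reject list like the one above is easy to mistype, so it can help to build it from a plain list of extensions. This is only a sketch: the variable names and the shortened extension list are hypothetical, http://example.com/ is a placeholder, and it assumes a shell that word-splits unquoted variables (sh or bash). The last line only prints the command instead of running it.

```shell
# Hypothetical, shortened list of extensions to reject.
exts="gz tar tgz zip"

# Join the extensions into a comma-separated list of *.ext patterns.
# The pattern sits inside double quotes, so the shell does not expand it.
reject_list=
for ext in $exts; do
    reject_list="${reject_list:+$reject_list,}*.$ext"
done

# Dry run: print the wget command rather than executing it.
url="http://example.com/"   # placeholder URL
printf 'wget -r -np -R "%s" %s\n' "$reject_list" "$url"
```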
This is what I get from wget --help:
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
--accept-regex=REGEX regex matching accepted URLs.
--reject-regex=REGEX regex matching rejected URLs.
--regex-type=TYPE regex type (posix|pcre).
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
--ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
--trust-server-names use the name specified by the redirection
url last component.
-X, --exclude-directories=LIST list of excluded directories.
-np, --no-parent don't ascend to the parent directory.
So you can use -R or --reject to reject extensions this way:
wget -R "index.html,*.tiff,*.pdf,*.jpg" http://example.com/
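One detail worth checking when writing the short option: with getopt-style parsing, whatever follows the option letter becomes the value, so an equals sign placed directly after -R ends up inside the first pattern, whereas the long form --reject=LIST treats it as a separator. The sketch below uses the shell's own getopts builtin as a stand-in for wget's parser. That is an assumption, and parse_reject is a hypothetical helper, so verify the behavior against your wget version.

```shell
# Stand-in parser: the shell's getopts builtin, not wget itself.
# parse_reject is a hypothetical helper that prints the value given to -R.
parse_reject() {
    OPTIND=1
    reject_list=
    while getopts "R:" opt "$@"; do
        case $opt in
            R) reject_list=$OPTARG ;;
        esac
    done
    printf '%s\n' "$reject_list"
}

with_equals=$(parse_reject -R="index.html,*.jpg")
with_space=$(parse_reject -R "index.html,*.jpg")

echo "with '=':   $with_equals"   # the first pattern starts with '='
echo "with space: $with_space"
```

If wget parses -R the same way, the first pattern in the list would silently stop matching, while the remaining patterns would still work.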
In my case, here is the final command I used to recursively download/update non-HTML files from an indexed website directory:
wget -N -r -np -nH --cut-dirs=3 -nv -R "*.htm*,*.html" http://example.com/1/2/3/
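The help output above also lists --reject-regex, which matches against the complete URL rather than just the file name. A pattern can be tried out locally before a crawl. In this sketch the pattern and the sample URLs are made up, and grep -E is only an approximation of wget's default POSIX regex type.

```shell
# Hypothetical pattern: reject URLs ending in .jpg, .jpeg or .png,
# optionally followed by a query string.
reject_re='\.(jpe?g|png)(\?.*)?$'

# Sample URLs, invented for the illustration.
urls='http://example.com/index.html
http://example.com/img/photo.jpg
http://example.com/img/logo.png?v=2
http://example.com/docs/page.html'

# grep -E approximates wget's default POSIX regex type; -v keeps the
# URLs that the pattern would NOT reject.
kept=$(printf '%s\n' "$urls" | grep -Ev "$reject_re")
printf '%s\n' "$kept"

# The same pattern could then be passed to wget, for example:
# wget -r -np --reject-regex "$reject_re" http://example.com/
```

Only the two .html URLs survive the filter. A regex is handy when image URLs carry query strings, which suffix lists such as -R jpg,png may not catch.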
wget *.html -m http://web.123.org ? Offtopic question. – Gerik