Using wget but ignoring URL parameters
I want to download the contents of a website where the URLs are built as

http://www.example.com/level1/level2?option1=1&option2=2

Within the URL, only the http://www.example.com/level1/level2 part is unique for each page; the values of option1 and option2 change. In fact, every unique page can appear under hundreds of different URLs because of these variables. I am using wget to fetch all of the site's content, and because of this I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find such an option in the man page.

Recountal asked 4/11, 2014 at 13:19 Comment(4)
Let's hope that the URL without parameters still returns something useful. – Disconcert
It does. There is no difference whether or not there is anything after the question mark; it seems to track where the browser came from or something similar. – Recountal
Based on the wget man page, there is no matching against query strings with wget at this point in time. Any specific reason to use wget and not something like scrapy, or curl with a bit of shell script? – Joleen
Nope, nothing specific. I am used to using wget, but it is not a hard requirement. Any suggestions for an alternative? – Recountal
You can use --reject-regex to specify a pattern for rejecting specific URL addresses, e.g.

wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/

This will mirror the website, but it will ignore any address containing a question mark, which is useful for mirroring wiki sites.
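
The pattern can also be narrowed if some query URLs should be kept. As a sketch with hypothetical parameter names, this rejects only MediaWiki-style edit and history links while mirroring everything else:

wget --reject-regex "action=(edit|history)" -m -c --content-disposition http://example.com/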

Culley answered 23/7, 2016 at 15:45 Comment(3)
Thank you, this is the best possible wget-only solution (without involving additional tools like a filtering proxy). Each HTML page is still fetched once to parse the links, but it avoids repeatedly fetching and deleting the same link with GET parameters, such as header links in a web server file listing. – Stratocracy
Even better solution! – Recountal
What's the difference between --reject-regex "(.*)\?(.*)" and --reject-regex ".*\?.*" in this context? – Silk
wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
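
A minimal sketch of how these might be used together (assuming a wget2 build that ships both options):

wget2 -r --cut-url-get-vars --cut-file-get-vars http://example.com/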

Stonechat answered 18/12, 2021 at 15:7 Comment(0)
This does not help avoid the downloads in the first place, but for those who have already downloaded all of these files, you can quickly rename them to remove the question mark and everything after it as follows:

rename -v -n 's/[?].*//' *[?]*

The above command does a trial run and shows you how the files would be renamed. If everything looks good in the trial run, run the command again without the -n ("nono", i.e. no-action) switch.
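
On systems without the Perl rename utility, a plain shell loop can do the same job (a sketch assuming bash, and that stripping the query string causes no name collisions):

for f in *\?*; do mv -v -- "$f" "${f%%\?*}"; done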

Schick answered 29/6, 2020 at 19:43 Comment(0)
Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with page descriptions formed using dashes:

http://www.example.com/main-topic/whatever-content-in-this-page

All other URLs had references to the CMS. I got everything I needed with

wget -r http://www.example.com -A "*-*"

This did the trick. Thanks for sharing your thoughts!
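
For completeness, this accept filter could also be combined with the --reject-regex approach shown above (a sketch, assuming a wget new enough to support that option):

wget -r -A "*-*" --reject-regex "\?" http://www.example.com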

Recountal answered 4/11, 2014 at 15:17 Comment(1)
Glad this worked for you, but it isn't a solution to your original question, "Is there a way to tell wget to ignore everything behind the URL's question mark?" kenorb provided the best solution for anyone else who encounters this issue. – Stratocracy
@kenorb's answer using --reject-regex is good, but it did not work in my case with an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:

wget --reject "*\?*" -m -c --content-disposition http://example.com/

Lacker answered 20/7, 2020 at 20:42 Comment(0)
