Why does curl not work, but wget works?

I am using both curl and wget to get this url: http://opinionator.blogs.nytimes.com/2012/01/19/118675/

curl returns no output at all, while wget returns the entire HTML source.

Here are the two commands. I've used the same user agent for both, both come from the same IP, and both follow redirects. The URL is exactly the same. curl returns after about 1 second, so I know it's not a timeout issue.

curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1

wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 

If the NY Times is cloaking and not returning the source to curl, what could be different in the headers curl is sending? I assumed that since the user agent is the same, the two requests would look exactly the same. What other "footprints" should I check?

Polynuclear answered 8/1, 2014 at 3:18 Comment(6)
Would this one help you? #8299227 - Stomatal
Doesn't help at all :( - Polynuclear
I suggest adding the -v flag to your curl request to show everything that is going on. The -d flag added to your wget request shows what happens during the successful wget request. Both programs are redirected to a login page, but wget then retrieves the target resource, whereas curl is redirected again and again until it gets a bad redirection and gives up. From a brief look at the output, wget is properly sending cookies back to nytimes.com while curl never sends any cookies back. - Freaky
Try using -c cookie.txt with your curl and optionally use -b RMID. - Hulbert
Thanks, sending cookies works. What does -b RMID do? - Polynuclear
It forces sending an empty RMID cookie. - Hulbert

The way to solve this is to analyze your curl request with curl -v ... and your wget request with wget -d ... The output shows that curl is redirected to a login page:

> GET /2012/01/19/118675/ HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
> Host: opinionator.blogs.nytimes.com
> Accept: */*
> 
< HTTP/1.1 303 See Other
< Date: Wed, 08 Jan 2014 03:23:06 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.nytimes.com/glogin?URI=http://opinionator.blogs.nytimes.com/2012/01/19/118675/&OQ=_rQ3D0&OP=1b5c69eQ2FCinbCQ5DzLCaaaCvLgqCPhKP
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

followed by a loop of redirections (which you must have noticed, because you have already set the --max-redirs flag).

On the other hand, wget follows the same sequence, except that it sends the cookie set by nytimes.com back with its subsequent request(s):

---request begin---
GET /2012/01/19/118675/?_r=0 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: */*
Host: opinionator.blogs.nytimes.com
Connection: Keep-Alive
Cookie: NYT-S=0MhLY3awSMyxXDXrmvxADeHDiNOMaMEZFGdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI

The request sent by curl never includes the cookie.
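
To make that difference easier to spot in long debug logs, here is a rough sketch of a filter that keeps only the redirect and cookie headers. The function name cookie_headers is made up for illustration. It assumes curl's -v format (lines prefixed with "> " or "< ") and wget's -d format (unprefixed header lines), as shown in the excerpts above.

```shell
# Hypothetical helper (the name is illustrative, not part of curl or wget):
# keep only redirect and cookie headers from a debug log read on stdin.
cookie_headers() {
  # curl -v prefixes header lines with "> " or "< "; wget -d prints them bare.
  grep -iE '^([<>] )?(Location|Set-Cookie|Cookie):'
}

# Both tools write their debug output to stderr, so merge it into stdout first:
#   curl -v -L -s -o /dev/null "$url" 2>&1 | cookie_headers
#   wget -d -O /dev/null "$url" 2>&1 | cookie_headers
```

If the wget output shows Cookie: lines in its follow-up requests and the curl output shows none, you are looking at the same problem as here. Note that wget -d only prints debug information if wget was built with debug support.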

The easiest way I see to modify your curl command and obtain the desired resource is to add -c cookiefile to it. This turns on curl's cookie handling, so curl sends the needed cookie(s) with the subsequent requests in the redirect chain, and writes them to the "cookie jar" file called "cookiefile" when it finishes. The file itself is not otherwise needed here.

For example, I added the flag -c x directly after "curl " and got the same output as wget (except that wget writes it to a file and curl prints it on STDOUT).
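
If you do this often, the cookie-jar fix can be wrapped in a small function. This is only a sketch: the name fetch_with_cookies, the temporary jar file, and the redirect limit of 20 are my own choices, not something from the question.

```shell
# Hypothetical wrapper (name and redirect limit are illustrative assumptions).
fetch_with_cookies() {
  url=$1
  jar=$(mktemp) || return 1   # throwaway cookie jar file

  # -c enables curl's cookie engine, so a cookie set by one response in the
  # redirect chain is sent with the following requests; the jar file is
  # written when curl finishes.
  curl -s -L --max-redirs 20 -c "$jar" \
       -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" \
       "$url"
  status=$?

  rm -f "$jar"                # the jar is not needed afterwards
  return "$status"
}
```

It would be called as fetch_with_cookies "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" > page.html, which writes the page to a file the way wget does by default.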

Freaky answered 8/1, 2014 at 3:39 Comment(1)
-v is usually very helpful - Windsail

In my case, the cause was the https_proxy environment variable: curl needs the port to be set in the proxy URL. For example:

Does not work with curl: https_proxy=http://proxyapp.net.com/

Works with curl: https_proxy=http://proxyapp.net.com:80/

wget works with or without the port in the URL, but curl needs it. If the port is not set, curl returns the error "(56) Proxy CONNECT aborted".

The verbose output of curl -v shows that curl uses port 1080 by default when no port is given in the proxy URL.
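
As a rough sketch of a workaround, a helper like the following could add an explicit port to the proxy URL before it is exported. The function name ensure_proxy_port is hypothetical, and it deliberately ignores proxy URLs with user:password@ credentials or IPv6 literals, which it would misread as already having a port.

```shell
# Hypothetical helper: print the proxy URL with an explicit port.
# Assumption: no "user:pass@" part and no IPv6 literal in the URL.
ensure_proxy_port() {
  url=$1
  port=$2

  # Assume plain http when no scheme is given.
  case $url in
    *://*) ;;
    *) url="http://$url" ;;
  esac

  scheme=${url%%://*}         # e.g. "http"
  rest=${url#*://}            # e.g. "proxyapp.net.com/"
  hostport=${rest%%/*}        # e.g. "proxyapp.net.com"
  path=${rest#"$hostport"}    # e.g. "/" (may be empty)

  case $hostport in
    *:*) printf '%s\n' "$url" ;;   # a port is already present
    *)   printf '%s://%s:%s%s\n' "$scheme" "$hostport" "$port" "$path" ;;
  esac
}

# Example:
#   export https_proxy=$(ensure_proxy_port "$https_proxy" 80)
```

The port to pass in depends on your proxy; 80 is just the value from the example above.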

Emelda answered 17/3, 2022 at 8:7 Comment(0)
