Why does curl not work, but wget works?

Asked 8/1, 2014 at 3:18 Answered 17/3, 2022 at 8:7

I am using both curl and wget to get this url: http://opinionator.blogs.nytimes.com/2012/01/19/118675/

curl returns no output at all, while wget returns the entire HTML source.

Here are the two commands. Both use the same user agent, come from the same IP, follow redirects, and request exactly the same URL. curl returns after about 1 second, so I know it's not a timeout issue.

curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1

wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"

If the NY Times is cloaking and not returning the source to curl, what could be different in the headers curl is sending? I assumed that since the user agent is the same, the two requests would look exactly the same. What other "footprints" should I check?

Polynuclear answered 8/1, 2014 at 3:18 Comment(6)

Would this one help you? #8299227 – Stomatal 8/1, 2014 at 3:21

Doesn't help at all :( – Polynuclear 8/1, 2014 at 3:31

I suggest adding the -v flag to your curl request to show you everything that is going on. The -d flag added to your wget request shows you what is happening with the successful wget request. Both programs are redirected to a login page, but somehow wget successfully retrieves the target resource, but curl is continuously redirected until it gets a bad redirection and gives up. From a brief look at the output, it looks like wget is properly sending cookies back to nytimes.com while curl is never sending any cookies back. – Freaky 8/1, 2014 at 3:36

Try using -c cookie.txt with your curl and optionally use -b RMID. – Hulbert 8/1, 2014 at 3:41

Thanks, sending cookies works. What does -b RMID do? – Polynuclear 8/1, 2014 at 3:42

It forces sending an empty RMID cookie. – Hulbert 8/1, 2014 at 3:43

The way to solve this is to analyze your curl request with curl -v ... and your wget request with wget -d ..., which shows that curl is redirected to a login page:

> GET /2012/01/19/118675/ HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
> Host: opinionator.blogs.nytimes.com
> Accept: */*
> 
< HTTP/1.1 303 See Other
< Date: Wed, 08 Jan 2014 03:23:06 GMT
* Server Apache is not blacklisted
< Server: Apache
< Location: http://www.nytimes.com/glogin?URI=http://opinionator.blogs.nytimes.com/2012/01/19/118675/&OQ=_rQ3D0&OP=1b5c69eQ2FCinbCQ5DzLCaaaCvLgqCPhKP
< Content-Length: 0
< Content-Type: text/plain; charset=UTF-8

followed by a loop of redirections (which you must have noticed, because you have already set the --max-redirs flag).

On the other hand, wget follows the same sequence, except that it sends the cookie set by nytimes.com back with its subsequent request(s):

---request begin---
GET /2012/01/19/118675/?_r=0 HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: */*
Host: opinionator.blogs.nytimes.com
Connection: Keep-Alive
Cookie: NYT-S=0MhLY3awSMyxXDXrmvxADeHDiNOMaMEZFGdeFz9JchiAIUFL2BEX5FWcV.Ynx4rkFI

The request sent by curl never includes the cookie.
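One way to line up the two traces is to filter each tool's debug output down to the redirect and cookie headers. This is only a sketch for a POSIX shell; header_lines is a hypothetical helper name, not part of curl or wget, and the --max-redirs value in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: keep only redirect and cookie headers from a debug
# trace read on stdin. "Cookie:" also matches "Set-Cookie:" lines.
header_lines() {
    grep -iE '(Location|Cookie):'
}

# Usage (needs network access, so it is left as a comment):
#   url="http://opinionator.blogs.nytimes.com/2012/01/19/118675/"
#   curl -v -L -s -o /dev/null --max-redirs 10 "$url" 2>&1 | header_lines
#   wget -d -O /dev/null "$url" 2>&1 | header_lines
```

If the wget trace shows Cookie: request headers and the curl trace shows only Set-Cookie: response headers, the difference is the one described above.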

The easiest way I see to modify your curl command and obtain the desired resource is to add -c cookiefile. This turns on curl's cookie handling and writes the received cookies to a "cookie jar" file called "cookiefile", so curl sends the needed cookie(s) with its subsequent requests.

For example, I added the flag -c x directly after "curl " and I obtained the output just like from wget (except that wget writes it to a file and curl prints it on STDOUT).
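As a rough sketch of that fix, the function below wraps the curl call with a temporary cookie jar. The name fetch_with_cookies, the use of mktemp for the jar, and the redirect limit of 10 are assumptions made for illustration; only the -c flag itself comes from the answer.

```shell
# Sketch: fetch a URL with curl's cookie engine enabled (-c), so cookies set
# during the redirect chain are sent back on the following requests.
fetch_with_cookies() {
    url=$1
    jar=$(mktemp) || return 1   # temporary cookie jar file (assumed location)
    curl -L -s -c "$jar" --max-redirs 10 "$url"
    status=$?
    rm -f "$jar"                # the jar is only needed for this one run
    return "$status"
}

# Usage (needs network access, so it is left as a comment):
#   fetch_with_cookies "http://opinionator.blogs.nytimes.com/2012/01/19/118675/"
```

Keeping the jar in a temporary file avoids leaving a stray cookie file in the working directory; pass a fixed path to -c instead if the cookies should persist between runs.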

Freaky answered 8/1, 2014 at 3:39 Comment(1)

-v is usually very helpful – Windsail 6/6, 2016 at 8:44

In my case, the cause was that curl needs the port to be set in the URL of the https_proxy environment variable. For example:

Does not work with curl: https_proxy=http://proxyapp.net.com/

Works with curl: https_proxy=http://proxyapp.net.com:80/

wget works with or without the port in the URL, but curl needs it. If the port is not set, curl returns the error "(56) Proxy CONNECT aborted".

The verbose output of curl -v shows that curl uses port 1080 by default when no port is set in the proxy URL.
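A minimal sketch of the working setup, assuming the example proxy host from above and a hypothetical wrapper name curl_via_proxy. It sets https_proxy with an explicit port for a single curl invocation and uses -v so the proxy host and port that curl connects to are visible in the trace.

```shell
# Sketch: run curl through the proxy with the port written out explicitly.
# The proxy URL is the example value from this answer, not a real endpoint.
curl_via_proxy() {
    url=$1
    https_proxy="http://proxyapp.net.com:80/" curl -v -s -o /dev/null "$url"
}

# Usage (needs network access and a reachable proxy, so it is left as a comment):
#   curl_via_proxy "https://example.com/"
```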

Emelda answered 17/3, 2022 at 8:7 Comment(0)
