I am using both curl and wget to fetch this URL: http://opinionator.blogs.nytimes.com/2012/01/19/118675/
curl returns no output at all, but wget returns the entire HTML source.
Here are the two commands. I've used the same user agent for both, both requests come from the same IP, and both follow redirects. The URL is exactly the same. curl returns after about 1 second, so I know it's not a timeout issue.
curl -L -s "http://opinionator.blogs.nytimes.com/2012/01/19/118675/" --max-redirs 10000 --location --connect-timeout 20 -m 20 -A "Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" 2>&1
wget http://opinionator.blogs.nytimes.com/2012/01/19/118675/ --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
If the NY Times is cloaking and not returning the source to curl, what could be different in the headers curl is sending? I assumed that since the user agent is the same, the two requests would look exactly the same. What other "footprints" should I check?
Add the -v flag to your curl request to show everything that is going on. The -d flag added to your wget request shows what is happening with the successful wget request. Both programs are redirected to a login page, but wget somehow retrieves the target resource, while curl is redirected continuously until it gets a bad redirection and gives up. From a brief look at the output, it looks like wget is properly sending cookies back to nytimes.com while curl never sends any cookies back. – Freaky
Use -c cookie.txt with your curl, and optionally use -b RMID. – Hulbert
[...] RMID cookie. – Hulbert
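The suggestions in the comments can be sketched as a small shell script. This is only a sketch and has not been checked against the live site. The function names `show_redirect_headers` and `fetch_with_cookies` are hypothetical, the `--max-redirs 20` limit is an arbitrary choice, and `cookie.txt` is the file name from the comment. It relies on curl keeping no cookies by default: passing `-c` (or `-b`) switches on its cookie engine, so cookies set during one redirect hop are sent back on the next, which wget does on its own.

```shell
#!/bin/sh
# URL and user agent are taken from the question; cookie_jar is the file
# name suggested in the comments.
url="http://opinionator.blogs.nytimes.com/2012/01/19/118675/"
ua="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1"
cookie_jar="cookie.txt"

# Hypothetical helper: print the status lines, Location and Set-Cookie
# response headers, and any Cookie request headers for each redirect hop.
# curl -v writes this trace to stderr, hence the 2>&1 before the pipe.
show_redirect_headers() {
    curl -v -L -s -o /dev/null -A "$ua" --max-redirs 20 -m 20 "$1" 2>&1 |
        grep -i -E '^(< (HTTP|Location|Set-Cookie)|> Cookie)'
}

# Hypothetical helper: fetch a page with the cookie engine enabled.
# -c saves received cookies to the jar when curl exits;
# -b loads cookies saved by an earlier run, if the file exists.
fetch_with_cookies() {
    curl -L -s -A "$ua" -c "$cookie_jar" -b "$cookie_jar" \
        --max-redirs 20 -m 20 "$1"
}
```

Running `show_redirect_headers "$url"` first should show whether any `> Cookie` lines appear between hops; `fetch_with_cookies "$url" > page.html` then repeats the request with cookies enabled. The wget side can be inspected the same way with `wget -d`.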