WebRequest "HEAD" light weight alternative

I recently discovered that the following does not work with certain sites, such as IMDB.com.

using System;
using System.Net;

class Program
{
    static void Main(string[] args)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create("http://www.imdb.com"); // or args[0]

            request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/0.2.153.1 Safari/525.19";
            request.Timeout = 1000; // milliseconds
            request.Method = "HEAD";

            // A HEAD response carries no body, so print the status line
            // instead of reading the (empty) response stream.
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine((int)response.StatusCode + " " + response.StatusDescription);
            }
        }
        catch (Exception ex)
        {
            // Non-success statuses such as 405 surface here as a WebException.
            Console.WriteLine(ex.Message);
        }
    }
}

It returns an HTTP 405 (Method Not Allowed). My problem is that I use code very similar to the above to check whether a link is valid, and the vast majority of the time it works correctly. I can switch the method to GET and it works (with an increased timeout), but this slows things down by an order of magnitude. I am assuming the 405 response comes from the configuration of IMDB's server.

Is there a way for me to do the same thing as above in a lightweight manner in .NET? Or is there a way to fix the above code so that it works with IMDB as a GET request?

Manns answered 18/3, 2011 at 15:1 Comment(2)
I had to increase the timeout, but the code you posted above works for me. Changing it to POST would make no sense, because you don't have any data to post. And your title talks about HEAD, but you're not doing a HEAD request. Please clarify what the question is, since your "broken" code works fine. - Harkins
Ugh, really stupid typo in the title. Fixed now; a classic example of thinking one thing and typing another. When you run the above code, you aren't getting a 405 response? EDIT: OK, I realized even my code was flawed. The above is what I meant to post, and it is now edited to give the 405 error (and to make sense). - Manns
D
4

You'll have to clarify what you mean by "lightweight". What are you trying to accomplish?

Whether or not you can use GET/POST/HEAD/DELETE/etc will depend on the URL and what's configured in the application that is running on the server at that URL.

If all you're trying to do is see whether you can make a connection without actually downloading the content, you could try just opening a connection to port 80 using sockets, but there isn't really a reliable or universally supported way to do it just by changing the HTTP method.

Diatessaron answered 18/3, 2011 at 15:41 Comment(4)
Well, essentially I am using HEAD requests right now to a) check whether a site actually exists, and b) if the site exists, verify that each link within it actually exists (so each image, style sheet, etc.). On some image-heavy pages, that could literally mean hundreds of calls. So by lightweight I mostly mean network traffic. - Manns
Right... the only more lightweight method I can think of in terms of bandwidth would be to use sockets to construct your HTTP requests manually, read back just enough of the response to determine the HTTP status code, and then close the connection. - Diatessaron
Would going the route of hand-crafted HTTP actually circumvent the 405 error results? EDIT: Er, status results, I should have said; I suppose technically HTTP 405 isn't actually an error. Only a handful of sites are returning 405, and I don't actually know which part is causing that response. Right now I am assuming it's the HEAD request, but I am not sure. - Manns
The HEAD request is what would be causing the issue. What I mean by a hand-crafted HTTP request is that you'd use a GET, which is what the server expects, but since you control what you download, you can read just the response headers and then terminate the connection before downloading the body. - Diatessaron
I
6

Open the connection yourself with a socket (instead of an HttpWebRequest or WebClient), and close the stream as soon as you've read the status code. Fortunately, the status code is on the very first line of the response :)
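
As a rough illustration of the socket approach, the sketch below sends a bare GET over plain HTTP on port 80, reads only the status line, and then closes the connection. The names `StatusProbe`, `ParseStatusCode`, and `GetStatusCode` are hypothetical, and the sketch assumes an unencrypted HTTP endpoint; an HTTPS site would need an `SslStream` layered on top, which is not shown.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

static class StatusProbe
{
    // Extracts the numeric code from a status line such as "HTTP/1.1 200 OK".
    // Returns -1 if the line does not look like an HTTP status line.
    public static int ParseStatusCode(string statusLine)
    {
        if (string.IsNullOrEmpty(statusLine) ||
            !statusLine.StartsWith("HTTP/", StringComparison.Ordinal))
        {
            return -1;
        }

        string[] parts = statusLine.Split(' ');
        int code;
        return parts.Length >= 2 && int.TryParse(parts[1], out code) ? code : -1;
    }

    // Sends a minimal GET and reads only the first line of the response.
    // The timeouts apply to sending and receiving, not to the initial connect.
    public static int GetStatusCode(string host, string path, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            client.SendTimeout = timeoutMs;
            client.ReceiveTimeout = timeoutMs;
            client.Connect(host, 80);

            using (NetworkStream stream = client.GetStream())
            {
                string request =
                    "GET " + path + " HTTP/1.1\r\n" +
                    "Host: " + host + "\r\n" +
                    "Connection: close\r\n\r\n";
                byte[] bytes = Encoding.ASCII.GetBytes(request);
                stream.Write(bytes, 0, bytes.Length);

                using (var reader = new StreamReader(stream, Encoding.ASCII))
                {
                    // Leaving the using blocks closes the socket, so the
                    // rest of the response is never read.
                    return ParseStatusCode(reader.ReadLine());
                }
            }
        }
    }
}
```

A call such as `StatusProbe.GetStatusCode("www.imdb.com", "/", 5000)` would return whatever status the server sends for a plain GET. Note that the server may already have pushed part of the body into the TCP buffers before the connection closes, so the bandwidth saving is real but not exact.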

Injector answered 18/3, 2011 at 15:54 Comment(0)
H
4

If HEAD returns a 405, that means the server doesn't support HEAD (at least for that URL) and you'll have to fall back to GET instead. The majority of sites should support HEAD, so you probably want to use HEAD by default, but if it returns a 405, you could fall back to GET for that domain. Or maybe you want to try HEAD first for each request; YMMV.
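
A minimal sketch of that HEAD-first, GET-on-405 strategy with `HttpWebRequest` might look like the following. The `LinkChecker` class and its method names are hypothetical, and treating 501 (Not Implemented) the same way as 405 is an assumption that goes beyond what is described above.

```csharp
using System;
using System.Net;

static class LinkChecker
{
    // Statuses that suggest the server rejected the method, not the URL.
    // Including 501 here is an assumption, not something the answer states.
    public static bool ShouldRetryWithGet(HttpStatusCode status)
    {
        return status == HttpStatusCode.MethodNotAllowed   // 405
            || status == HttpStatusCode.NotImplemented;    // 501
    }

    // Issues one request and returns its status code, or null when no HTTP
    // response arrived at all (DNS failure, timeout, refused connection).
    static HttpStatusCode? Probe(string url, string method, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = method;
        request.Timeout = timeoutMs;
        try
        {
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // 4xx/5xx responses are reported as exceptions; the response,
            // if there was one, is attached to the exception.
            using (var response = ex.Response as HttpWebResponse)
            {
                return response != null ? response.StatusCode : (HttpStatusCode?)null;
            }
        }
    }

    // Tries HEAD first and repeats the request as GET only if HEAD was refused.
    public static HttpStatusCode? Check(string url, int timeoutMs)
    {
        HttpStatusCode? status = Probe(url, "HEAD", timeoutMs);
        if (status.HasValue && ShouldRetryWithGet(status.Value))
        {
            status = Probe(url, "GET", timeoutMs);
        }
        return status;
    }
}
```

The GET response is disposed without reading its body; how much of the body is actually transferred before that happens depends on the framework and the server. Remembering which domains refused HEAD, as suggested above, could be layered on top with a simple set of host names.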

If the server requires GET and you want to reduce network traffic, you could try doing a conditional GET and/or a partial GET (see e.g. RFC 2616). I've never tried doing those with WebRequest, but I think it lets you add custom outgoing HTTP headers, so you should be able to do it.
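
For the partial GET, one possible sketch with `HttpWebRequest` is shown below. It asks for the first byte only via a `Range` header. The `PartialGet` class and its method names are hypothetical, and a server is free to ignore the range and answer 200 with the full body, so the response should still be disposed without reading it.

```csharp
using System;
using System.Net;

static class PartialGet
{
    // Builds a GET that asks for the first byte only (Range: bytes=0-0).
    public static HttpWebRequest CreateFirstByteRequest(string url, int timeoutMs)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Timeout = timeoutMs;
        // Range is a restricted header on HttpWebRequest, so it is set
        // through AddRange rather than through the Headers collection.
        request.AddRange(0, 0);
        return request;
    }

    // Treats 200 (range ignored) and 206 (range honoured) as "link exists".
    public static bool IsReachable(HttpStatusCode status)
    {
        return status == HttpStatusCode.OK
            || status == HttpStatusCode.PartialContent;
    }
}
```

The request would then be sent with `GetResponse()` inside a `using` block, with the returned `StatusCode` passed to `IsReachable`. A conditional GET is the other option mentioned above; it only helps when a timestamp from an earlier successful fetch is available to send.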

Also, don't forget that, if you're writing a spider (which you clearly are), you should respect the server's robots.txt, and it's also courteous to throttle your requests to something like one request every two seconds, so you don't slashdot the server.

Harkins answered 18/3, 2011 at 16:2 Comment(1)
Thank you for the response. I'm not actually writing a spider; the end product is closer in nature to a web browser than anything else. I did as you suggested earlier (a HEAD request, then a full GET on 405), which is my current way of doing things, but it is sub-optimal. I will look into partial GETs; that would probably be perfect. Thanks. - Manns
