How can I perform a GET request without downloading the content?
N

3

28

I am working on a link checker. In general I can perform HEAD requests; however, some sites seem to disable this verb, so on failure I also need to perform a GET request (to double-check that the link is really dead).

I use the following code as my link tester:

public class ValidateResult
{
  public HttpStatusCode? StatusCode { get; set; }
  public Uri RedirectResult { get; set; }
  public WebExceptionStatus? WebExceptionStatus { get; set; }
}


public ValidateResult Validate(Uri uri, bool useHeadMethod = true, 
            bool enableKeepAlive = false, int timeoutSeconds = 30)
{
  ValidateResult result = new ValidateResult();

  HttpWebRequest request = WebRequest.Create(uri) as HttpWebRequest;
  if (useHeadMethod)
  {
    request.Method = "HEAD";
  }
  else
  {
    request.Method = "GET";
  }

  // always compress, if you get back a 404 from a HEAD it can be quite big.
  request.AutomaticDecompression = DecompressionMethods.GZip;
  request.AllowAutoRedirect = false;
  request.UserAgent = UserAgentString;
  request.Timeout = timeoutSeconds * 1000;
  request.KeepAlive = enableKeepAlive;

  HttpWebResponse response = null;
  try
  {
    response = request.GetResponse() as HttpWebResponse;

    result.StatusCode = response.StatusCode;
    if (response.StatusCode == HttpStatusCode.Redirect ||
      response.StatusCode == HttpStatusCode.MovedPermanently ||
      response.StatusCode == HttpStatusCode.SeeOther)
    {
      try
      {
        // resolve a possibly relative Location header against the requested uri
        Uri targetUri = new Uri(uri, response.Headers["Location"]);
        var scheme = targetUri.Scheme.ToLower();
        if (scheme == "http" || scheme == "https")
        {
          result.RedirectResult = targetUri;
        }
        else
        {
          // this little gem was born out of http://tinyurl.com/18r 
          // redirecting to about:blank
          result.StatusCode = HttpStatusCode.SwitchingProtocols;
          result.WebExceptionStatus = null;
        }
      }
      catch (UriFormatException)
      {
        // another gem... people sometimes redirect to http://nonsense:port/yay
        result.StatusCode = HttpStatusCode.SwitchingProtocols;
        result.WebExceptionStatus = WebExceptionStatus.NameResolutionFailure;
      }

    }
  }
  catch (WebException ex)
  {
    result.WebExceptionStatus = ex.Status;
    response = ex.Response as HttpWebResponse;
    if (response != null)
    {
      result.StatusCode = response.StatusCode;
    }
  }
  finally
  {
    if (response != null)
    {
      response.Close();
    }
  }

  return result;
}

This all works fine and dandy, except that when I perform a GET request, the entire payload gets downloaded (I watched this in Wireshark).

Is there any way to configure the underlying ServicePoint or the HttpWebRequest not to buffer or eager load the response body at all?

(If I were hand coding this I would set the TCP receive window really low, and then only grab enough packets to get the Headers, stop acking TCP packets as soon as I have enough info.)

For those wondering what this is meant to achieve: I do not want to download a 40k response body every time I get a 404. Doing this a few hundred thousand times is expensive on the network.

Nitrification answered 25/5, 2012 at 3:24 Comment(4)
note, even though hand coding the HTTP version is fairly simple, the HTTPS one scares me a bit. (perhaps there is an OS library that does this already?)Nitrification
Try a partial download. It is possible to download just a range with the Range HTTP header.Schistosome
@Schistosome Content-Range may be ok for HTTP 1.1 servers that have the content, but if you get a 404 it would still be sent back completelyNitrification
I sympathise with you as I wrote a link-checker and faced the same problem. Certain well known domains, such as Wikipedia and IMDB, inexplicably disallow HEAD requests. Never found an adequate solution, I'm afraid!Prevision
B
10

When you do a GET, the server will start sending data from the start of the file to the end unless you interrupt it. Granted, at 10 Mb/sec that's roughly a megabyte per second, so if the file is small you'll get the whole thing anyway. You can minimize the amount you actually download in a couple of ways.

First, you can call request.Abort after getting the response and before calling response.Close. That will ensure that the underlying code doesn't try to download the whole thing before closing the response. Whether this helps on small files, I don't know. I do know that it will prevent your application from hanging when it's trying to download a multi-gigabyte file.
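
As a rough sketch of that ordering, assuming the same HttpWebRequest setup as in the question: AbortingProbe and GetStatusWithoutBody are names made up for illustration, and how much of a small body still arrives before the abort takes effect is not something this shows.

```csharp
using System;
using System.Net;

public static class AbortingProbe
{
    // Hypothetical helper: returns the status code of a GET without reading the body.
    // Returns null when no HTTP response was received at all (DNS failure, refused connection, timeout).
    public static HttpStatusCode? GetStatusWithoutBody(Uri uri, int timeoutSeconds = 30)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "GET";
        request.AllowAutoRedirect = false;
        request.Timeout = timeoutSeconds * 1000;

        HttpWebResponse response = null;
        try
        {
            // GetResponse returns once the headers are in; the body has not been read yet.
            response = (HttpWebResponse)request.GetResponse();
            return response.StatusCode;
        }
        catch (WebException ex)
        {
            // 4xx and 5xx responses arrive here, with the response attached to the exception.
            response = ex.Response as HttpWebResponse;
            return response != null ? response.StatusCode : (HttpStatusCode?)null;
        }
        finally
        {
            // Abort before Close, so that Close does not try to drain the rest of the body.
            request.Abort();
            if (response != null)
            {
                response.Close();
            }
        }
    }
}
```

The status code is read before the abort, so the result is unaffected; only the remaining body is discarded.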

The other thing you can do is request a range, rather than the entire file. See the AddRange method and its overloads. You could, for example, write request.AddRange(0, 511), which asks for only the first 512 bytes of the file. (The one-argument form request.AddRange(512) means something different: it asks for everything from byte 512 to the end.) This depends, of course, on the server supporting range queries. Most do. But then, most support HEAD requests, too.
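
A minimal sketch of such a ranged request. It uses the two-argument AddRange(from, to) overload so that both ends of the range are explicit; CreateRangedGet is a hypothetical helper name, and 512 bytes is an arbitrary choice.

```csharp
using System;
using System.Net;

public static class RangedProbe
{
    // Hypothetical helper: builds (but does not send) a GET that asks for the first 512 bytes only.
    public static HttpWebRequest CreateRangedGet(Uri uri)
    {
        var request = (HttpWebRequest)WebRequest.Create(uri);
        request.Method = "GET";
        request.AllowAutoRedirect = false;
        // Byte offsets are inclusive, so this sends "Range: bytes=0-511".
        request.AddRange(0, 511);
        return request;
    }
}
```

A server that honors the header should reply with 206 (HttpStatusCode.PartialContent). One that ignores it will usually reply 200 and start sending the full body, and error pages such as a 404 are generally sent whole either way, so calling request.Abort afterwards is still worthwhile.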

You'll probably end up having to write a method that tries things in sequence:

  • try to do a HEAD request. If that works (i.e. doesn't return a 500), then you're done
  • try GET with a range query. If that doesn't return a 500, then you're done.
  • do a regular GET, with a request.Abort after GetResponse returns.
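
The sequence above could be wired together roughly like this. This is only a sketch: LinkProbe, IsAlive and CheckInSequence are hypothetical names, the individual probes are passed in as delegates so any HEAD, ranged GET or aborting GET implementation can be plugged in, and the "status below 400 counts as alive" rule is an assumption taken from the asker's comment rather than from the list itself.

```csharp
using System;
using System.Net;

public static class LinkProbe
{
    // Assumption: any status below 400 means the link is alive.
    public static bool IsAlive(HttpStatusCode? status)
    {
        return status.HasValue && (int)status.Value < 400;
    }

    // Runs the probes in order and stops at the first one that reports a live link.
    // If none does, the result of the last probe is returned (null when no probes were given).
    public static HttpStatusCode? CheckInSequence(Uri uri, params Func<Uri, HttpStatusCode?>[] probes)
    {
        HttpStatusCode? last = null;
        foreach (var probe in probes)
        {
            last = probe(uri);
            if (IsAlive(last))
            {
                return last;
            }
        }
        return last;
    }
}
```

Usage would be along the lines of LinkProbe.CheckInSequence(uri, headProbe, rangedGetProbe, abortingGetProbe), where each argument is a method taking a Uri and returning a nullable HttpStatusCode.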
Bolin answered 25/5, 2012 at 14:17 Comment(3)
A call to request.Abort, early enough will cause the ACK to go back with a "FIN" flag set, this will close the connection gracefully without the client receiving a big pile of data. Only slight question mark I have is about the ability to set the client receive window size...Nitrification
there are a few critical corrections ... HEAD may return 404 yet get can return a 200. GET range query really makes little difference in wake of a functioning abort. (should be i.e. returns status code less than 400)Nitrification
"You could, for example, write request.AddRange(512), which would download only the first 512 bytes of the file." Shouldn't this be -512? MSDN states: "If range is negative, the range parameter specifies the ending point of the range. The server should start sending data from the start of the data in the HTTP entity to the range parameter specified." (msdn.microsoft.com/en-us/library/4ds43y3w)Okra
D
1

If you are using a GET request, you will receive the message body whether you want it or not. The data will still be transmitted to your endpoint regardless of whether you read it from the socket. It will just stay queued in the RecvQ, waiting to be read.

For this, you really should be using a "HEAD" request if possible, which will spare you the message body.

Dolhenty answered 25/5, 2012 at 12:14 Comment(1)
See Jim's answer, the .Abort method does work, it sets the FIN flag with the ACK, that shuts the connection gracefullyNitrification
R
-1

Couldn't you use a WebClient to open a stream and read just the few bytes you require?

using (var client = new WebClient())
{
    using (var stream = client.OpenRead(uri))
    {
        const int chunkSize = 100;
        var buffer = new byte[chunkSize];
        int bytesRead;
        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            // check response here
        }
    }
}

I am not sure how WebClient opens the stream internally. But it seems to allow partial reading of data.

Refugiorefulgence answered 25/5, 2012 at 4:23 Comment(2)
WebClient.OpenRead(...) uses the GetResponse() method internally, so this approach won't work. It will download the whole thing.Dian
Yes, I tried it too. I can't seem to find any built-in class that allows one to process partial web responses. It should have been possible at least when using async operations.Refugiorefulgence
