c# HttpWebResponse Header encoding

I am requesting a URL that I know responds with a 301 redirect.

I create the request with HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl); and set loHttp.AllowAutoRedirect = false; so that the redirect is not followed automatically.

I then read the response header to find the new URL, using loWebResponse.GetResponseHeader("Location");

The problem is that this URL contains Greek characters, and the string returned is garbled (apparently an encoding issue).

The full code:

HttpWebRequest loHttp = (HttpWebRequest)WebRequest.Create(lcUrl);
loHttp.ContentType = "application/x-www-form-urlencoded";
loHttp.Method = "GET";

loHttp.Timeout = 10000;

loHttp.AllowAutoRedirect = false;
HttpWebResponse loWebResponse = (HttpWebResponse)loHttp.GetResponse();

string url= loWebResponse.Headers["Location"];
Vyborg answered 11/12, 2009 at 15:46 Comment(2)
By default HttpWebRequest follows redirects, so if a server sends a 301/302 status code, a new request is issued to fetch the resource named in the Location header. Once that final resource is fetched there is no longer a Location header in the response, so I wonder how loWebResponse.GetResponseHeader("Location") returns anything other than an empty string. That aside, have you verified with Firebug that the site encodes the Location header correctly? - Lorrin
I didn't make it clear that 'loHttp.AllowAutoRedirect = false;' is set so that I can inspect the redirect URL. - Vyborg

If you keep the default behavior (loHttp.AllowAutoRedirect = true) and your code doesn't work (you don't get redirected to the new resource), it means that the server is not encoding the Location header correctly. Does the redirect work in the browser?

For example, if the redirect URL is http://site/Μία_Σελίδα, the Location header must look like http://site/%CE%9C%CE%AF%CE%B1_%CE%A3%CE%B5%CE%BB%CE%AF%CE%B4%CE%B1.
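As a rough illustration of what "URL-encoded" means here, the sketch below percent-encodes a single path segment as UTF-8 with Uri.EscapeDataString and decodes it again with Uri.UnescapeDataString. The class and method names (PercentEncodingSketch, EncodeSegment, DecodeSegment) are made up for this example and are not part of the original code.

```csharp
using System;

static class PercentEncodingSketch
{
    // Hypothetical helper: percent-encodes one path segment as UTF-8 bytes.
    public static string EncodeSegment(string segment)
    {
        return Uri.EscapeDataString(segment);
    }

    // Hypothetical helper: reverses the percent-encoding.
    public static string DecodeSegment(string encoded)
    {
        return Uri.UnescapeDataString(encoded);
    }

    static void Main()
    {
        string segment = "Μία_Σελίδα";
        string encoded = EncodeSegment(segment);

        // The encoded form contains only ASCII characters,
        // so it survives any ASCII-based header parser.
        Console.WriteLine("http://site/" + encoded);
        Console.WriteLine(DecodeSegment(encoded) == segment);
    }
}
```

Only the segment is escaped, not the whole URL, because EscapeDataString would also escape the "/" and ":" separators.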


UPDATE:

After investigating the issue further, I begin to suspect that there's something strange with HttpWebRequest. When the request is sent, the server returns the following response:

HTTP/1.1 301 Moved Permanently
Date: Fri, 11 Dec 2009 17:01:04 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Location: http://www.site.com/buy/κινητή-σταθερή-τηλεφωνία/c/cn69569/
Content-Length: 112
Content-Type: text/html; Charset=UTF-8
Cache-control: private
Connection: close
Set-Cookie: BIGipServerpool_webserver_gr=1007732746.36895.0000; path=/


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

As we can see, the Location header contains Greek characters that are not URL-encoded. I am not sure whether this is valid according to the HTTP specification. What we can say for sure is that a web browser interprets it correctly.

Here comes the interesting part. It seems that HttpWebRequest doesn't use UTF-8 to decode the response headers, because when it parses the Location header it produces: http://www.site.com/buy/κινηÏή-ÏÏαθεÏή-ÏηλεÏÏνία/c/cn69569/. This is of course wrong, and when it tries to follow the redirect to this location the server responds with a new redirect, and so on until the maximum number of redirects is reached and an exception is thrown.

I couldn't find any way to specify the encoding used by HttpWebRequest when parsing the response headers. If we use TcpClient manually it works perfectly fine:

using (var client = new TcpClient())
{
    client.Connect("www.site.com", 80);

    using (var stream = client.GetStream())
    {
        var writer = new StreamWriter(stream);
        writer.WriteLine("GET /default/defaultcatg.asp?catg=69569 HTTP/1.1");
        writer.WriteLine("Host: www.site.com");
        writer.WriteLine("User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090805 Shiretoko/3.5.2");
        writer.WriteLine("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        writer.WriteLine("Accept-Language: en-us,en;q=0.5");
        writer.WriteLine("Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7");
        writer.WriteLine("Connection: close");
        // A single empty line terminates the request headers.
        writer.WriteLine(string.Empty);
        writer.Flush();

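        // Decode the raw response as UTF-8 (this is also the StreamReader default).
        var reader = new StreamReader(stream, System.Text.Encoding.UTF8);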
        var response = reader.ReadToEnd();
        // When looking at the response it correctly reads 
        // Location: http://www.site.com/buy/κινητή-σταθερή-τηλεφωνία/c/cn69569/
    }
}

So I am really puzzled by this behavior. Is there any way to specify the correct encoding used by HttpWebRequest? Maybe some request header should be set?
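One possible client-side recovery is sketched below. It rests on an assumption I have not confirmed for HttpWebRequest: that each raw header byte was widened to a single char (in effect an ISO-8859-1 decode), so that no information was lost. Under that assumption the original bytes can be recovered and decoded again as UTF-8. The names LocationHeaderRepair and ReinterpretAsUtf8 are hypothetical, and the mis-decoding is simulated here instead of coming from a real response.

```csharp
using System;
using System.Text;

static class LocationHeaderRepair
{
    // Hypothetical helper. Assumption: the string was produced by mapping
    // each raw header byte to one char (an ISO-8859-1 style decode).
    public static string ReinterpretAsUtf8(string misdecoded)
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        byte[] rawBytes = latin1.GetBytes(misdecoded); // recover the original bytes
        return Encoding.UTF8.GetString(rawBytes);      // decode them as UTF-8
    }

    static void Main()
    {
        string original = "http://www.site.com/buy/κινητή-σταθερή-τηλεφωνία/c/cn69569/";

        // Simulate the assumed mis-decoding: UTF-8 bytes read as ISO-8859-1.
        byte[] wireBytes = Encoding.UTF8.GetBytes(original);
        string misdecoded = Encoding.GetEncoding("ISO-8859-1").GetString(wireBytes);

        string repaired = ReinterpretAsUtf8(misdecoded);
        Console.WriteLine(repaired == original);
    }
}
```

If the header was decoded with some other, lossy encoding, the original bytes cannot be recovered this way, so the result should be checked before it is used as a URL.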

As a workaround you could try modifying the ASP page that performs the redirect so that it URL-encodes the Location header. For example, when you call Response.Redirect(location) in an ASP.NET application, the location is automatically URL-encoded and any non-ASCII characters are converted to percent-escaped UTF-8 bytes.

For example if you do: Response.Redirect("http://www.site.com/buy/κινητή-σταθερή-τηλεφωνία/c/cn69569/"); in an ASP.NET application the Location header will be set to :

http://www.site.com/buy/%ce%ba%ce%b9%ce%bd%ce%b7%cf%84%ce%ae-%cf%83%cf%84%ce%b1%ce%b8%ce%b5%cf%81%ce%ae-%cf%84%ce%b7%ce%bb%ce%b5%cf%86%cf%89%ce%bd%ce%af%ce%b1/c/cn69569

It seems that this is not the case with classic ASP.

Lorrin answered 11/12, 2009 at 16:20 Comment(3)
If I leave it set to true, I get an exception (either a timeout or a maximum-redirects exception). In the browser it works fine in terms of reaching the right page. So my guess is that I am doing something wrong in reading the Location header. - Vyborg
Is there any chance you could post the actual URL so that I can take a look at it? Or maybe it is not publicly accessible? - Lorrin
In .NET, the parsing of headers is handled in a "pure ASCII" encoding that is encapsulated inside the WebHeaderCollection class. This is in compliance with RFC 2616. Whoever's handing out that Location header is DOING IT WRONG, but most browsers "just handle it" by assuming the charset is UTF-8 (what's in the actual octet stream). - Curcuma

I would not expect the returned string to be malformed. How are you determining that it is malformed? The string should be in a Unicode encoding such as UTF-8, which can represent the Greek text easily.

Could it be that you just don't have the Greek fonts needed to display the string?

Manhole answered 11/12, 2009 at 15:48 Comment(2)
By malformed I mean not in a readable encoding. This is what GetResponseHeader returns: "site.com/buy/…" - Vyborg
Hmm, in Visual Studio it appears a bit different, but as you can see the middle part is still ruined. - Vyborg

As Darin Dimitrov explains, I believe that the header encoding problem is caused by a bug in the HttpWebResponse class. We had the same issue when we wanted to add a cookie to the header (Set-Cookie) and the cookie contained non-ASCII characters. In our specific case these were the Norwegian letters 'Æ', 'Ø' and 'Å' (in upper and lower case). We couldn't figure out how to get the HeaderEncoding to work, but we found a workaround: Base64-encoding the cookie. Note that this only works if you control both the client and the server side (or can convince the people in charge of the server-side code to add the Base64 encoding for you).

On the server side:

var cookieData = "This text contains Norwegian letters; ÆØÅæøå";
var cookieDataAsUtf8Bytes = System.Text.Encoding.UTF8.GetBytes(cookieData);
var cookieDataAsUtf8Base64Encoded = Convert.ToBase64String(cookieDataAsUtf8Bytes);
var cookie = new HttpCookie("MyCookie", cookieDataAsUtf8Base64Encoded);
response.Cookies.Add(cookie);

On the client side:

var cookieDataAsUtf8Bytes = Convert.FromBase64String(cookieDataAsUtf8Base64Encoded);
var cookieData = System.Text.Encoding.UTF8.GetString(cookieDataAsUtf8Bytes);

Note that cookieDataAsUtf8Base64Encoded on the client side is the data part of the cookie (that is 'MyCookie=[data]', where 'MyCookie=' is stripped away).

Shondrashone answered 5/9, 2011 at 7:15 Comment(0)
