TIdHTTP character encoding of POST response

Take the following situation:

procedure Test;

var
 Response : String;

begin
 Response := IdHttp.Post(MyUrL, AStream);
 DoSomethingWith(Response);
end;

Now the web server returns data in UTF-8. Suppose it returns some UTF-8 XML containing the character é. The Response variable then does not contain this character but its UTF-8 byte sequence (#$C3#$A9) instead, so it seems Indy did no decoding?

Now I know how to solve this problem:

procedure Test;

var
 Response : String;

begin
 Response := UTF8ToString(IdHttp.Post(MyUrL, AStream));
 DoSomethingWith(Response);
end;

One caveat with this solution: Delphi raises warning W1058 (Implicit string cast with potential data loss from 'string' to 'RawByteString').

My question: is this the correct way to deal with this problem, or can I instruct TIdHTTP to do the conversion to UnicodeString for me?

Torritorricelli answered 16/9, 2013 at 15:42 Comment(0)

If you are using an up-to-date version of Indy 10, the overloaded version of TIdHTTP.Post() that returns a String does decode the data to Unicode. However, the charset used for the decoding depends on the media type specified in the HTTP Content-Type response header:

  1. if the media type is either application/xml, application/xml-external-parsed-entity, application/xml-dtd, or is not a text/... type but does end with +xml, then the charset specified in the encoding attribute of the XML's prolog is used. If no charset is specified, UTF-8 is used.

  2. otherwise, if the Content-Type response header specifies a charset, then it is used.

  3. otherwise, if the media type is a text/... type, then:

    a. if the media type is text/xml, text/xml-external-parsed-entity, or ends with +xml, then us-ascii is used.

    b. otherwise ISO-8859-1 is used.

  4. otherwise, Indy's default encoding (ASCII by default) is used.
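
The rules above can be sketched as a small decision function. This is only an illustration of the list, not Indy's actual source: the unit name, the function GuessResponseCharset and its three parameters are hypothetical, and the caller is assumed to have already extracted the media type, the header charset and the XML prolog encoding (empty string when absent).

```pascal
unit CharsetRulesSketch;

interface

// Hypothetical helper mirroring rules 1-4 above; not part of Indy.
function GuessResponseCharset(const MediaType, HeaderCharset,
  PrologEncoding: string): string;

implementation

uses
  System.SysUtils, System.StrUtils;

function GuessResponseCharset(const MediaType, HeaderCharset,
  PrologEncoding: string): string;
var
  IsText, IsXmlApp: Boolean;
begin
  IsText := StartsText('text/', MediaType);

  // Rule 1: XML media types outside text/... use the prolog encoding
  IsXmlApp := SameText(MediaType, 'application/xml') or
    SameText(MediaType, 'application/xml-external-parsed-entity') or
    SameText(MediaType, 'application/xml-dtd') or
    ((not IsText) and EndsText('+xml', MediaType));

  if IsXmlApp then
  begin
    if PrologEncoding <> '' then
      Result := PrologEncoding
    else
      Result := 'UTF-8';
  end
  // Rule 2: an explicit charset in the Content-Type header wins
  else if HeaderCharset <> '' then
    Result := HeaderCharset
  // Rule 3: text/... types without a charset
  else if IsText then
  begin
    if SameText(MediaType, 'text/xml') or
       SameText(MediaType, 'text/xml-external-parsed-entity') or
       EndsText('+xml', MediaType) then
      Result := 'us-ascii'      // rule 3a
    else
      Result := 'ISO-8859-1';   // rule 3b
  end
  // Rule 4: stand-in for Indy's default encoding (ASCII by default)
  else
    Result := 'us-ascii';
end;

end.
```

For example, under these rules a response with media type text/plain and no charset would fall through to rule 3b and be treated as ISO-8859-1.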

Without seeing the actual HTTP Content-Type header, it is hard to know which condition applies to your situation. It sounds like either #2 or #3b. With ISO-8859-1, each byte decodes to the character with the same numeric value, which would account for the UTF-8 byte values (#$C3#$A9) being returned as-is if ISO-8859-1 or a similar charset is being used.

UTF8ToString() expects a UTF-8 encoded RawByteString as input, but you are passing it a UTF-16 encoded UnicodeString instead. In that situation the RTL performs an implicit UTF-16 to ANSI conversion, using the default ANSI charset. That is why you get the compiler warning: such a conversion can lose data.
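
To give UTF8ToString() the input it expects, the raw response bytes have to be captured before any Unicode decoding takes place. The following is a minimal sketch under two assumptions: the response is known to be UTF-8, and a desktop compiler is used (UTF8String and AnsiChar are available). The unit name and the helper PostAndDecodeUtf8 are hypothetical.

```pascal
unit Utf8PostSketch;

interface

uses
  System.Classes, IdHTTP;

// Hypothetical helper: posts ARequest and decodes the raw response as UTF-8.
function PostAndDecodeUtf8(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): string;

implementation

function PostAndDecodeUtf8(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): string;
var
  Raw: TMemoryStream;
  Utf8: UTF8String;
  Len: Integer;
begin
  Raw := TMemoryStream.Create;
  try
    // Receive the undecoded response bytes into the stream
    AHttp.Post(AUrl, ARequest, Raw);

    // Copy the bytes into a UTF8String without any charset conversion
    Len := Integer(Raw.Size);
    SetLength(Utf8, Len);
    if Len > 0 then
    begin
      Raw.Position := 0;
      Raw.ReadBuffer(Utf8[1], Len);
    end;

    // UTF8String is passed as RawByteString, so no implicit cast (W1058) occurs
    Result := UTF8ToString(Utf8);
  finally
    Raw.Free;
  end;
end;

end.
```

Because the input is already a byte string, the compiler has nothing to convert implicitly and the data-loss warning should not appear.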

XML is really a binary data format, subject to charset encodings. An XML parser needs to know what the XML's encoding is, and be able to parse the raw encoded bytes accordingly. That is why XML has an explicit encoding attribute right in the XML prolog. However, when TIdHTTP downloads XML as a String, although it does automatically decode it to Unicode, it does not yet update the XML's prolog accordingly.

The real solution is to not download XML as a String in the first place. Download it as a TStream instead (TMemoryStream is a better choice than TStringStream) so your XML parser has access to the original bytes, the original charset declaration, etc. You can pass the TStream to the TXMLDocument.LoadFromStream() method, for instance.
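
A minimal sketch of that approach follows. The unit name and the helper PostAndParseXml are hypothetical, unit scope names assume Delphi XE2 or later, and with the default MSXML vendor on Windows the calling thread is assumed to have COM initialized already.

```pascal
unit XmlPostSketch;

interface

uses
  System.Classes, Xml.XMLIntf, IdHTTP;

// Hypothetical helper: posts ARequest and parses the raw response bytes as XML.
function PostAndParseXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;

implementation

uses
  Xml.XMLDoc;

function PostAndParseXml(AHttp: TIdHTTP; const AUrl: string;
  ARequest: TStream): IXMLDocument;
var
  ResponseBytes: TMemoryStream;
begin
  ResponseBytes := TMemoryStream.Create;
  try
    // Receive the original bytes; no charset decoding happens here
    AHttp.Post(AUrl, ARequest, ResponseBytes);
    ResponseBytes.Position := 0;

    // With a nil owner the document is reference-counted via the interface
    Result := TXMLDocument.Create(nil);

    // The parser reads the encoding from the XML prolog itself
    Result.LoadFromStream(ResponseBytes);
  finally
    ResponseBytes.Free;
  end;
end;

end.
```

The document is populated before the stream is freed, so the returned IXMLDocument can be used after the function exits.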

Acadia answered 16/9, 2013 at 16:31 Comment(1)
Hi Remy, thanks for your clear answer. After inspecting the HTTP response header, I saw that no charset is specified, so in my case it was #3b. - Torritorricelli

You can do this:

var
  sstream: TStringStream;
begin
  // The stream decodes the received bytes as UTF-8 when DataString is read
  sstream := TStringStream.Create('', TEncoding.UTF8);
  try
    IdHttp.Post(MyUrL, AStream, sstream);
    DoSomethingWith(sstream.DataString);
  finally
    sstream.Free;
  end;
end;
Sal answered 16/9, 2013 at 16:1 Comment(2)
This only works if you know ahead of time that the response is always UTF-8. - Acadia
Hi Marko, thank you for your answer. This is actually the solution I was looking for, because parsing the XML is not needed in my specific case (and I KNOW that the response will be UTF-8). I accepted Remy's answer because it is the most correct answer :). - Torritorricelli

© 2022 - 2024 — McMap. All rights reserved.