For HTTP responses with Content-Types suggesting character data, which charset should be assumed by the client if none is specified?
B

6

12

If no charset parameter is specified in the Content-Type header, RFC 2616 section 3.7.1 seems to imply that ISO-8859-1 should be assumed for media types of type "text":

When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value.

However, I routinely see applications that serve JavaScript files with Content-Type values like "application/x-javascript" (i.e. no charset parameter), even when these scripts contain non-ASCII UTF-8 characters, which would be corrupted if interpreted as ISO-8859-1.

This does not seem to pose problems to clients. How do clients know to interpret the bytes as UTF-8? Is there a rule for other character-data subtypes that implies UTF-8 should be the default? Where is this documented?
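To make the corruption concern concrete, here is a minimal Python sketch. The script body is a made-up example, and it only shows what would happen if a client did apply ISO-8859-1 to UTF-8 bytes; it says nothing about what any real browser does.

```python
# Hypothetical script body containing one non-ASCII character.
source = 'var greeting = "café";'
utf8_bytes = source.encode("utf-8")

# Interpreting the UTF-8 bytes as ISO-8859-1 turns the two-byte
# sequence for "é" (0xC3 0xA9) into two separate characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Interpreting them as UTF-8 recovers the original text.
as_utf8 = utf8_bytes.decode("utf-8")

print(as_latin1)  # var greeting = "cafÃ©";
print(as_utf8)    # var greeting = "café";
```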

Buchheim answered 24/2, 2010 at 11:31 Comment(0)
S
15

All major browsers I've checked (IE, FF and Opera) completely ignore the RFC specification in this part.

If you are interested in the algorithm for auto-detecting the charset from the data, see the Mozilla Firefox link below.

Just a small note about content types: Only text has character sets. It's reasonable to assume that browsers handle application/x-javascript the same way they handle text/javascript (except IE6, but that's another subject).

Internet Explorer will use the default charset (probably stored in the registry), as noted:

By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document. It uses the user's preferences if no meta element is specified.

Source: http://msdn.microsoft.com/en-us/library/ms537500%28VS.85%29.aspx

Mozilla Firefox attempts to auto-detect the charset, as described here:

This paper presents three types of auto-detection methods to determine encodings of documents without explicit charset declaration.

Source: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

Opera uses auto-detection too, as documented:

If the transport protocol provides an encoding name, that is used. If not, Opera will look at the page for a charset declaration. If this is missing, Opera will attempt to auto-detect the encoding, using the domain name to see if the script is a CJK script, and if so which one. Opera can also auto-detect UTF-8.

Source: http://www.opera.com/docs/specs/opera9/
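The fallback order in the quotes above (transport header, then in-document declaration, then detection or user preference) can be sketched roughly as follows. This is a simplified illustration, not any browser's actual algorithm: the function name `pick_encoding`, the regular expression, the 1024-byte scan window, and the `windows-1252` last resort are all assumptions made for the example.

```python
import re
from typing import Optional

# Loose pattern for a charset declaration inside a <meta> tag (illustrative only).
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def pick_encoding(header_charset: Optional[str], body: bytes,
                  user_default: str = "windows-1252") -> str:
    # 1. A charset supplied by the transport protocol wins.
    if header_charset:
        return header_charset.lower()
    # 2. Otherwise look for a declaration near the start of the content.
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Otherwise "detect" UTF-8 by checking that the bytes are valid UTF-8.
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 4. Last resort: a configured default (hypothetical value).
        return user_default
```

Real detectors are far more elaborate (the Mozilla paper describes statistical methods), but the validity check in step 3 is one reason unlabeled UTF-8 scripts often work: byte sequences that happen to be valid UTF-8 are rarely anything else.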

Selfdriven answered 27/2, 2010 at 22:47 Comment(0)
E
2

As described in RFC 4329, application/javascript can also have a charset parameter. How browser implementations handle it is a separate question, which I have not tested.

Eft answered 28/2, 2010 at 23:13 Comment(0)
G
2

In the absence of the charset parameter, the character encoding can be specified in the content. Here are some approaches taken by several content types:

HTML - Via the meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

HTML5 variant:

<meta charset="utf-8">

XML (XHTML, KML) - Via the XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

Text - Via the Byte order mark. For example, for UTF-8 the first three bytes of a file in hexadecimal:

EF BB BF
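As a rough illustration of the byte order mark approach, the following Python sketch checks the leading bytes of a payload against the BOM constants in the standard `codecs` module. The helper name `sniff_bom` and the returned encoding labels are choices made for this example.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16, because the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(b"\xef\xbb\xbfalert(1);"))  # utf-8
print(sniff_bom(b"alert(1);"))              # None
```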

As distinct from the character set associated with the document, note also that non-ASCII characters can be encoded via ASCII character sequences using various approaches:

HTML - Via character references:

&#nnnn;
&#xhhhh;

XML - Via entity references:

&amp;
&defined-entity;

JSON - Via the escaping mechanism:

\u005C
\uD834\uDD1E
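The JSON escapes above can be checked with Python's standard `json` module. This is only a small demonstration that the escaped form is pure ASCII regardless of the document's encoding; the variable names are arbitrary.

```python
import json

# Decode a JSON string containing the two escapes shown above.
decoded = json.loads(r'"\u005C \uD834\uDD1E"')

# \u005C is a backslash; the surrogate pair \uD834\uDD1E is combined
# into the single code point U+1D11E (musical symbol G clef).
print([hex(ord(ch)) for ch in decoded])  # ['0x5c', '0x20', '0x1d11e']

# Encoding with the default ensure_ascii=True escapes every non-ASCII
# character, so the output survives any ASCII-compatible charset guess.
encoded = json.dumps("\U0001D11E")
print(encoded)  # "\ud834\udd1e"
```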

Now, with respect to the HTTP 1.1 protocol, RFC 2616 says this about charset:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

So, my interpretation of the above is that one cannot assume a default character set except for media subtypes of the type "text." Of course, we live in the real world and implementers do not always follow the rules. As described in the accepted answer, the various web browser vendors have implemented their own strategies for determining the document character set when it is not explicitly specified. One can assume that vendors of other clients (e.g., Google Earth) also implement their own strategies.
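Read literally, the RFC 2616 rule could be sketched as below. This uses the header parser from Python's standard `email` package; the function name `rfc2616_charset` is hypothetical, and returning `None` for non-text types reflects the interpretation above (no default is defined), not anything browsers actually do.

```python
from email.message import Message
from typing import Optional

def rfc2616_charset(content_type: str) -> Optional[str]:
    """Charset per a literal reading of RFC 2616 section 3.7.1."""
    msg = Message()
    msg["Content-Type"] = content_type
    # An explicit charset parameter always wins (returned lowercased).
    explicit = msg.get_content_charset()
    if explicit:
        return explicit
    # Only the "text" type has a defined default.
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"
    # For anything else the RFC defines no default.
    return None

print(rfc2616_charset("text/html; charset=UTF-8"))   # utf-8
print(rfc2616_charset("text/javascript"))            # iso-8859-1
print(rfc2616_charset("application/x-javascript"))   # None
```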

Gloria answered 10/10, 2013 at 14:35 Comment(2)
Character references or escapes have nothing to do at all with the character encoding of the enclosing document...Midget
@Julian - Agreed. I restructured my answer accordingly. (I do feel that including mention of character references and escaping is worthwhile.)Gloria
A
1

RFC 4329 defines the "application/javascript" media type as a replacement for "text/javascript", "application/x-javascript", and other similar types. Section 4.2 establishes the default character encoding to be UTF-8 when no explicit "charset" parameter is available and no Unicode BOM is present at the front of the data.

Acrid answered 5/3, 2010 at 2:47 Comment(1)
My interpretation of section 4.2 is not to assume that UTF-8 is the default character encoding. In addition, the intro to section 4 states: "How implementations determine the character encoding scheme can be subject to processing rules that are out of the scope of this document."Gloria
S
0

It's a bit special for XMLHttpRequest and is described here: http://www.w3.org/TR/XMLHttpRequest/

Stanger answered 24/2, 2010 at 12:27 Comment(0)
M
0

Pointing out the obvious: "application/x-javascript" is not a subtype of "text".

Also, the text in RFC 2616 is outdated. The next revision of HTTP/1.1 will not define a default. See RFC 6657 for further information.

Midget answered 24/2, 2010 at 13:27 Comment(4)
Agree - so the question is: Is there a rule for character-data subtypes other than "text"? If so, where is this documented?Buchheim
There is no general rule, as the media type might not be character based in the first place...Midget
The question is specifically about those media types that suggest character data. If there is no general rule, are there specific rules for different media types? Where are they documented? There must be at least some rules, given that clients have to make a decision on how to interpret the bytes.Buchheim
Specific rules should be in the document the media type registration points to, such as tools.ietf.org/html/rfc3023#section-3.2 for application/xml.Midget
