How can I know the character set of HTML content by HTTP headers?
I know the charset= parameter of the HTTP Content-Type header can be used to determine the character set of the HTML content. But if that parameter is missing from the Content-Type header, how can I determine the character set of the HTML content?

I also know there is a tag such as

<meta charset="utf-8">

in HTML that specifies the character set. But we only get that tag after parsing the HTML, and parsing the HTML requires knowing the character set first.

Timon answered 3/6, 2017 at 13:36 Comment(2)
w3.org/TR/html5/… – Conlen
You don't need to know the actual charset of the HTML in order to parse it. You just need to know if it is using 8-bit, 16-bit, or 32-bit characters (8-bit is the most common), and that is easy to determine after a few bytes. The HTML tags themselves are ASCII-compatible, so it is possible to read them once you know the character width being used. And then once you find a suitable <meta> tag, you will know how to interpret the textual data that is outside of the HTML tags. – Giesecke
In the absence of an explicit charset attribute in the Content-Type header, different media types sent over different transports have different default character sets.

For instance, just to show a few definitions:

RFC 2046, Section 4.1.2 of the MIME specification says:

Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.

RFC 2616, Section 3.7.1 of the HTTP protocol specification says:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See section 3.4.1 for compatibility problems.

This default was later removed by RFC 7231, Appendix B:

The default charset of ISO-8859-1 for text media types has been removed; the default is now whatever the media type definition says. Likewise, special treatment of ISO-8859-1 has been removed from the Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3).

RFC 3023, Sections 3.1, 3.3, 3.6, and 8.5 of the XML Media Types spec say:

Conformant with [RFC2046], if a text/xml entity is received with the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII]. In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii". (Note: There is an inconsistency between this specification and HTTP/1.1, which uses ISO-8859-1[ISO8859] as the default for a historical reason. Since XML is a new format, a new default should be chosen for better I18N. US-ASCII was chosen, since it is the intersection of UTF-8 and ISO-8859-1 and since it is already used by MIME.)

The charset parameter of text/xml-external-parsed-entity is handled the same as that of text/xml as described in Section 3.1.

The following list applies to text/xml, text/xml-external-parsed-entity, and XML-based media types under the top-level type "text" that define the charset parameter according to this specification:

...

  • If the charset parameter is not specified, the default is "us-ascii". The default of "iso-8859-1" in HTTP is explicitly overridden.

This example shows text/xml with the charset parameter omitted. In this case, MIME and XML processors MUST assume the charset is "us-ascii", the default charset value for text media types specified in [RFC2046]. The default of "us-ascii" holds even if the text/xml entity is transported using HTTP.

Omitting the charset parameter is NOT RECOMMENDED for text/xml. For example, even if the contents of the XML MIME entity are UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding declaration, XML and MIME processors MUST assume the charset is "us-ascii".

RFC 7159, Sections 8.1 and 11 of the JSON specification say:

JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

Note: No "charset" parameter is defined for this registration. Adding one really has no effect on compliant recipients.

So, in general, if you want to know the charset used by a given resource, and that charset is not expressed through external means, like the charset attribute of a Content-Type header, then you have to determine what type of data you are dealing with, and then determine its charset based on what that data type's specification dictates.
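As an illustrative sketch of that first step, the charset parameter (when present) can be extracted from a Content-Type header value with Python's standard-library MIME machinery. The function name and the None fallback are my own assumptions; the value is normalised to lowercase because, per RFC 2046, charset values are not case sensitive:

```python
from email.message import Message

def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_param('charset')  # handles quoting, e.g. charset="UTF-8"
    # RFC 2046: charset values are NOT case sensitive, so normalise.
    return charset.lower() if charset else default
```

When this returns the default, you then fall back to the per-media-type rules quoted above.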

In your case, you are dealing with HTML over HTTP, so the RFC 2616 rule applies to you. The HTML 5 spec, Section 8.2.2.2, defines a very detailed algorithm for determining the HTML's charset when no charset attribute is specified in the Content-Type header. That algorithm first checks for the presence of a Unicode BOM; if none is present, it assumes an ASCII-compatible 8-bit encoding and prescans the markup for any <meta> tags that declare a character set.
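A heavily simplified sketch of that idea might look like the following. This is not the full HTML5 prescan (it ignores comments, <meta> tags inside scripts, and several special cases), and the final utf-8 fallback is my assumption rather than the spec's implementation-defined default:

```python
import re

def sniff_html_charset(body: bytes) -> str:
    # Step 1: check for a Unicode byte order mark.
    for bom, enc in ((b'\xef\xbb\xbf', 'utf-8'),
                     (b'\xff\xfe', 'utf-16-le'),
                     (b'\xfe\xff', 'utf-16-be')):
        if body.startswith(bom):
            return enc
    # Step 2: no BOM, so assume an ASCII-compatible 8-bit encoding and
    # prescan the first 1024 bytes for a <meta> charset declaration
    # (covers both <meta charset=...> and the http-equiv/content form).
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', body[:1024], re.I)
    if m:
        return m.group(1).decode('ascii').lower()
    return 'utf-8'  # assumed fallback; the real default is implementation-defined
```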

The XML 1.0 specification, Appendix F, likewise defines an algorithm that makes it easy to detect the encoding family used by the XML prolog, so you can then read the prolog's encoding declaration, if present, to determine the character set of the remaining XML.
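That two-stage detection could be sketched like this (illustrative only; it omits the BOM-prefixed and EBCDIC cases that Appendix F also covers):

```python
import re

# First-four-byte signatures from XML 1.0, Appendix F (no BOM present).
_SIGNATURES = {
    b'\x00\x00\x00<': 'utf-32-be',
    b'<\x00\x00\x00': 'utf-32-le',
    b'\x00<\x00?':    'utf-16-be',
    b'<\x00?\x00':    'utf-16-le',
    b'<?xm':          'ascii',      # any ASCII-compatible 8-bit encoding
}

def xml_encoding(data: bytes) -> str:
    # Detect the encoding family from the first four bytes, then decode
    # enough of the prolog to read the exact encoding declaration.
    family = _SIGNATURES.get(data[:4], 'utf-8')
    prolog = data[:200].decode(family, errors='replace')
    m = re.search(r'encoding\s*=\s*["\']([\w.-]+)["\']', prolog)
    return m.group(1).lower() if m else family
```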

Giesecke answered 7/6, 2017 at 20:39 Comment(0)
You're absolutely correct that you need to start parsing the HTML in order to see the <meta charset> element.

But this is standardised behavior: you must follow an encoding sniffing algorithm that starts processing the HTML source until it knows the encoding, then reparses the document with the known encoding. This imposes the limitations you imagine, so you should check out the specification as per Quentin's comment, as there are a lot of cases you need to be aware of.

Basically, your sniffer needs to be able to recognise UTF-16 byte order marks if the content may be UTF-16 (or UCS-2). It also needs to recognise "<!--" and "-->" in order to skip comments, and "<meta " or "<meta/" in order to identify the beginning of a meta element, which could use the "http-equiv", "content", or "charset" attributes.

When authoring HTML, you should ensure the <meta> element comes as early as possible in the file, within the first 1024 bytes, and preferably before the first occurrence of any non-ASCII character in the file.
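For example, a page that follows this advice declares the charset immediately after <head> opens:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Declared immediately: well inside the first 1024 bytes,
       and before any non-ASCII text appears. -->
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <p>Café, and any other non-ASCII text, is safe from here on.</p>
</body>
</html>
```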

Merritt answered 5/6, 2017 at 0:56 Comment(0)
