What character encoding should I use for an HTTP header?

I'm using a "fun" HTML special character (✰) (see http://html5boilerplate.com/ for more info) in a Server HTTP header and am wondering whether it is "allowed" per the spec.

  • Using the Network tab of the dev tools in Chrome on Windows XP Pro SP3, I see the ✰ just fine.

  • In IE8 the ✰ is not rendered correctly.

  • The w3.org HTML validator does not render it correctly (displays "â°" instead).

Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)


Is the disparity caused by bugs in the different parsers/browsers/engines/(whatever-they-are-called)?

Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?

Wareing answered 9/12, 2010 at 16:35 Comment(3)
This question would be much better asked generally: "Which characters are allowed in an HTTP header value?" – Grafton
related: What encoding should I use for HTTP Basic Authentication? – Mccurry
"Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)" <----- Obligatory link to joelonsoftware.com/2003/10/08/… – Bona

In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.

HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.


Previously, RFC 2616 from 1999 defined this:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

and RFC 2047 is the MIME encoding, so it'd be:

=?UTF-8?Q?=E2=9C=B0?=

but I don't think that many (if any) clients support it.
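
As a rough illustration of the RFC 2047 form above, here is a minimal Python sketch that builds the Q-encoded word by hand and decodes it again with the standard library's `email.header.decode_header`. The helper name `q_encode_word` is made up for this example, and it escapes every byte, which is more than RFC 2047 requires but keeps the sketch short. It says nothing about whether a given HTTP client will actually decode such a value.

```python
from email.header import decode_header


def q_encode_word(text: str, charset: str = "UTF-8") -> str:
    """Hypothetical helper: wrap text as an RFC 2047 "Q" encoded-word.

    Every byte is written as =XX for simplicity, even bytes that
    could legally appear unescaped.
    """
    escaped = "".join(f"={byte:02X}" for byte in text.encode(charset))
    return f"=?{charset}?Q?{escaped}?="


encoded = q_encode_word("✰")
print(encoded)  # =?UTF-8?Q?=E2=9C=B0?=

# decode_header returns a list of (bytes, charset) pairs for encoded-words.
raw_bytes, charset = decode_header(encoded)[0]
print(raw_bytes.decode(charset))  # ✰
```

The encoded form is pure ASCII, so it survives transport; the catch is that the recipient has to know to decode it.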

Victim answered 10/12, 2010 at 15:23 Comment(4)
So, what does that mean? Is "✰" valid/allowed? – Wareing
To expand a bit on a very useful answer: "UTF-8" is the character set, and "Q" means the value will be "quoted-printable". "B" could also be used if you wanted to BASE64-encode the value. – Longsighted
@porneL, So what does "opaque data" mean? What exactly should the HTTP recipient do when it receives this "opaque data"? – Laith
@Laith "opaque data" means it's a black box with a bunch of bytes that applications shouldn't try to display or interpret (like binary data). What happens with it depends on the header, and might range from "nothing" to "discard". – Victim

Please read the comments first: this answer likely draws wrong conclusions from the right sources and needs an edit.


You can use any printable ASCII chars, but no special chars like ✰ (which is not ASCII).

Tip: you can encode anything in JSON.

Edit: it may not be obvious at first, but the character encoding declared in the header applies only to the response body, not to the header itself (otherwise there would be a chicken-and-egg problem).


I'd like to sum up all the relevant definitions as per the spec linked by Penchant.

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )

So, we are after field-value.

LWS            = [CRLF] 1*( SP | HT )
CRLF           = CR LF
CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>

LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.

Let's simplify it to this:

field-value    = <any field-content or Space or Tab>

Now we are after field-content.

field-content  = <the OCTETs making up the field-value
                 and consisting of either *TEXT or combinations
                 of token, separators, and quoted-string>
OCTET          = <any 8-bit sequence of data>
TEXT           = <any OCTET except CTLs,
                 but including LWS>
CTL            = <any US-ASCII control character
                 (octets 0 - 31) and DEL (127)>
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

TEXT is the most general and includes all the rest, so forget about the rest. Here is the US-ASCII charset (= ASCII)

As you can see, all printable ASCII chars are allowed.
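
To make the TEXT rule quoted above concrete, here is a small Python sketch that classifies the bytes of a header value. The function name `classify_header_value` and its three labels are invented for this example. It is also a simplification: it accepts a bare horizontal tab as whitespace and rejects CR and LF outright, so line folding (CRLF followed by SP or HT) is not handled.

```python
# RFC 2616: CTL = octets 0-31 and DEL (127).
CTLS = set(range(0, 32)) | {127}
HT = 9  # horizontal tab, allowed here as part of LWS (simplification)


def classify_header_value(value: bytes) -> str:
    """Hypothetical helper: label a field-value by the octets it contains.

    "invalid"        -> contains a CTL other than HT
    "ascii"          -> only printable US-ASCII, SP, or HT
    "non-ascii-text" -> matches TEXT but uses octets 128-255
    """
    if any(octet in CTLS and octet != HT for octet in value):
        return "invalid"
    if all(octet < 128 for octet in value):
        return "ascii"
    return "non-ascii-text"


star = "✰".encode("utf-8")
print(list(star))                        # [226, 156, 176]
print(classify_header_value(b"Apache"))  # ascii
print(classify_header_value(star))       # non-ascii-text
```

Under this reading, the three octets of ✰ satisfy the TEXT grammar but fall outside US-ASCII, which is why the grammar alone does not guarantee that a recipient will display them correctly.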

Vatican answered 19/6, 2012 at 19:45 Comment(5)
You are contradicting the passages you quoted. Why do you say "and no special chars like ✰"? Special characters are just OCTETs, and since TEXT is any OCTET except 0 - 31 and 127, this means that all the other OCTETs from 32 to 255 are allowed. The octets of ✰ are 226, 156, and 176 and all three of them are allowed, therefore ✰ is allowed according to the passages you quoted. – Laith
@Laith you seem completely right, I don't see why I drew the conclusion I did. – Vatican
@Laith yet I'm not ready to edit it, as I need to double-check the spec again. I'm afraid additional details restrict it to the US-ASCII charset, which in turn would support the conclusion yet render the reasoning insufficient. – Vatican
Saying "you can encode anything in JSON" is a bit misleading. JSON allows for Unicode characters, whereas HTTP headers should be US-ASCII. Unicode characters would be treated as "opaque" data and thus the behavior is undefined by the HTTP specification. That being said, JSON can be made safe for inclusion in an HTTP header by escaping the Unicode characters via the \uXXXX escape sequence. – Biosynthesis
@zupa, Another issue is... what does "except CTLs" mean? Does it mean the characters CR, LF are allowed? Or does it mean only the continuous sequence "CR LF SP/HT" is allowed? (In other words, can header values contain a single CR or LF or HT? Can header values contain the characters CR, LF, and HT in any order and amount?) – Laith
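
Regarding the comment above about escaping JSON with \uXXXX: a minimal Python sketch, assuming the value is serialized with the standard `json` module. The payload key `server` is made up for the example. `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`), so the result contains only US-ASCII.

```python
import json

# Hypothetical payload containing a non-ASCII character.
payload = {"server": "✰"}

# ensure_ascii=True is the default; non-ASCII becomes \uXXXX escapes.
header_safe = json.dumps(payload)
print(header_safe)            # {"server": "\u2730"}
print(header_safe.isascii())  # True

# The recipient gets the original character back after parsing.
print(json.loads(header_safe) == payload)  # True
```

This only works if the recipient knows the header value is JSON and parses it; a client that displays the raw value will show the escape sequence, not the star.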
