does anyone have an idea how to render unicode 'astral plane' characters (whose CIDs are beyond 0xffff) in google v8, the javascript vm that drives both google chrome and nodejs?
funnily enough, when i give google chrome (it identifies as 11.0.696.71, running on ubuntu 10.4) an html page like this:
<script>document.write( "helo" )
document.write( "𡥂 ⿸𠂇子" );
</script>
it will correctly render the 'wide' character 𡥂 alongside with the 'narrow' ones, but when i try the equivalent in nodejs (using console.log()
) i get a single � (0xfffd, REPLACEMENT CHARACTER) for the 'wide' character instead.
i have also been told that for whatever non-understandable reason google have decided to implement characters using a 16bit-wide datatype. while i find that stupid, the surrogate codepoints have been designed precisely to enable the 'channeling' of 'astral codepoints' through 16bit-challenged pathways. and somehow the v8 running inside of chrome 11.0.696.71 seems to use this bit of unicode-foo or other magic to do its work (i seem to remember years ago i always got boxes instead even on static pages).
ah yes, node --version
reports v0.4.10
, gotta figure out how to obtain a v8 version number from that.
update i did the following in coffee-script:
a = String.fromCharCode( 0xd801 )
b = String.fromCharCode( 0xdc00 )
c = a + b
console.log a
console.log b
console.log c
console.log String.fromCharCode( 0xd835, 0xdc9c )
but that only gives me
���
���
������
������
the thinking behind this is that since that braindead part of the javascript specification that deals with unicode appears to mandate? / not downright forbid? / allows? the use of surrogate pairs, then maybe my source file encoding (utf-8) might be part of the problem. after all, there are two ways to encode 32bit codepoints in utf-8: one is two write out the utf-8 octets needed for the first surrogate, then those for the second; the other way (which is the preferred way, as per utf-8 spec) is to calculate the resulting codepoint and write out the octets needed for that codepoint. so here i completely exclude the question of source file encoding by dealing only with numbers. the above code does work with document.write()
in chrome, giving 𐐀𝒜
, so i know i got the numbers right.
sigh.
EDIT i did some experiments and found out that when i do
var f = function( text ) {
document.write( '<h1>', text, '</h1>' );
document.write( '<div>', text.length, '</div>' );
document.write( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
document.write( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
console.log( '<h1>', text, '</h1>' );
console.log( '<div>', text.length, '</div>' );
console.log( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
console.log( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' ); };
f( '𩄎' );
f( String.fromCharCode( 0xd864, 0xdd0e ) );
i do get correct results in google chrome---both inside the browser window and on the console:
𩄎
2
0xd864
0xdd0e
𩄎
2
0xd864
0xdd0e
however, this is what i get when using nodejs' console.log
:
<h1> � </h1>
<div> 1 </div>
<div>0x fffd </div>
<div>0x NaN </div>
<h1> �����</h1>
<div> 2 </div>
<div>0x d864 </div>
<div>0x dd0e </div>
this seems to indicate that both parsing utf-8 with CIDs beyond 0xffff
and outputting those characters to the console is broken. python 3.1, by the way, does treat the character as a surrogate pair and can print the charactr to the console.
NOTE i've cross-posted this question to the v8-users mailing list.