How to get the length of Japanese characters in Javascript?

Asked 12/7, 2012 at 14:22 Answered 13/7, 2012 at 9:10

javascript unicode asp-classic cjk shift-jis

I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:

<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">

My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.

if(document.frmPage.txtName.value.length > 200) {
  alert("You have exceeded the maximum length of 200.");
  return false;
}

The problem is that Javascript is not getting the correct length of Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters, like ケ, have a length of 2 or 3 characters in SHIFT_JIS.

If I depend only on the length provided by Javascript, long Japanese text would pass the page validation and the page would try to save it to the database, which would then fail because of the 200 maximum length of the DB column.

The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?

Thanks for the help!

Teodor answered 12/7, 2012 at 14:22 Comment(7)

Is the database field not of a Unicode datatype then? Sounds to me like you are falling foul of a common ASP form character encoding gotcha. – Leblanc 12/7, 2012 at 15:22

See: https://mcmap.net/q/1038840/-classic-asp-how-to-convert-a-utf-8-string-to-ucs-2 – Leblanc 12/7, 2012 at 15:27

hi. my database has an ISO_1 character set with collation SQL_Latin1_General_CP1_CI_AS. the DB field is of type NVARCHAR(200). – Teodor 13/7, 2012 at 1:53

to make my question clearer, I have another ASP classic page (Page2) that does the database saving. Page1 (indicated above) posts the inputted Japanese text to Page2, which also has a SHIFT_JIS encoding. I tried to use Response.Write Len(Request.Form("txtName")) in Page2 and the output is different from the length returned by Javascript using alert(document.frmPage.txtName.value.length). What I want to get in Javascript of Page1 is the length returned by Response.Write in Page2 in order to correctly handle the validation. thanks! :) – Teodor 13/7, 2012 at 1:54

In an NVARCHAR(200) the character you mention occupies 1 character. I now strongly suspect that you are falling foul of the character encoding problem. Please read my linked answer very carefully; despite it mentioning different encodings to the one you are dealing with, the principle still applies. Have you eyeballed the values stored in the table directly using SQL Server Manager? Do they look as typed or do they appear garbled? Are you sure you have set the Response.CodePage of Page2.asp before you attempt to access the Form field? – Leblanc 14/7, 2012 at 20:20

Hi Anthony. So here's what I tried: (1) In Page1.asp, I changed the charset to UTF-8; (2) In Page2.asp, I set the CodePage to 65001 and the charset to UTF-8; (3) I modified my insert and update statements by adding 'N' before the Japanese values to be saved. I was able to get the correct length and save them successfully, but I noticed that the Japanese data retrieved from the database are all question marks (????). I used Server.HTMLEncode(<data values>) and all worked fine, but is it the right way of retrieving the data? Thanks. – Teodor 16/7, 2012 at 3:20

Oops, I just forgot to add CodePage=65001 in the page that gets data from the database. Everything is working fine now using UTF-8. The characters are properly displayed even without Server.HTMLEncode. – Teodor 16/7, 2012 at 5:33

For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding

Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.

In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:

function getShiftJISByteLength(s) {
    return s.replace(/[^\x00-\x80｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝ ﾞ ﾟ]/g, 'xx').length;
}

However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
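As a usage sketch, the same idea can be written with code point ranges instead of a literal list of characters and wired into the 200 limit from the question. The names shiftJISByteLengthEstimate and fitsInLimit are made up for illustration. The estimate assumes the input contains only characters that exist in Shift-JIS, and it counts ASCII (up to U+007F) and halfwidth katakana (U+FF61 to U+FF9F) as one byte and everything else as two, so it is an approximation rather than a real encoder:

```javascript
// Hypothetical helper: estimates the Shift-JIS byte length of a string.
// Assumption: every character in s is encodable in Shift-JIS.
function shiftJISByteLengthEstimate(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        // ASCII and halfwidth katakana are treated as single-byte.
        var singleByte = code <= 0x7F || (code >= 0xFF61 && code <= 0xFF9F);
        bytes += singleByte ? 1 : 2;
    }
    return bytes;
}

// Hypothetical validation helper: true if the estimate fits in maxBytes.
function fitsInLimit(s, maxBytes) {
    return shiftJISByteLengthEstimate(s) <= maxBytes;
}
```

In the question's handler this could replace the .length comparison, for example by testing !fitsInLimit(document.frmPage.txtName.value, 200).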

Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed literally &#27979; or 测. And if you are displaying the submitted content &#27979; as 测 then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
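To see where the figure of 8 comes from, the decimal character reference for a character can be built from its code point. This is only a minimal sketch: toCharacterReference is a hypothetical name, and it assumes a single character in the Basic Multilingual Plane (one UTF-16 code unit):

```javascript
// Hypothetical helper: builds the decimal HTML character reference
// that stands in for a character the target charset cannot encode.
// Assumption: ch is a single BMP character.
function toCharacterReference(ch) {
    return '&#' + ch.charCodeAt(0) + ';';
}

var ref = toCharacterReference('\u6D4B'); // U+6D4B is the character 测
// ref is the 8-character string "&#27979;", while '\u6D4B'.length is 1
```

That matches the mismatch described in the question: one character on the client, eight characters once the form has been submitted.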

The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:

function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}

although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
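If relying on the legacy unescape function is a concern, the UTF-8 byte count can also be worked out directly from the UTF-16 code units. The following is a sketch under that assumption, not the answer's own method; utf8ByteLength is a hypothetical name:

```javascript
// Hypothetical alternative to the unescape/encodeURIComponent trick:
// counts UTF-8 bytes by walking the UTF-16 code units of the string.
function utf8ByteLength(s) {
    var bytes = 0;
    for (var i = 0; i < s.length; i++) {
        var code = s.charCodeAt(i);
        if (code < 0x80) {
            bytes += 1;                       // ASCII
        } else if (code < 0x800) {
            bytes += 2;                       // U+0080 to U+07FF
        } else if (code >= 0xD800 && code <= 0xDBFF &&
                   i + 1 < s.length &&
                   s.charCodeAt(i + 1) >= 0xDC00 &&
                   s.charCodeAt(i + 1) <= 0xDFFF) {
            bytes += 4;                       // surrogate pair: one code point
            i++;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP
        }
    }
    return bytes;
}
```

One design difference to be aware of: an unpaired surrogate is counted here as 3 bytes, whereas encodeURIComponent throws a URIError on such input.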

Expert answered 13/7, 2012 at 9:10 Comment(7)

Thanks bobince for the answer. I tried to use UTF-8 charset (instead of Shift-JIS) in both pages but Japanese characters are not properly rendered, unless I change back to Shift-JIS charset. Is it possible to display Japanese characters on the page using UTF-8? – Teodor 13/7, 2012 at 9:56

Yes, "UTF" encodings can display all Unicode characters. Firstly make sure to save your document from a text editor as UTF-8 rather than Shift-JIS, to correct the static in-page text. Then you need to make sure you're talking to the database safely... not too familiar with SQL Server under Classic ASP, but this question suggests setting codepage 65001 (the Windows equivalent of UTF-8) might help. – Expert 13/7, 2012 at 16:33

@markuy: That could be because the content of the DB so far entered is already messed up by your receiving page not having Response.CodePage set to the correct value (using @ CODEPAGE would do that also) before your code accesses the form fields. Ultimately, using NVARCHAR fields, you really really should switch to UTF-8 everywhere; it makes life so much simpler once you are up and running. – Leblanc 14/7, 2012 at 20:27

Hi anthony. Yup, the characters that are saved in the database are garbled and not properly written. I will double check my code this Monday and see if the SQL insert/update statements have 'N' before the values to be thrown to the database. Thanks. – Teodor 15/7, 2012 at 2:10

@Markuy: That would indicate a) you are still not looking in the right place for your problem and b) you are using string concatenation to build your inserts and updates so you potentially have a SQL injection vulnerability. Always use parameterised queries. – Leblanc 16/7, 2012 at 12:30

@Leblanc everything is working fine now. thanks. my concatenated insert/update statements are actually inside a VB 6.0 DLL called by my ASP classic pages. – Teodor 17/7, 2012 at 1:3

@markuy: Glad you got things working but regardless of where the concatenation occurs if the data being concatenated ultimately arrives into the server from a third-party your application (and your customer) is at risk. – Leblanc 17/7, 2012 at 7:49

You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...

Vowel answered 13/7, 2012 at 9:9 Comment(0)
