Storing UTF-16/Unicode data in SQL Server

According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with the appropriate data types, nchar etc.); however, if there is a supplementary character, it is stored as two UCS-2 characters.

This brings the obvious issues with the string functions, namely that what is one character is treated as two by SQL Server.
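
For illustration, a minimal sketch of the problem (U+1D11E, the musical G clef, is just an arbitrary supplementary character, built here from its surrogate pair):

    -- U+1D11E is one character to the user, but two UCS-2 code units to SQL Server
    DECLARE @c NVARCHAR(10);
    SET @c = NCHAR(55348) + NCHAR(56606);   -- 0xD834 + 0xDD1E, the surrogate pair for U+1D11E

    SELECT LEN(@c)             AS [Len],        -- 2, not 1
           DATALENGTH(@c)      AS [Bytes],      -- 4
           SUBSTRING(@c, 1, 1) AS [FirstChar];  -- an unpaired high surrogate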

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5?

Milestone answered 30/4, 2009 at 3:39 Comment(5)
What string functions are broken please?Lysander
LEN will return the number of UCS-2 characters in the string, not the number of UTF-16 characters. SUBSTRING will split UTF-16 characters in half. Same goes for LEFT and RIGHT. UPPER and LOWER would also probably break. REVERSE would definitely break. CHARINDEX and PATINDEX also. Not sure about DIFFERENCE and STUFF. So a lot of them....Milestone
Thanks for pointing this out. The fact that it doesn't support ALL Unicode characters means that some UTF-16 string values (e.g. from Windows or .NET) are not valid to dump into SQL Server without verification. In order for any application to be bug-free and technically correct (how RARE bug-causing characters are doesn't make a bit of difference as far as correctness goes), ALL strings must be validated to contain UCS-2-compatible characters before being stored in SQL Server. Wonderful! Way to make my job that much harder Microsoft.Twaddle
@Twaddle stripping out all but values 0 - 65535 is unnecessary, inappropriate, and probably foolish. Supplementary Characters are not the only characters to act in non-intuitive ways. Combining characters that are perfectly valid in UCS-2, and sort and compare correctly, also have issues with many / most of the built-in string functions.Brendanbrenden
SQL Server 2019 has finally introduced support for UTF-8 (techcommunity.microsoft.com/t5/sql-server-blog/…). In many situations, it's probably better to use UTF-8 and forget UTF-16 altogether.Attainture
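
For reference, a rough sketch of that SQL Server 2019 UTF-8 option (the table is made up for illustration; Latin1_General_100_CI_AS_SC_UTF8 is one of the built-in UTF-8-enabled collations):

    -- SQL Server 2019+: UTF-8 storage in varchar via a *_UTF8 collation
    CREATE TABLE dbo.Utf8Demo
    (
        Txt VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    );

    INSERT INTO dbo.Utf8Demo (Txt) VALUES (N'déjà vu');

    SELECT Txt, DATALENGTH(Txt) AS [Bytes]   -- byte count of the UTF-8 encoding
    FROM dbo.Utf8Demo;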

SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.
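
As a quick sketch of the difference (the two collations are built-in 2012 collations chosen for illustration; the _SC suffix is what enables supplementary-character awareness):

    DECLARE @c NVARCHAR(10);
    SET @c = NCHAR(55348) + NCHAR(56606);   -- surrogate pair for U+1D11E

    SELECT LEN(@c COLLATE Latin1_General_100_CI_AS)    AS [NonSC],   -- 2
           LEN(@c COLLATE Latin1_General_100_CI_AS_SC) AS [WithSC];  -- 1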

Attainture answered 11/10, 2012 at 2:57 Comment(2)
While true that SQL Server 2012 introduced the _SC collations which have proper handling of Supplementary Characters, the Question is very specific about pertaining to SQL Server 2005. Also, it is not "UTF-16 + surrogate pairs" since UTF-16 = "UCS-2 + surrogate pairs".Brendanbrenden
@SolomonRutzky, yes, that's why I said "including"Attainture

The string functions work fine with unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two characters. The only ones to watch for are len() and datalength(), which return different values when using unicode. They return the correct values of course - len() returns the length in characters, and datalength() returns the length in bytes. They just happen to be different because of the two-byte characters.
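
A quick illustration of that LEN()/DATALENGTH() difference for plane-0 characters:

    DECLARE @s NVARCHAR(20);
    SET @s = N'héllo';

    SELECT LEN(@s)        AS [Chars],   -- 5 characters
           DATALENGTH(@s) AS [Bytes];   -- 10 bytes (2 bytes per character)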

So, as long as you use the proper functions in your code, everything should work transparently.

EDIT: Just double-checked Books Online: Unicode data has worked seamlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set, due to a lack of support for parsing surrogate pairs for characters outside of plane 0 (in other words, SQL Server's string functions only recognize up to 2 bytes per character). SQL Server will store and return the data correctly; however, any string function that relies on character counts will not return the expected values. The most common way to bypass this seems to be either processing the string outside SQL Server, or else using the CLR integration to add Unicode-aware string processing functions.
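
As a rough illustration of what such a routine has to do, here is a pure T-SQL sketch (dbo.CodePointCount is a hypothetical helper, written against SQL Server 2005 syntax) that counts code points rather than UCS-2 code units by skipping the second half of each surrogate pair:

    -- Hypothetical helper: counts Unicode code points by treating a high
    -- surrogate (0xD800-0xDBFF, i.e. 55296-56319) plus the code unit that
    -- follows it as a single character.
    CREATE FUNCTION dbo.CodePointCount (@s NVARCHAR(4000))
    RETURNS INT
    AS
    BEGIN
        DECLARE @i INT, @units INT, @count INT;
        SET @i = 1;
        SET @units = DATALENGTH(@s) / 2;   -- UCS-2 code units, including trailing spaces
        SET @count = 0;

        WHILE @i <= @units
        BEGIN
            IF UNICODE(SUBSTRING(@s, @i, 1)) BETWEEN 55296 AND 56319
                SET @i = @i + 2;           -- skip the low surrogate of the pair
            ELSE
                SET @i = @i + 1;
            SET @count = @count + 1;
        END;

        RETURN @count;
    END;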

Tinstone answered 30/4, 2009 at 3:50 Comment(13)
You have misunderstood the question. UTF-16 allows for supplementary characters. This works by storing a single character (from the user's perspective) in 2 code units, i.e. 4 bytes. UCS-2 does not handle supplementary characters. Hence the 4 bytes are treated as two characters by SQL Server when in fact they are one character.Milestone
That's only for characters outside the standard defined languages. The whitepaper states this is primarily for historical languages.Tinstone
Comment on the edit: SQL Server works fine on UCS-2 Unicode data. UCS-2 is a deprecated standard; Windows has used UTF-16 internally since Win2K.Milestone
Sure. But to offer Unicode 3.1 support, the full character set should be supported.Milestone
From the whitepaper: In the Unicode standard, there are 16 planes of characters, with the potential to define as many as 1,114,112 characters. Plane 0, or the Basic Multilingual Plane (BMP), can represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, all level-100 Zapf Dingbats, and punctuation marks.Tinstone
Yes, but it does not support the full unicode character set.Milestone
I suspect the reason it sticks with UCS-2 rather than UTF-16 is that UCS-2 limits itself to two bytes in length, but is otherwise identical to UTF-16. This gives UCS-2 a high degree of compatibility with UTF-16, while also offering size consistency that makes the maximum sizes of char(8000 bytes) and nchar(4000 bytes) easier to enforce. Despite any justifications for sticking with UCS-2 over UTF-16, it indeed does NOT support surrogate pairs and therefore does not support the full Unicode character set, and that really really sucks.Twaddle
CLR integration will not fix this. If, in fact, a .NET string stores UTF-16 data and SQL server stores UCS-2 data, then the types are ultimately not fully compatible. In other words, if a UTF-16 string (with 4-byte characters) goes into SQL Server and comes back out unscathed, then the decoding process must be incorrect or overly complex and inconsistent. The only legitimate work-arounds are (1) stripping UTF-16 strings of incompatible characters or (2) reading and writing the string's original bytes and processing it as a string only outside of SQL Server.Twaddle
I want to add my voice to the comments: this answer is wrong and misleading. SQL Server only supports two-byte characters. UTF-16 has some four-byte characters.Attainture
@Twaddle Do not confuse "storage" with "interpretation". The storage of UCS-2 and UTF-16 are identical since everything is in 2-byte blocks, and Supplementary Characters just happen to be two of those 2-byte blocks in a specific combination. Hence, SQL Server and .NET both store UTF-16 code points. So there is nothing wrong with the decoding process and it is not overly complex or inconsistent. Stripping out surrogate pairs would be needless data loss. And regarding proper handling of built-in string functions, Combining Characters also have "issues" ;-).Brendanbrenden
@ConcreteGannet With respect to only EDIT 2 (which at this point should be the only text of the answer, or at least everything prior should be wrapped in a <del> tag), this answer is not incorrect or misleading. In fact, it is your comment that is incorrect. SQL Server does support the UTF-16 encoding, it is just limited. And UTF-16 contains about as many 4-byte characters as 2-byte characters. Over time that will change as there are less than 63,000 addressable two-byte characters and over 1 million addressable four-byte characters (approx 60k are currently mapped).Brendanbrenden
@srutzky, yes, you could store and retrieve UTF-16 characters, but to me "support" should mean the character string functions in SQL Server interpret all of UTF-16 correctly too. That was improved in SQL Server 2012.Attainture
@ConcreteGannet "Support" is a spectrum. There is support for UTF-16 in non-_SC collations, it's just very limited. BUT, none of the collations "properly" handle combining characters that are valid UCS-2 / BMP Code Points. For example: DECLARE @Test NVARCHAR(10); SET @Test = N'te' + NCHAR(0x0301) + N'st'; SELECT NCHAR(55357)+NCHAR(56960) AS [WorksInAnyCollation], NCHAR(128640) AS [OnlyWorksIn_SC_Collations], @Test AS [TestValue], LEN(@Test) AS [Length], RIGHT(@Test, 3) AS [Oops]; The two NCHARs get you a proper Supplementary Character. And the question isn't asking about ideal support.Brendanbrenden

Something to add that I just learned the hard way:

If you use an "N" field in Oracle (I'm running 9i) and access it via the .NET OracleClient, it seems that only parameterized SQL will work... the N'string' Unicode prefix doesn't seem to do the trick if you have some inline SQL.

And by "work", I mean: it will lose any characters not supported by the base charset. So in my instance, English chars work fine, but Cyrillic turns into question marks/garbage.

This is a fuller discussion on the subject: http://forums.oracle.com/forums/thread.jspa?threadID=376847

I wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.

Grimaldi answered 6/2, 2010 at 0:44 Comment(3)
Hi boomhauer, the question was about Microsoft SQL Server. Your answer may be useful somewhere else.Attainture
Wow... something happened here. Did I post to the wrong question? I almost wonder if SO screwed this up, since it's been around since Feb 2010...Grimaldi
In fact, I KNOW this answer used to be on another question!Grimaldi
