UCS-2 and SQL Server

While researching options for storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

However, Microsoft chose UCS-2 for reasons I don't fully understand, which is causing me to second-guess that leaning. The documentation for SQL Server 2012 does show how to create a UTF-8 UDT, but the decision for UCS-2 presumably pervades SQL Server.

Wikipedia (which interestingly notes that UCS-2 is obsolete in favor of UTF-16) notes that UTF-8 is a variable-width character set capable of encoding any Unicode code point and that it provides the de facto standard encoding for interchange of Unicode text. So, it feels like any Unicode character can be represented in UTF-8, and since most text will be English, the representation will be nearly twice as compact as with UCS-2. (I know disk is "cheap", but disk cache isn't, and neither is memory relative to the data sizes I'm dealing with; many operations degrade dramatically once the working set is larger than available RAM.)

What problems might I encounter by swimming up the UCS-2 stream?

Quartile answered 25/1, 2012 at 18:22 Comment(9)
SQL Server 2012 on Windows Server 2012 will support up to 4TB of RAM. Generally it is cheaper to buy RAM than to employ someone to improve the code. This is just a suggestion.Sideways
I asked the question while deciding on the architecture for a new system. That's the ideal time to ask such questions. Whether RAM or coding effort is cheaper depends on the scale of your problem. If Google could employ a single programmer for a few weeks to cut their disk and RAM needs in half, that would be a wise investment.Quartile
OK good point :-) I am not working on such large systems. For me such questions are easy. :-)Sideways
@Sideways Be careful of approaching problems by simply looking at hardware costs relative to developer costs. Far too often adding some amount of RAM and/or disk will only fix one part of the system, but the impact of some of these decisions touches many other parts that people often don't think about (backups, cost of SAN storage, developer time spent debugging and supporting issues when the "more RAM" fix is outgrown, etc). Please see my answer here: What are the main performance differences between varchar and nvarchar SQL Server data types?.Grizel
@srutzky This is not a hard question. As you pointed out, SQL compression removes size as an issue, which is why I upvoted that answer. But even without that: "Sometimes not English" means it must be ntext if you wish to search or index the field. (If you don't, it is just a blob that can be stored however you like.) And more RAM helps everything, so that is also not a hard question. Every other solution is more costly to implement and a maintenance headache. If your motivation is the cost of RAM then look very, very hard at those costs. One week's wages buys you 64 to 128GB of RAM.Sideways
@Sideways I am not saying to not get more RAM ;-). But it seems like you are either assuming that someone is already on Enterprise Edition, or discounting the cost of it. Data Compression (including Unicode compression) is only available in Enterprise Edition, and that is a hefty price tag for someone who is on Standard, especially if they have a lot of cores (way more than I make in a week ;-). Also, even with Enterprise, Unicode compression doesn't work on NVARCHAR(MAX), and ROW / PAGE compression doesn't always help, or can't be used due to a high frequency of updates on the table or something.Grizel
@srutzky, sure, but speed and space are the issue, and using space or saving space costs money one way or another. Six months' wages will buy you a maxed-out Dell 700 series server with 192GB RAM, a SQL Standard 4-core licence (which will use 128GB RAM) and 2TB of SSD. Then you can get on with the rest of your job instead of working on performance issues all the time. :-) .... Anyway, I don't think we really disagree, just differ in emphasis :-)Sideways
Keep in mind, not every application deploys to a single server. When you're talking about enterprise-grade SANs for big data, developer hours can start to look cheap...Quartile
@Sideways and Eric: please see the UPDATE at the bottom of my answer. New info as of SQL Server 2019. Also, I updated the note about Data Compression since it became available in all editions as of SQL Server 2016, SP1.Grizel

storing mostly-English-but-sometimes-not data in a SQL Server database that can potentially be quite large, I'm leaning toward storing most string data as UTF-8 encoded.

Unlike some other RDBMSs that allow for choosing an encoding, SQL Server stores Unicode data only in UTF-16 (Little Endian), and non-Unicode data in an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC) for whatever Code Page is implied by the Collation of the field.
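As a minimal illustration of those two storage models (the table and column names here are hypothetical, not from the question):

    -- VARCHAR stores one byte per character using the collation's code page (1252 here);
    -- NVARCHAR always stores UTF-16 Little Endian, i.e. two bytes per BMP character.
    CREATE TABLE dbo.EncodingDemo
    (
        EightBitText VARCHAR(20) COLLATE Latin1_General_100_CI_AS,
        Utf16Text    NVARCHAR(20)
    );

    INSERT INTO dbo.EncodingDemo (EightBitText, Utf16Text)
    VALUES ('hello', N'hello');

    SELECT DATALENGTH(EightBitText) AS VarcharBytes,   -- 5
           DATALENGTH(Utf16Text)    AS NvarcharBytes   -- 10
    FROM dbo.EncodingDemo;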

Microsoft chose UCS-2 for reasons that I don't fully understand

Their decision to choose UCS-2 makes sense enough given that UTF-16 wasn't introduced until mid-1996 and wasn't fully specified until 2000. A lot of other systems use (or used) it as well (please see: https://en.wikipedia.org/wiki/UTF-16#Usage). Their decision to continue with it might be more questionable, though it is probably due to Windows and .NET being UTF-16. The physical layout of the bytes is the same between UCS-2 and UTF-16, so upgrading systems from UCS-2 to support UTF-16 should be purely functional, with no need to alter any existing data.

The documentation for SQL Server 2012 does show how to create a UTF-8 UDT,

Um, no. Creating a custom User-Defined Type via SQLCLR is not, in any way, going to get you a replacement of any native type. It is very handy for creating something to handle specialized data. But strings, even of a different encoding, are far from specialized. Going this route for your string data would destroy much of the usability of your system, not to mention performance, as you wouldn't be able to use any built-in string functions. If you were able to save anything on disk space, those gains would be erased by what you would lose in overall performance. Storing a UDT is done by serializing it to a VARBINARY. So in order to do any string comparison or sorting, outside of a "binary" / "ordinal" comparison, you would have to convert all of the other values, one by one, back to UTF-8 to then do a string compare that can account for linguistic differences.
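To illustrate that last point (a hypothetical snippet, not part of the linked sample): SQL Server has no idea that bytes in a VARBINARY are UTF-8, so even a plain CONVERT reinterprets them as UTF-16 LE and produces garbage rather than readable text:

    -- The UTF-8 (and ASCII) bytes for 'Ab' are 0x4162. CONVERT treats VARBINARY as
    -- UTF-16 LE, so those two bytes come back as the single code unit 0x6241
    -- (a CJK character), not as N'Ab'.
    DECLARE @utf8Bytes VARBINARY(10) = 0x4162;
    SELECT CONVERT(NVARCHAR(10), @utf8Bytes) AS Misinterpreted;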

Also, that "documentation" is really just sample code / proof of concept stuff. The code was written in 2003 ( http://msftengprodsamples.codeplex.com/SourceControl/latest#Kilimanjaro_Trunk/Programmability/CLR/UTF8String/CS/UTF8String/Utf8String.cs ) for SQL Server 2005. I saw a script to test functionality, but nothing involving performance.

but the decision for UCS-2 presumably pervades SQL Server.

Yes, very much so. By default, the handling of the built-in functions is only for UCS-2. But starting in SQL Server 2012, you can get them to handle the full UTF-16 character set (well, as of Unicode Version 5 or 6, depending on your OS and version of the .NET Framework) by using one of the collations that has a name ending in _SC (i.e. Supplementary Characters).
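A quick sketch of the difference (the supplementary character below is built from its raw surrogate-pair bytes so the example works regardless of the database's default collation):

    -- U+1F600 (grinning face) as its UTF-16 LE surrogate pair: D83D DE00 -> bytes 3D D8 00 DE
    DECLARE @emoji NVARCHAR(10) = CONVERT(NVARCHAR(10), 0x3DD800DE);

    SELECT LEN(@emoji COLLATE Latin1_General_100_CI_AS)    AS WithoutSC, -- 2: seen as two code units
           LEN(@emoji COLLATE Latin1_General_100_CI_AS_SC) AS WithSC;    -- 1: seen as one character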

Wikipedia ... notes that UCS-2 is obsolete in favor of UTF-16

Correct. UTF-16 and UCS-2 both use 2-byte code units. But UTF-16 uses some of them in pairs (i.e. Surrogate Pairs) to map additional characters. The code points reserved for these pairs are not mapped to any usable symbols in UCS-2, so they pass through untouched. This is why you can store any Unicode character in SQL Server, and it will be stored and retrieved correctly.

Wikipedia ... notes that UTF-8 is a variable-width character set capable of encoding any Unicode code point

Correct, though misleading. Yes, UTF-8 is variable-width, but UTF-16 is also (mildly) variable-width, since all of the Supplementary Characters are composed of two 2-byte code units (a Surrogate Pair). Hence UTF-16 uses either 2 or 4 bytes per character, whereas UCS-2 is always 2 bytes. But that is not the misleading part. What is misleading is the implication that any other Unicode encoding isn't capable of encoding all of the code points. While UCS-2 can hold them but not interpret them, UTF-16 and UTF-32 both map all Unicode code points, just like UTF-8.
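A small sketch of the storage consequence (again building the supplementary character from its raw bytes):

    DECLARE @bmp  NVARCHAR(10) = N'A';                              -- a BMP character
    DECLARE @supp NVARCHAR(10) = CONVERT(NVARCHAR(10), 0x3DD800DE); -- U+1F600, a Supplementary Character

    SELECT DATALENGTH(@bmp)  AS BmpBytes,            -- 2
           DATALENGTH(@supp) AS SupplementaryBytes;  -- 4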

and that it [ed: UTF-8] provides the de facto standard encoding for interchange of Unicode text.

This may be true, but it is entirely irrelevant from an operational perspective.

it feels like any Unicode character can be represented in UTF-8

Again, true, but entirely irrelevant since UTF-16 and UTF-32 also map all Unicode code points.

since most text will be English, the representation will be nearly twice as compact as with UCS-2

Depending on circumstances this could very well be true, and you are correct to be concerned about such wasteful usage. However, as I mentioned in the question that led to this one ( UTF-8 Support, SQL Server 2012 and the UTF8String UDT ), you have a few options to mitigate the amount of space wasted if most rows can fit into VARCHAR yet some need to be NVARCHAR. The best option is to enable ROW COMPRESSION or PAGE COMPRESSION (Enterprise Edition only!). Starting in SQL Server 2008 R2, they allow non-MAX NVARCHAR fields to use the "Standard Compression Scheme for Unicode", which is at least as good as UTF-8 and in some cases even better. NVARCHAR(MAX) fields cannot use this fancy compression, but their IN ROW data can benefit from regular ROW and/or PAGE Compression. Please see the following for a description of this compression and a chart comparing data sizes for: raw UCS-2 / UTF-16, UTF-8, and UCS-2 / UTF-16 with data compression enabled.

SQL Server 2008 R2 - UCS2 compression what is it - Impact on SAP systems

Please also see the MSDN page for Data Compression for more details as there are some restrictions (beyond it being available only in Enterprise Edition -- BUT made available to all editions starting with SQL Server 2016, SP1 !!) and some circumstances when compression might make things worse.
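A sketch of enabling it (the table name dbo.Articles is hypothetical; always estimate the savings first, since they vary with the actual data):

    -- Estimate how much ROW compression (which implies Unicode compression
    -- for non-MAX NVARCHAR columns) would save:
    EXEC sys.sp_estimate_data_compression_savings
         @schema_name      = N'dbo',
         @object_name      = N'Articles',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = N'ROW';

    -- Then enable it:
    ALTER TABLE dbo.Articles REBUILD WITH (DATA_COMPRESSION = ROW);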

I know disk is "cheap"

The veracity of that statement depends on how one defines "disk". If you are speaking in terms of commodity parts that you can purchase off the shelf at a store for use in your desktop / laptop, then sure. But, if speaking in terms of enterprise-level storage that will be used for your Production systems, then have fun explaining to whomever controls the budget that they shouldn't reject the million-plus-dollar SAN that you want because it is "cheap" ;-).

What problems might I encounter by swimming up the UCS-2 stream?

None that I can think of. Well, as long as you don't follow any horrible advice to do something like implementing that UDT, or converting all of the strings to VARBINARY, or using NVARCHAR(MAX) for all string fields ;-). But of all of the things you could worry about, SQL Server using UCS-2 / UTF-16 shouldn't be one of them.

But, if for some reason this issue of no native support for UTF-8 is super important, then you might need to find another RDBMS to use that does allow for UTF-8.


UPDATE 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.
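For reference, a minimal sketch of what the 2019 feature looks like (table and column names are hypothetical; requires SQL Server 2019 or later):

    -- A "_UTF8" collation makes this VARCHAR column store UTF-8
    -- instead of a single-byte code page.
    CREATE TABLE dbo.Utf8Demo
    (
        Comment VARCHAR(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    );

    INSERT INTO dbo.Utf8Demo (Comment)
    VALUES (N'mostly English, occasionally не-English');

    SELECT DATALENGTH(Comment) AS ByteCount,  -- ASCII characters take 1 byte each, Cyrillic takes 2
           LEN(Comment)        AS CharCount
    FROM dbo.Utf8Demo;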

Grizel answered 30/9, 2015 at 6:50 Comment(2)
I just did some testing regarding exactly what was covered here. Would love if you could take a look, since it's about UCS-2 limitations on nvarchar fields: #63716260Hives
Wow, that's why SO is so vital and useful! I never realised before that SQL Server's NVARCHAR is actually UCS-2, not a full-blown UTF-16 encoding. Never stop learning, they say. Thank you @solomon-rutzky!Pictograph

What do you mean by "swimming up the UCS-2 stream"?

Here are your options:

Not recommended but possible:

  • Implement a UDT. This will be a lot of work, and you will lose tooling support (O/R mapping and certainly some SQL Server features that work on native types).
  • Use varbinary(max): Requires you to write custom conversion code. No range indexing.
  • Use nvarchar(N) and turn on row compression. Starting with SQL Server 2008 R2 this will use an encoding that is as compact as UTF-8. But this requires Enterprise Edition.

See the comments to read about the severe drawbacks that these approaches have.

Sparrowgrass answered 25/1, 2012 at 20:11 Comment(12)
By "swimming up stream", I mean fighting the design decision that MS has made to encode strings in UCS-2. Why do you recommend the (max) versions vs the (n) versions of column definitions? Isn't the (max) potentially slower? rusanu.com/2010/03/22/…Quartile
The different versions are equally fast provided that you put the same strings in them. If you put longer strings in the longer columns, you get, of course, worse performance. Internally, the length is implemented as an int16 that prefixes every such column. I know of no case in which changing the max count changes performance.Sparrowgrass
My understanding is that the data for (MAX) variants is stored off of the row data, potentially causing another disk seek. I think that's the root of the performance numbers from the link I shared. Do you have a reference that supports performance being the same?Quartile
I have no reference at hand but why would a 1 byte string be stored off-row? Of course it will be stored off-row when it does not fit in-row.Sparrowgrass
@EricJ. and usr: this is truly horribly horrible advice. I rarely say anything that negative, but these "options" will destroy a system to varying degrees. The UDT, even with some cautions mentioned, would render a system unusable. There is no evidence that it would work in reality, especially at scale. Using VARBINARY(MAX) is more complicated code, a performance hit, all conversions need to be done at the app layer, indexes can only support binary comparisons, and no string operations at all! NVARCHAR(MAX) for strings IS slower than non-MAX, and it disallows using Unicode Compression.Grizel
@srutzky I just read your answer. I did not know about the _SC collations. This answer was before the 2012 release as it seems. I have pointed to your answer from mine. Hopefully, the accept mark will be adjusted as well.Sparrowgrass
Hey there. Thanks. I do get that this answer was posted prior to SQL Server 2012 going RTM, and wouldn't expect many people to know of the new _SC collations at that point (maybe not even now? ;-). But, just to be clear about those collations, they do not help with the main concern of the Question, which is about the size on disk and in memory of strings when some need Code Points above 255 but most do not. The new _SC collations don't reduce the storage, just the interpretation of what is stored (i.e. proper handling and weighting of Supplementary Characters).Grizel
And again, I am really not wanting to be harsh, I am just concerned about expectations that some readers will have upon considering these options. For example, storing in VARBINARY renders that data completely unintelligible to SQL Server in all ways. UTF-8 and UCS-2/UTF-16 are different mappings to the Code Points, so you couldn't convert to NVARCHAR to even see the value for debugging purposes. It would be the same if you just gzipped the values. And NVARCHAR(MAX) can't use Unicode Compression, which is your stated reason for suggesting NVARCHAR(MAX) ;-). Only NVARCHAR(1 - 4000) can.Grizel
@srutzky I don't recommend these options at all. I'm just pointing out that they are a valid way to solve the problem (and they are! even though the drawbacks are tough). I have added more warnings. If someone implements one of those now I can't be held responsible :)Sparrowgrass
Thanks for adding in the warnings. I think the only thing left to say is that again, regarding the 4th bullet, that one is not a valid way to solve the problem since using MAX invalidates the ability to use that special compression added in 2008 R2 (the compression that can do just as well or better than UTF-8). So that one is technically wrong, regardless of whether or not someone should try it.Grizel
@srutzky: If I recall correctly compression is only available for the Enterprise edition, which is far more expensive than other editions. It may still be viable for some very specific needs, though your new answer is better.Quartile
@EricJ. Yes, DATA COMPRESSION (row, page, and Unicode--which is implicit) is only available in Enterprise Edition. And to be fair, this answer did state that from the beginning.Grizel
