What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?

It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar fields. Meanwhile, C# uses Unicode UTF-16 encoding for its strings (note: Some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the Unicode subset 0-0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports in terms of character strings.)

While UCS-2 encodes the same basic code points as UTF-16 in the Basic Multilingual Plane (BMP), it doesn't reserve certain bit patterns that UTF-16 does to allow for surrogate pairs.

If I write a C# string to an SQL Server nvarchar (UCS-2) field and read it back, will this always return the same result?

It seems that while UTF-16 is a superset of UCS-2 in the sense that UTF-16 encodes more code points (e.g. above 0xFFFF), it's actually a subset of UCS-2 at the level of individual 2-byte values, since it reserves the surrogate ranges and is therefore more restrictive about what a lone 2-byte value may contain.

To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented by surrogate pairs of char values), these would be stored and retrieved just fine in the database, but if I tried to manipulate them in the database (e.g. perhaps calling UPPER, or attempting to blank out every other character), then I could run into some problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar strings as UTF-16.
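
For concreteness, here is a small C# sketch (standard .NET APIs only, nothing SQL Server specific) showing how a single code point above 0xFFFF occupies two char values in a string:

using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF lies above 0xFFFF, so a C# string
        // stores it as two UTF-16 code units (a surrogate pair).
        string clef = char.ConvertFromUtf32(0x1D11E);

        Console.WriteLine(clef.Length);                   // 2 code units, one code point
        Console.WriteLine(char.IsHighSurrogate(clef[0])); // True (0xD834)
        Console.WriteLine(char.IsLowSurrogate(clef[1]));  // True (0xDD1E)
        Console.WriteLine(char.ConvertToUtf32(clef, 0).ToString("X")); // 1D11E
    }
}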

Satirical answered 13/4, 2011 at 20:36 Comment(0)

It's all a bit of a fudge really.

First the similarities

  • The SQL Server nchar/nvarchar/ntext data types store text as a string of 2-byte characters. It doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).
  • The CLR String data type also stores text as a string of 2-byte Chars. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).

Now the differences

  • .NET allows you to access the actual Unicode code points in a CLR string via the StringInfo class (see the sketch just after this list).
  • .NET has extensive support for encoding and decoding text data in a variety of encodings. When it decodes an arbitrary byte stream to a String, the result is always held internally as UTF-16 (with full support for the supplementary planes).
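
As a quick illustration of that first point, here is a minimal sketch (reusing the Deseret letter U+10413 that appears later on this page) contrasting the raw UTF-16 code-unit length with the code-point-aware view StringInfo provides:

using System;
using System.Globalization;

class StringInfoDemo
{
    static void Main()
    {
        // "TEST" plus one Deseret capital letter (U+10413, a surrogate pair).
        string s = "TEST\U00010413";

        Console.WriteLine(s.Length);                               // 6 UTF-16 code units
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 5 text elements

        // Walk the string one text element at a time.
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            Console.WriteLine((string)e.Current);
    }
}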

In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.
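
If you want to verify that round trip yourself, a sketch along these lines will do it (it assumes a local server with integrated security, like the test program further down this page):

using System;
using System.Data;
using System.Data.SqlClient;

class RoundTripDemo
{
    static void Main()
    {
        // A string containing a supplementary-plane code point (surrogate pair).
        string original = "TEST\U00010413";

        using (SqlConnection conn = new SqlConnection(
            "Data Source=(local);Integrated Security=true"))
        {
            conn.Open();
            using (SqlCommand comm = new SqlCommand("SELECT @s AS Echoed", conn))
            {
                comm.Parameters.Add("@s", SqlDbType.NVarChar, 100).Value = original;
                string echoed = (string)comm.ExecuteScalar();

                // Ordinal comparison checks the UTF-16 code units one for one.
                Console.WriteLine(string.Equals(original, echoed,
                    StringComparison.Ordinal)); // True
            }
        }
    }
}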

Fante answered 13/4, 2011 at 20:48 Comment(4)
Ok, so reading/writing a string as a whole entity to an nvarchar field will not cause problems or loss of information, even if it contains what would be interpreted as surrogate pairs. Now, what about writing a C# string to a char column? I suspect that WOULD involve some interpretation and conversion and would cause data loss...Satirical
Single-byte columns have a non-Unicode collation sequence defined on them, which not only defines searching and sorting rules, but also the code page that defines which characters are allowed. Any Unicode code point that is mapped to a value in the column's code page will be preserved, and the rest will be discarded.Fante
Discarded... or replaced with a particular dummy or "non-character" byte? Do single-byte code pages reserve a certain byte for non-characters? I've seen some examples showing that Unicode characters not defined in the target code page are replaced with a question mark, but perhaps that's just how non-characters are displayed?Satirical
@Satirical - To answer this question I modified the program in my answer by changing the parameter data type from nvarchar to varchar and SQL Server did, indeed, replace all the unmapped characters with U+003F QUESTION MARK.Glorious
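
To see this replacement behavior from the .NET side, here is a rough analogue (using .NET's explicit replacement fallback, which is not necessarily identical to SQL Server's own conversion code) of forcing Unicode text into a single-byte code page:

using System;
using System.Text;

class CodePageDemo
{
    static void Main()
    {
        // Windows-1252 with an explicit '?' replacement fallback, roughly
        // analogous to storing Unicode text in a varchar column with a
        // Latin1 collation. (On modern .NET, code page 1252 also requires
        // registering CodePagesEncodingProvider first.)
        Encoding latin1 = Encoding.GetEncoding(1252,
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("?"));

        string original = "A\U00010413B";
        byte[] bytes = latin1.GetBytes(original);
        string roundTripped = latin1.GetString(bytes);

        Console.WriteLine(roundTripped); // "A?B" - the Deseret letter became U+003F
    }
}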

I don't expect that treating the text as UCS-2 would cause many problems.

Case conversions should not be a problem, because (AFAIK) there are no case mappings above the BMP (except the identity mapping, of course!), and, obviously, the surrogate characters are going to map to themselves.

Blanking every other character is just asking for trouble. In reality, doing these sorts of transformations without consideration of the character values is always a dangerous activity. I can see it happening legitimately with string truncations. But if any unmatched surrogates show up in the result, this itself is not a huge problem. Any system that receives such data—and cares—will probably just replace the unmatched surrogate with a replacement character, if it bothers to do anything about it at all.

Obviously, string length is going to be bytes/2 rather than number-of-characters, but number-of-characters is not a very useful value anyway, once you start plumbing the depths of the Unicode code charts. For example, you aren't going to get good results in monospaced display once you leave the ASCII range, because of combining characters, RTL languages, directional control characters, tags, and several kinds of space characters. The high code points are going to be the least of your problems.
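
For the truncation case, a defensive sketch like the following (my own illustration) avoids cutting a surrogate pair in half:

using System;

class SafeTruncate
{
    // Truncates to at most maxLength UTF-16 code units without splitting
    // a surrogate pair.
    static string Truncate(string s, int maxLength)
    {
        if (s.Length <= maxLength)
            return s;

        // If the cut would land between a high and a low surrogate, back off by one.
        if (maxLength > 0 &&
            char.IsHighSurrogate(s[maxLength - 1]) &&
            char.IsLowSurrogate(s[maxLength]))
            maxLength--;

        return s.Substring(0, maxLength);
    }

    static void Main()
    {
        string s = "AB\U00010413";          // 'A', 'B', then one surrogate pair
        Console.WriteLine(Truncate(s, 3));  // "AB" - refuses to keep half a pair
        Console.WriteLine(Truncate(s, 4));  // "AB" plus the whole Deseret letter
    }
}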

Just to be on the safe side, you should probably store your cuneiform texts in a different column than the archeologist's names. :D

UPDATE now with empirical data!

I just ran a test to see what happens with case transformations. I created a string with the English word TEST in uppercase twice—first in Latin script, then in Deseret script. I applied a lower-case transformation to this string in .NET and in SQL Server.

The .NET version correctly lowercased all the letters in both scripts. The SQL Server version lowercased only the Latin characters and left the Deseret characters unchanged. This matches expectations for UTF-16 versus UCS-2 handling.

using System;
using System.Data.SqlClient;

class Program
{
    static void Main(string[] args)
    {
        // "TEST" in Latin capitals, followed by the same word in Deseret
        // capitals (U+10413 U+10407 U+1041D U+10413, each a surrogate pair).
        string myDeseretText = "TEST\U00010413\U00010407\U0001041D\U00010413";
        string dotNetLower = myDeseretText.ToLower();
        string dbLower = LowercaseInDb(myDeseretText);

        Console.WriteLine("  Original: {0}", DisplayUtf16CodeUnits(myDeseretText));
        Console.WriteLine(".NET Lower: {0}", DisplayUtf16CodeUnits(dotNetLower));
        Console.WriteLine("  DB Lower: {0}", DisplayUtf16CodeUnits(dbLower));
        Console.ReadLine();
    }

    // Round-trips the string through SQL Server's LOWER() via an nvarchar parameter.
    private static string LowercaseInDb(string value)
    {
        SqlConnectionStringBuilder connection = new SqlConnectionStringBuilder();
        connection.DataSource = "(local)";
        connection.IntegratedSecurity = true;
        using (SqlConnection conn = new SqlConnection(connection.ToString()))
        {
            conn.Open();
            string commandText = "SELECT LOWER(@myString) as LoweredString";
            using (SqlCommand comm = new SqlCommand(commandText, conn))
            {
                comm.CommandType = System.Data.CommandType.Text;
                comm.Parameters.Add("@myString", System.Data.SqlDbType.NVarChar, 100);
                comm.Parameters["@myString"].Value = value;
                using (SqlDataReader reader = comm.ExecuteReader())
                {
                    reader.Read();
                    return (string)reader["LoweredString"];
                }
            }
        }
    }

    // Prints each UTF-16 code unit of the string as a 4-digit hex value.
    private static string DisplayUtf16CodeUnits(string value)
    {
        System.Text.StringBuilder sb = new System.Text.StringBuilder();

        foreach (char c in value)
            sb.AppendFormat("{0:X4} ", (int)c);
        return sb.ToString();
    }
}

Output:

  Original: 0054 0045 0053 0054 D801 DC13 D801 DC07 D801 DC1D D801 DC13
.NET Lower: 0074 0065 0073 0074 D801 DC3B D801 DC2F D801 DC45 D801 DC3B
  DB Lower: 0074 0065 0073 0074 D801 DC13 D801 DC07 D801 DC1D D801 DC13

Just in case anyone has a Deseret font installed, here are the actual strings for your enjoyment:

  Original: TEST𐐓𐐇𐐝𐐓
.NET Lower: test𐐻𐐯𐑅𐐻
  DB Lower: test𐐓𐐇𐐝𐐓
Glorious answered 13/4, 2011 at 20:49 Comment(7)
Thanks for the response. I disagree that case conversions wouldn't be a problem. For example, calling UPPER on a string in the database would produce a different byte sequence than calling ToUpper on a string in C#, precisely because if a surrogate pair is present, the TSQL UPPER will upper-case each 2-byte unit of the pair individually (so the second 2-byte unit would fall in the BMP 0-0xFFFF range and potentially be uppercased), whereas the CLR String.ToUpper would probably take the surrogate pair into consideration and produce a new pair representing the upper-case letter.Satirical
I could probably ask an entirely different question such as "What string transforms are surrogate neutral?". Changing case, finding character length, comparing/sorting the string, reversing it, etc. would probably not be surrogate neutral, but what about trimming? I think perhaps there are none, which is why I agree with your statement that "doing these sorts of transformations without consideration of the character values is always a dangerous activity".Satirical
@Satirical - The surrogate code points are allocated specifically so that they would be transparent in UCS-2. Attempting to uppercase either a leading surrogate or a trailing surrogate will always map back to the original character, because there are no case conversion defined for those code points. If we assume that there are case conversions defined in the high planes (which I doubt), then the CLR and TSQL will perform the conversion differently, but neither operation will produce junk data (since TSQL will leave those characters unchanged). ...Glorious
The only problems this would cause would be with ordinal comparisons of strings that were independently transformed in a culture-sensitive way in .NET and TSQL. But these problems are of the same extent and variety as those that are always present when doing ordinal string comparisons, which is to say, such comparisons should never be done in the first place on this sort of data.Glorious
Ok, so your logic is... The algorithmic transform of a code point to a high/low surrogate produces a code point in the range 0xD800 to 0xDFFF, which never represents a character in Plane 0. Since UCS-2 is like UTF-16, such a code point in UCS-2 would also NOT represent a character in Plane 0. Therefore, as long as no case-conversions are defined above plane zero, a high/low-surrogate (in the range 0xD800 to 0xDFFF) would never be affected by a case transform.Satirical
@Satirical - Yes, that is my logic. I just checked the Unicode database CaseFolding.txt file, and it turns out that there are case conversions in the range U+10400–U+1044F for the Deseret script. So if your database contains any Deseret script, then any case transformations will not be performed on that data. I think I will test that later today to see what would happen. I know that the SQL Server team likes to do things their own way, and they have collation code separate from Windows, but it would be pretty strange if they didn't use the Windows API for casing. I shall see.Glorious
@Satirical - I have updated my answer with the results of the test. It is interesting. The SQL Server team did roll their own casing code after all. It's UCS-2 through and through.Glorious
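
The claim in these comments, that case functions leave lone surrogate code units untouched, is easy to check from the .NET side with a small sketch like this:

using System;

class SurrogateCaseDemo
{
    static void Main()
    {
        // The two halves of the pair for Deseret capital U+10413, taken one
        // char at a time; lone surrogates have no case mappings of their own.
        char high = '\uD801';
        char low = '\uDC13';

        Console.WriteLine(char.ToUpper(high) == high); // True - maps to itself
        Console.WriteLine(char.ToUpper(low) == low);   // True - maps to itself

        // Only when the pair is processed as one code point does casing apply.
        string pair = new string(new[] { high, low });     // U+10413
        Console.WriteLine(pair.ToLower() == "\U0001043B"); // True - the lowercase form
    }
}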
