SQL Server: set character set (not collation)

Asked 15/10, 2011 at 22:35 Answered 3/2, 2017 at 18:31

Solved sql-server character-encoding collation character-set

How does one set the default character set for fields when creating tables in SQL Server? In MySQL one does this:

CREATE TABLE tableName (
    name VARCHAR(128) CHARACTER SET utf8
) DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Note that I set the character set twice here. It is redundant, I added both ways just to demonstrate.

I set the collation also to demonstrate that collation is something different. I am not asking about setting the collation. Most questions asking about character sets and encodings in SQL Server are answered with collation, which is not the same thing.

Taken answered 15/10, 2011 at 22:35 Comment(3)

They are the same thing in SQL Server. By setting the collation on a varchar column you also set the code page. – Dx 15/10, 2011 at 23:8

Thank you Martin. Where is that documented? Of course I went through the fine manual (MSDN online) but I see no mention of it. – Taken 16/10, 2011 at 9:12

Collations control the physical storage of character strings in SQL Server. A collation specifies [ both ] the bit patterns that represent each character and the rules by which characters are sorted and compared. Link – Dx 16/10, 2011 at 9:29

As stated in BOL

Each SQL Server collation specifies three properties:

The sort order to use for Unicode data types (nchar, nvarchar, and ntext). A sort order defines the sequence in which characters are sorted, and the way characters are evaluated in comparison operations.

The sort order to use for non-Unicode character data types (char, varchar, and text).

The code page used to store non-Unicode character data.

The quote above is from 2000 docs. See also this 2008 link. The below also demonstrates this.

DECLARE @T TABLE 
(
     code TINYINT PRIMARY KEY,
     Arabic_CS_AS CHAR(1) COLLATE Arabic_CS_AS NULL,
     Cyrillic_General_CS_AS CHAR(1) COLLATE Cyrillic_General_CS_AS NULL,
     Latin1_General_CS_AS CHAR(1) COLLATE Latin1_General_CS_AS NULL
);

INSERT INTO @T(code) VALUES (200),(201),(202),(203),(204),(205)

UPDATE @T 
  SET Arabic_CS_AS=CAST(code AS BINARY(1)),
      Cyrillic_General_CS_AS=CAST(code AS BINARY(1)),
      Latin1_General_CS_AS=CAST(code AS BINARY(1))

SELECT * 
FROM @T

Results

code Arabic_CS_AS Cyrillic_General_CS_AS Latin1_General_CS_AS
---- ------------ ---------------------- --------------------
200  ب            И                      È
201  ة            Й                      É
202  ت            К                      Ê
203  ث            Л                      Ë
204  ج            М                      Ì
205  ح            Н                      Í

Dx answered 16/10, 2011 at 10:6 Comment(5)

Thank you Martin. It is unfortunate that they chose the misleading/incomplete term "collation" which clearly refers to sort order: collate definition. It also seems that one cannot thus use a custom collation (I have an unrelated PHP/MySQL app with a custom collation) with this setup. By the way, I love the elegant example! – Taken 16/10, 2011 at 11:37

@Taken - You can use an explicit collate clause to use different comparison semantics but you can't define your own collation rules. – Dx 16/10, 2011 at 12:16

@Martin Smith Your answer is grate.... the all issue depends from the moment of Data Base creation... it is very important to select the proper collation.. – Keats 15/11, 2014 at 1:38

@Taken and Martin: please see my answer for additional details regarding character sets and encodings :-). +1 for showing the variations across 3 code pages / collations. – Phan 3/2, 2017 at 19:24

@Taken and Martin: please see the UPDATE at the bottom of my answer. There is new info as of SQL Server 2019. – Phan 2/10, 2018 at 16:5

To expand on @Martin's answer:

How you set a "character set" in SQL Server depends on the datatype that you are using. If you are using:

NVARCHAR, NCHAR, and NTEXT (NTEXT is deprecated and shouldn't be used as of SQL Server 2005) all use the Unicode character set and this cannot be changed. These datatypes are all encoded as UTF-16 LE (Little Endian) – a 16-bit encoding with each "character" being either 2 or 4 bytes – and this too cannot be changed. For these datatypes, the Collation being used only affects the locale (as determined by the LCID of the Collation) which determines the set of rules used for sorting and comparison.
XML, like the N-prefixed types, uses the Unicode character set and is encoded as UTF-16 LE (Little Endian), and neither of those can be changed. But unlike the other string datatypes, there is no Collation associated with XML data as it cannot be sorted or compared (at least not without first converting it to NVARCHAR(MAX) [preferred] or VARCHAR(MAX) ).
VARCHAR, CHAR, and TEXT (TEXT is deprecated and shouldn't be used as of SQL Server 2005) are all 8-bit encodings with each "character" being either 1 or 2 bytes. The character set is determined by the Code Page associated with each Collation. The sorting and comparison rules depend on the type of Collation being used:
- SQL Server Collations: These all have names starting with SQL_ and have been deprecated since SQL Server 2000, though are (unfortunately) still in wide use today. These use simple rules indicated as the "SQL Server Sort Order" number as found in the description field returned by sys.fn_helpcollations().
- Windows Collations: These all have names that do not start with SQL_. These Collations allow the non-Unicode string data to use the Unicode sorting and comparison rules indicated by the LCID of the Collation.

That being said, to find out which character set (for CHAR, VARCHAR, and TEXT – i.e. non-Unicode – data) is being used, run the following query and pay close attention to the CodePage field. The LCID field indicates the locale used for sorting and comparison rules for the N-prefixed – i.e. Unicode – types as well as the non-Unicode types if using a Windows Collation:

SELECT *,
       COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
       COLLATIONPROPERTY(col.[name], 'LCID') AS [LCID]
FROM   sys.fn_helpcollations() col
ORDER BY col.[name];

The Code Page IDs can be translated into something more meaningful via the MSDN page for Code Page Identifiers.

Regarding the O.P.'s comment on @Martin's answer:

It is unfortunate that they chose the misleading/incomplete term "collation" which clearly refers to sort order: collate definition.

While it is true that Microsoft could have done better when choosing a name, there is unfortunately a general, industry-wide confusion over terms such as "encoding", "character set", "collation", etc. Microsoft's use (or misuse) of "Collation" has merely contributed to the mass confusion. But, that confusion is also evident in MySQL as shown in this question, given that "utf8" is specifically not a character set ;-).

UTF-8 is one of several encodings for the Unicode character set. UTF-16 and UTF-32 are the other two encodings. All three of those encodings represent the exact same Unicode character set, just in different ways. Looking at the list of MySQL character sets – 11.1.10 Supported Character Sets and Collations – the "ucs2", "utf8", "utf8mb4", "utf16", "utf16le", "utf32" charsets are not actually character sets, per se, but various representations of the Unicode character set. But, given the overlap between the concepts of "character set" and "encoding", it would be difficult to not have this confusion. The 11.1.10.1 Unicode Character Sets page indicates that the "utf8mb4", "utf16", "utf16le", and "utf32" charsets are the full Unicode character sets while "ucs2" and "utf8" are subsets of the Unicode character set, specifically the first 65,536 code points (a.k.a. Basic Multilingual Plane (BMP)).

For more info regarding Collation across various RDBMS's, please see my answer to the following question on DBA.StackExchange:

Does any DBMS have a collation that is both case-sensitive and accent-insensitive?

UPDATE 2018-10-02

While this is not a viable option yet, SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes. There are currently too many bugs with it for it to be used, but if they are fixed, then this is an option for some scenarios. Please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?", for a detailed analysis of this new feature.

Phan answered 3/2, 2017 at 18:31 Comment(1)

I feel this is a better explanation then what was originally accepted. – Resonant 31/10, 2017 at 0:7

Recommended topics

Hot tags