What is the point of COLLATIONS for nvarchar (Unicode) columns?

Asked 18/3, 2012 at 7:35 Answered 30/9, 2015 at 19:20

Solved sql-server sql-server-2008 unicode collation nvarchar

IVe read a lot about this.

Still some questions :

Im not talking about case sensitive here...

If I have a char (ש for example) and he is stored in nvarchar - which can hold anything , Why would I need collation here ?
If I'm "FaceBook" and i need the ability to store all chars from all languages , What is the relationship between the collation and my nvarchar columns ?

Thanks in advance.

Introduction answered 18/3, 2012 at 7:35 Comment(3)

A collation doesn't [directly] say what can be stored. A collation determines ordering and equality. For instance, some collations are case-insensitive or accent-insensitive while others are not. – Kirit 18/3, 2012 at 7:41

@pst if i stored ש and i make 'order by' so he should order by its chars which includes ש. .....no ? – Introduction 18/3, 2012 at 7:44

@Kirit "A collation doesn't [directly] say what can be stored." isn't entirely true. It is only true for NVARCHAR / NCHAR fields. For VARCHAR and CHAR fields the Collation does indeed determine what can be stored there as it contains the LCID which determines the Code Page. – Compagnie 30/9, 2015 at 17:42

Storing and representing characters is one thing, and knowing how to sort and compare them is another.

Unicode data, stored in the XML and N-prefixed types in SQL Server, can represent all characters in all languages (for the most part, and that is its goal) with a single character set. So for NCHAR / NVARCHAR data (I am leaving out NTEXT as it shouldn't be used anymore, and XML as it is not affected by Collations), the Collations do not change what characters can be stored. For CHAR and VARCHAR data, the Collations do affect what can be stored as each Collation points to a particular Code Page, which determines what can be stored in values 128 - 255.

Now, while there is a default sort order for all characters, that cannot possibly work across all languages and cultures. There are many languages that share some / many / all characters, but have different rules for how to sort them. For example, the letter "C" comes before the letter "D" in most alphabets that use those letters. In US English, a combination of "C" and "H" (i.e. "CH" as two separate letters) would naturally come before any string starting with a "D". But, in a few languages, the two-letter combination of "CH" is special and sorts after "D":

IF (   N'CH' COLLATE Czech_CI_AI > N'D' COLLATE Czech_CI_AI
   AND N'C'  COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   AND N'CI' COLLATE Czech_CI_AI < N'D' COLLATE Czech_CI_AI
   ) PRINT 'Czech_CI_AI';

IF (   N'CH' COLLATE Czech_100_CI_AI > N'D' COLLATE Czech_100_CI_AI
   AND N'C'  COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   AND N'CI' COLLATE Czech_100_CI_AI < N'D' COLLATE Czech_100_CI_AI
   ) PRINT 'Czech_100_CI_AI';

IF (   N'CH' COLLATE Slovak_CI_AI > N'D' COLLATE Slovak_CI_AI
   AND N'C'  COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   AND N'CI' COLLATE Slovak_CI_AI < N'D' COLLATE Slovak_CI_AI
   ) PRINT 'Slovak_CI_AI';

IF (   N'CH' COLLATE Slovak_CS_AS > N'D' COLLATE Slovak_CS_AS
   AND N'C'  COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   AND N'CI' COLLATE Slovak_CS_AS < N'D' COLLATE Slovak_CS_AS
   ) PRINT 'Slovak_CS_AS';

IF (   N'CH' COLLATE Latin1_General_100_CI_AS > N'D' COLLATE Latin1_General_100_CI_AS
   AND N'C'  COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   AND N'CI' COLLATE Latin1_General_100_CI_AS < N'D' COLLATE Latin1_General_100_CI_AS
   ) PRINT 'Latin1_General_100_CI_AS'
ELSE PRINT 'Nope!';

Returns:

Czech_CI_AI
Czech_100_CI_AI
Slovak_CI_AI
Slovak_CS_AS
Nope!

To see examples of sorting rules across various cultures, please see: Collation Charts.

Also, in some languages certain letters or combinations of letters equate to other letters in ways that they do not in most other languages. For example, only in Danish does a "å" equate to "aa". But, the "å" does not equate to just a single "a":

IF (N'aa' COLLATE Danish_Greenlandic_100_CI_AI =  N'å' COLLATE Danish_Greenlandic_100_CI_AI
AND N'a'  COLLATE Danish_Greenlandic_100_CI_AI <> N'å' COLLATE Danish_Greenlandic_100_CI_AI
   ) PRINT 'Danish_Greenlandic_100_CI_AI';

IF (   N'aa' COLLATE Danish_Norwegian_CI_AI =  N'å' COLLATE Danish_Norwegian_CI_AI
   AND N'a'  COLLATE Danish_Norwegian_CI_AI <> N'å' COLLATE Danish_Norwegian_CI_AI
   ) PRINT 'Danish_Norwegian_CI_AI';

IF (   N'aa' COLLATE Latin1_General_100_CI_AI =  N'å' COLLATE Latin1_General_100_CI_AI
   AND N'a'  COLLATE Latin1_General_100_CI_AI <> N'å' COLLATE Latin1_General_100_CI_AI
   ) PRINT 'Latin1_General_100_CI_AI'
ELSE PRINT 'Nope!';

Returns:

Danish_Greenlandic_100_CI_AI
Danish_Norwegian_CI_AI
Nope!

This is all highly complex, and I haven't even mentioned handling for right-to-left languages (Hebrew and Arabic), Chinese, Japanese, combining characters, etc.

If you want some deep insight into the rules, check out the Unicode Collation Algorithm (UCA). The examples above are based on examples in that documentation, though I do not believe all of the rules in the UCA have been implemented, especially since the Windows collations (collations not starting with SQL_) are based on Unicode 5.0 or 6.0, depending on the which OS you are using and the version of the .NET Framework that is installed (see SortVersion for details).

So that is what the Collations do. If you want to see all of the Collations that are available, just run the following:

SELECT [name] FROM sys.fn_helpcollations() ORDER BY [name];

Compagnie answered 30/9, 2015 at 19:20 Comment(0)

If you have one char, then there is no order. But if you order for example NAMES of PEOPLE - different special chars in different languages are ordered differently depending on collation.

first a collation may be case sensitive - show all B before b - and second special chars have special rules depending on collation.

The documentation is pretty good on that.

Saipan answered 18/3, 2012 at 8:4 Comment(2)

How can sql sort english , arabic , hebrew together ? there is no common logic...? please explain – Introduction 18/3, 2012 at 9:56

You can, if you are ignorant. And use for example english fort order everywhere. Saldly this is not necessarily correct. sqlservercentral.com/blogs/rocks/2012/01/09/… has one example (spanish). ou think MS are idiots? msdn.microsoft.com/en-us/library/ms144250.aspx has all collations and a LOT of them are langauge specific. You think they make that without a reason? No, soemtimes the position of special chars in the sort order is dependant by language. – Saipan 18/3, 2012 at 11:49

I think the original poster is getting confused between CODE PAGES and COLLATIONS.

The "n" in nvarchar/nchar allows you to store text using the unicode number set which is large enough to incorporate all characters in all languages (in principle anyway) with a unique number. This itself isn't related to collations. nvarchar/nchar does not use CODE PAGES to encode/decode the meaning of each character code.

Collations define the sort order of characters and which character variants should be treated as identical. nvarchar/nchar DOES use COLLATIONS to define these distinctions.

Bunin answered 8/1, 2014 at 14:36 Comment(0)

Recommended topics

Hot tags