SQL Server default character encoding
W

6

77

By default - what is the character encoding set for a database in Microsoft SQL Server?

How can I see the current character encoding in SQL Server?

Witchcraft answered 3/3, 2011 at 14:40 Comment(7)
Do you mean collation setting?Unlookedfor
As I remember, in MSSQL XML is stored in UTF-16 and nchar data is stored in UCS-2Backhouse
I'm not sure whether collation is the correct term, I mean for instance if it's using "utf-8" or "iso-8859-1" etcWitchcraft
@david, a SQL Server collation is more than a character set. It involves sort ordering and case sensitivity. See msdn.microsoft.com/en-us/library/ms187582.aspxGrapple
@david, SQL Server does not implicitly use utf-8. For 8-bit data, it uses a codepage that is determined by the collation in use. For 16-bit data, it uses UCS-2. Whether a particular column is 8-bit or 16-bit is determined by that column's datatype, e.g., varchar or nvarchar.Grapple
Interesting that the second posted version of the same answer (with a misspelled function, even!) is the accepted answer.Tube
@Backhouse There is no encoding difference between UTF-16 and UCS-2. The only difference is if Surrogate Pairs of UCS-2 / BMP Code Points are recognized as Code Points themselves (i.e. Supplementary Characters).Oxtail
G
58

If you need to know the default collation for a newly created database use:

SELECT SERVERPROPERTY('Collation')

This is the server collation for the SQL Server instance that you are running.

Grapple answered 3/3, 2011 at 15:2 Comment(7)
the OP asked for character encoding, not collation.Unreality
@1010, collation determines encoding in SQL Server.Grapple
afaik SQL Server uses UCS-2 for the Unicode datatypes; the collation value that you get with SERVERPROPERTY hints at the codepage used for the non-Unicode datatypes, but you have to look it up (for example, Modern Spanish is Windows-1252).Unreality
Does that mean that I cannot stick with the standard varchar even by setting the entire database to a Unicode encoding? I've only found non-Unicode looking collations. I'm thinking of how MySQL does it: You can specify the charset (meaning encoding) and the collation on the db, table, and column level and don't need such strange things like nvarchar and N'Text' with all its conversion issues.Watch
@LonelyPixel, the short answer is no. But the answer is a bit more involved. see msdn.microsoft.com/en-us/library/ms143726.aspx and #9757269Grapple
@Watch Correct, SQL Server does not allow for setting VARCHAR columns to a Unicode encoding like some other RDBMS's do. And, NVARCHAR (as well as XML) data only has a single encoding: UTF-16 Little Endian. The default handling, in terms of the built-in functions, of that UTF-16 data (i.e. when not using a collation ending in _SC) is to only correctly interpret the initial UCS-2 characters, which are a subset of the UTF-16 characters.Oxtail
@Watch (and others): starting in SQL Server 2019, UTF-8 is now an option for CHAR and VARCHAR columns, variables, and string literals. Please see my post for details on this new feature: "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?"Oxtail
O
63

Encodings

In most cases, SQL Server stores Unicode data (i.e. that which is found in the XML and N-prefixed types) in UCS-2 / UTF-16 (storage is the same; UTF-16 merely handles Supplementary Characters correctly). This is not configurable: there is no option to use either UTF-8 or UTF-32 (see UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019).

Whether or not the built-in functions can properly handle Supplementary Characters, and whether or not those are sorted and compared properly, depends on the Collation being used:

  • The older Collations (those with names starting with SQL_, e.g. SQL_Latin1_General_CP1_CI_AS, or with no version number in the name, e.g. Latin1_General_CI_AS) equate all Supplementary Characters with each other (due to having no sort weight).
  • The 90 series Collations (those with _90_ in the name), introduced in SQL Server 2005, can at least do a binary comparison on Supplementary Characters so that you can differentiate between them, even if they don't sort in the desired order. That also holds true for the 100 series Collations introduced in SQL Server 2008.
  • SQL Server 2012 introduced Collations with names ending in _SC that not only sort Supplementary Characters properly, but also allow the built-in functions to interpret them as expected (i.e. treating the surrogate pair as a single entity).
  • Starting in SQL Server 2017, all new Collations (the 140 series) implicitly support Supplementary Characters, hence there are no new Collations with names ending in _SC.
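As a rough illustration of the _SC behavior described above, the following sketch builds a single Supplementary Character from its surrogate pair and measures it under two Collations. The code point (U+1F600) and the variable name are arbitrary choices for this example, and it assumes SQL Server 2012 or newer so that the _SC Collation exists.

```sql
-- U+1F600 expressed as its UTF-16 surrogate pair (high 0xD83D, low 0xDE00).
DECLARE @Supplementary NVARCHAR(10) = NCHAR(0xD83D) + NCHAR(0xDE00);

SELECT DATALENGTH(@Supplementary) AS [Bytes],                                  -- storage size, independent of Collation
       LEN(@Supplementary COLLATE Latin1_General_100_CI_AS)    AS [LenNonSC],  -- counts UTF-16 code units
       LEN(@Supplementary COLLATE Latin1_General_100_CI_AS_SC) AS [LenSC];     -- counts code points
```

The storage should be 4 bytes in both cases; only the interpretation changes, so LenNonSC is expected to be 2 and LenSC to be 1.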

Starting in SQL Server 2019, UTF-8 became a supported encoding for CHAR and VARCHAR data (columns, variables, and literals), but not TEXT (see UPDATE section at the bottom re: UTF-8 starting in SQL Server 2019).

Non-Unicode data (i.e. that which is found in the CHAR, VARCHAR, and TEXT types — but don't use TEXT, use VARCHAR(MAX) instead) uses an 8-bit encoding (Extended ASCII, DBCS, or EBCDIC). The specific character set / encoding is based on the Code Page, which in turn is based on the Collation of a column, or the Collation of the current database for literals and variables, or the Collation of the Instance for variable / cursor names and GOTO labels, or what is specified in a COLLATE clause if one is being used.
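To make the "Collation determines Code Page" relationship concrete, here is a minimal sketch that stores the same character in two VARCHAR columns with different Collations. The table variable and column names are made up for the example; the character used is the Cyrillic capital letter Zhe (U+0416), which exists in Code Page 1251 but not in 1252.

```sql
DECLARE @Demo TABLE
(
    Latin1Col   VARCHAR(10) COLLATE Latin1_General_100_CI_AS,   -- Code Page 1252
    CyrillicCol VARCHAR(10) COLLATE Cyrillic_General_100_CI_AS  -- Code Page 1251
);

-- Same Unicode input for both columns; each column converts it to its own Code Page.
INSERT INTO @Demo (Latin1Col, CyrillicCol)
VALUES (NCHAR(0x0416), NCHAR(0x0416));

SELECT Latin1Col,
       CONVERT(VARBINARY(10), Latin1Col)   AS [Latin1Bytes],
       CyrillicCol,
       CONVERT(VARBINARY(10), CyrillicCol) AS [CyrillicBytes]
FROM   @Demo;
```

The Cyrillic column should keep the character as a single byte, while the Latin1 column is expected to end up with a substitute character (typically "?"), since Code Page 1252 has no mapping for it.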

To see how locales match up to collations, check out:

To see the Code Page associated with a particular Collation (this is the character set and only affects CHAR / VARCHAR / TEXT data), run the following:

SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'CodePage' ) AS [CodePage];

To see the LCID (i.e. locale) associated with a particular Collation (this affects the sorting & comparison rules), run the following:

SELECT COLLATIONPROPERTY( 'Latin1_General_100_CI_AS' , 'LCID' ) AS [LCID];

To view the list of available Collations, along with their associated LCIDs and Code Pages, run:

SELECT [name],
       COLLATIONPROPERTY( [name], 'LCID' ) AS [LCID],
       COLLATIONPROPERTY( [name], 'CodePage' ) AS [CodePage]
FROM sys.fn_helpcollations()
ORDER BY [name];

Defaults

Before looking at the Server and Database default Collations, one should understand the relative importance of those defaults.

The Server (Instance, really) default Collation is used as the default for newly created Databases (including the system Databases: master, model, msdb, and tempdb). But this does not mean that any Database (other than the 4 system DBs) is using that Collation. The Database default Collation can be changed at any time (though there are dependencies that might prevent a Database from having its Collation changed). The Server default Collation, however, is not so easy to change. For details on changing all collations, please see: Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?

The server/Instance Collation controls:

  • local variable names
  • CURSOR names
  • GOTO labels
  • Instance-level meta-data

The Database default Collation is used in three ways:

  • as the default for newly created string columns. But this does not mean that any string column is using that Collation. The Collation of a column can be changed at any time. Here knowing the Database default is important as an indication of what the string columns are most likely set to.
  • as the Collation for operations involving string literals, variables, and built-in functions that do not take string inputs but produce a string output (e.g. IF (@InputParam = 'something') ). Here knowing the Database default is definitely important as it governs how these operations will behave.
  • Database-level meta-data

The column Collation is either specified in the COLLATE clause at the time of the CREATE TABLE or an ALTER TABLE {table_name} ALTER COLUMN, or if not specified, taken from the Database default.
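A minimal sketch of those three cases (inherited, set at CREATE TABLE, and changed via ALTER COLUMN) follows. The table name dbo.CollationDemo and its columns are hypothetical and exist only for this example.

```sql
-- dbo.CollationDemo is a hypothetical table used only for this sketch.
CREATE TABLE dbo.CollationDemo
(
    DefaultCol  VARCHAR(50),                                  -- no COLLATE clause: takes the Database default
    ExplicitCol VARCHAR(50) COLLATE Latin1_General_100_CI_AS  -- Collation set at CREATE TABLE time
);

-- Change the Collation of an existing column.
ALTER TABLE dbo.CollationDemo
    ALTER COLUMN ExplicitCol VARCHAR(50) COLLATE Latin1_General_100_CS_AS;

-- Inspect the resulting column Collations.
SELECT [name], collation_name
FROM   sys.columns
WHERE  [object_id] = OBJECT_ID(N'dbo.CollationDemo');

DROP TABLE dbo.CollationDemo;
```

DefaultCol should report whatever the current Database's default Collation is, and ExplicitCol should report the case-sensitive Collation applied by the ALTER.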

Since there are several layers here where a Collation can be specified (Database default / columns / literals & variables), the resulting Collation is determined by Collation Precedence.
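The effect of Collation Precedence can be sketched as follows, using a throwaway table variable (the names are illustrative only). A column's Collation outranks that of a literal, two columns with different Collations conflict, and an explicit COLLATE clause outranks everything.

```sql
DECLARE @Demo TABLE
(
    InsensitiveCol VARCHAR(10) COLLATE Latin1_General_100_CI_AS,
    SensitiveCol   VARCHAR(10) COLLATE Latin1_General_100_CS_AS
);
INSERT INTO @Demo (InsensitiveCol, SensitiveCol) VALUES ('abc', 'ABC');

-- Column vs. literal: the column's Collation governs the comparison.
SELECT COUNT(*) AS [CaseInsensitiveMatch] FROM @Demo WHERE InsensitiveCol = 'ABC'; -- 1 row matches
SELECT COUNT(*) AS [CaseSensitiveMatch]   FROM @Demo WHERE SensitiveCol   = 'abc'; -- 0 rows match

-- Column vs. column with different Collations raises a collation conflict error:
-- SELECT COUNT(*) FROM @Demo WHERE InsensitiveCol = SensitiveCol;

-- An explicit COLLATE clause takes precedence and resolves the conflict.
SELECT COUNT(*) AS [ExplicitMatch]
FROM   @Demo
WHERE  InsensitiveCol = SensitiveCol COLLATE Latin1_General_100_CI_AS;             -- 1 row matches
```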

All of that being said, the following query shows the default / current settings for the OS, SQL Server Instance, and specified Database:

SELECT os_language_version,
       ---
       SERVERPROPERTY('LCID') AS 'Instance-LCID',
       SERVERPROPERTY('Collation') AS 'Instance-Collation',
       SERVERPROPERTY('ComparisonStyle') AS 'Instance-ComparisonStyle',
       SERVERPROPERTY('SqlSortOrder') AS 'Instance-SqlSortOrder',
       SERVERPROPERTY('SqlSortOrderName') AS 'Instance-SqlSortOrderName',
       SERVERPROPERTY('SqlCharSet') AS 'Instance-SqlCharSet',
       SERVERPROPERTY('SqlCharSetName') AS 'Instance-SqlCharSetName',
       ---
       DATABASEPROPERTYEX(N'{database_name}', 'LCID') AS 'Database-LCID',
       DATABASEPROPERTYEX(N'{database_name}', 'Collation') AS 'Database-Collation',
       DATABASEPROPERTYEX(N'{database_name}', 'ComparisonStyle') AS 'Database-ComparisonStyle',
       DATABASEPROPERTYEX(N'{database_name}', 'SQLSortOrder') AS 'Database-SQLSortOrder'
FROM   sys.dm_os_windows_info;

Installation Default

Another interpretation of "default" could mean what default Collation is selected for the Instance-level collation when installing. That varies based on the OS language, but the (horrible, horrible) default for systems using "US English" is SQL_Latin1_General_CP1_CI_AS. In that case, the "default" encoding is Windows Code Page 1252 for VARCHAR data, and as always, UTF-16 for NVARCHAR data. You can find the list of OS language to default SQL Server collation here: Collation and Unicode support: Server-level collations. Keep in mind that these defaults can be overridden; this list is merely what the Instance will use if not overridden during install.
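One way to check the Code Page claim for that default Collation is sketched below. It uses the Euro sign (U+20AC) as a probe, on the assumption that it is a convenient tell: Windows Code Page 1252 maps it to byte 0x80, whereas ISO-8859-1 has no Euro sign at all.

```sql
SELECT COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [CodePage],
       -- Convert the Euro sign to VARCHAR using this Collation's Code Page,
       -- then look at the single byte that was stored.
       CONVERT(VARBINARY(1),
               CONVERT(VARCHAR(1), NCHAR(0x20AC) COLLATE SQL_Latin1_General_CP1_CI_AS)
              ) AS [EuroSignByte];
```

The first column should come back as 1252, and the second is expected to be 0x80.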


UPDATE 2018-10-02

SQL Server 2019 introduces native support for UTF-8 in VARCHAR / CHAR datatypes (not TEXT!). This is accomplished via a set of new collations, the names of which all end with _UTF8.

This is an interesting capability that will definitely help some folks, but there are some "quirks" with it, especially when UTF-8 isn't being used for all columns and the Database's default Collation, so don't use it just because you have heard that UTF-8 is magically better.

UTF-8 was designed solely for ASCII compatibility: to enable ASCII-only systems (i.e. UNIX back in the day) to support Unicode without changing any existing code or files. That it saves space for data using mostly (or only) US English characters (and some punctuation) is a side-effect. When not using mostly (or only) US English characters, data can be the same size as UTF-16, or even larger, depending on which characters are being used. And, in cases where space is being saved, performance might improve, but it might also get worse.

For a detailed analysis of this new feature, please see my post, "Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?".
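The size trade-off can be seen with a small sketch like the one below. It assumes SQL Server 2019 or newer (for the _UTF8 Collation); the labels and sample characters are arbitrary picks for the example: "a" (1 byte in UTF-8), "é" (U+00E9, 2 bytes), and "日" (U+65E5, 3 bytes), each repeated three times.

```sql
SELECT v.Label,
       -- Converting to VARCHAR under a _UTF8 Collation stores the value as UTF-8.
       DATALENGTH(CONVERT(VARCHAR(30), v.Sample COLLATE Latin1_General_100_CI_AS_SC_UTF8)) AS [Utf8Bytes],
       -- The NVARCHAR original is UTF-16: 2 bytes per BMP character.
       DATALENGTH(v.Sample) AS [Utf16Bytes]
FROM   (VALUES (N'US English', REPLICATE(N'a', 3)),
               (N'Accented',   REPLICATE(NCHAR(0x00E9), 3)),
               (N'Kanji',      REPLICATE(NCHAR(0x65E5), 3))
       ) AS v(Label, Sample);
```

The UTF-16 size is 6 bytes for every row, while the UTF-8 size should be 3, 6, and 9 bytes respectively: smaller, equal, and larger.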

Oxtail answered 1/10, 2015 at 3:41 Comment(2)
And in addition to the defaults above, BULK INSERT, by default, interprets the incoming file as encoded in the system OEM code page. You need to specify the code page in your WITH parameters, e.g. if the incoming text is ANSI: CODEPAGE = 'ACP'Hansel
This is not an answer....this is a BIBLE!! Thanks!Pen
W
24

The default character encoding for a SQL Server database is iso_1, which is ISO 8859-1. Note that the character encoding depends on the data type of a column. You can get an idea of what character encodings are used for the columns in a database as well as the collations using this SQL:

select data_type, character_set_catalog, character_set_schema, character_set_name, collation_catalog, collation_schema, collation_name, count(*) count
from information_schema.columns
group by data_type, character_set_catalog, character_set_schema, character_set_name, collation_catalog, collation_schema, collation_name;

If it's using the default, the character_set_name should be iso_1 for the char and varchar data types. Since nchar and nvarchar store Unicode data in UCS-2 format, the character_set_name for those data types is UNICODE.

Wherewithal answered 9/5, 2012 at 23:28 Comment(2)
This is not strictly correct. the default character encoding depends on the OS language option at the time of SQL Server installation.Grapple
Rob: @Grapple is correct re: the default encoding (via the default collation) being set at install time. I have a link to the mapping between OS locale and default collation in my answer. A much larger issue with this answer, however, is that SQL Server does not, in any way, support the ISO-8859-1 character set. It uses the very similar Windows Code Page 1252, which it erroneously labels as "iso_1". (I'll be generous and guess that perhaps originally, when it was still Sybase, it did use ISO-8859-1 and switched to CP 1252 when it split)Oxtail
T
16

SELECT DATABASEPROPERTYEX('DBName', 'Collation') SQLCollation;

Where DBName is your database name.

Tube answered 3/3, 2011 at 14:54 Comment(1)
Encoding (as per OP's question) and collation are not synonymous.Nostalgia
E
1

I think this is worthy of a separate answer: although Unicode data is stored internally as UTF-16 in SQL Server, it is the Little Endian flavour, so if you're calling the database from an external system, you probably need to specify UTF-16LE.
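A quick way to see the byte order for yourself is to look at the raw bytes of an NVARCHAR value. This is only a sketch using the letter "A" (U+0041) as an arbitrary sample:

```sql
-- 'A' is U+0041; in UTF-16 Little Endian the low byte (0x41) is stored first.
SELECT CONVERT(VARBINARY(2), N'A') AS [Utf16LeBytes]; -- 0x4100
```

A Big Endian encoding would have produced 0x0041 instead.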

Entourage answered 1/3, 2017 at 10:28 Comment(0)
O
0

You can see the collation settings for the columns of a table with the following query:

SELECT t.name AS TableName, c.name AS ColumnName, c.collation_name
FROM sys.columns c
INNER JOIN sys.tables t ON c.object_id = t.object_id
WHERE t.name = 'name of table';
Ozonosphere answered 20/4, 2020 at 6:38 Comment(0)
