utf-8 vs latin1

D

4

23

What are the advantages and disadvantages of using utf8 as a charset versus latin1?

If UTF-8 can support more characters and is used consistently, wouldn't it always be the better choice? Is there any reason to choose latin1?

Dav answered 16/9, 2012 at 18:7 Comment(1)

Always use utf8mb4 and not the utf8 - it's a kind of MySQL bug. – Scarron 13/11, 2017 at 19:10

A

20

latin1 has the advantage of being a single-byte encoding: every character takes exactly one byte, so it can store more characters in the same amount of storage space than a multibyte encoding. This matters because the storage size of string data types in MySQL depends on the encoding. The manual states:

To calculate the number of bytes used to store a particular CHAR, VARCHAR, or TEXT column value, you must take into account the character set used for that column and whether the value contains multibyte characters. In particular, when using a utf8 Unicode character set, you must keep in mind that not all characters use the same number of bytes. utf8mb3 and utf8mb4 character sets can require up to three and four bytes per character, respectively. For a breakdown of the storage used for different categories of utf8mb3 or utf8mb4 characters, see Section 10.9, “Unicode Support”.
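The byte counts described in that excerpt can be checked outside MySQL. What follows is a minimal Python sketch, not MySQL code: it assumes Python's `utf-8` codec is a fair stand-in for utf8mb4 and that `latin-1` (strict ISO 8859-1) approximates MySQL's latin1. MySQL's latin1 is actually closer to Windows-1252, so the euro sign result below would differ there. The sample characters are chosen for illustration only.

```python
# Sample characters picked for illustration; they do not come from the answer.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji

def encoded_length(ch, codec):
    """Return the number of bytes ch needs in codec, or None if it cannot be encoded."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None

# Map each character to (bytes in latin-1, bytes in UTF-8).
byte_counts = {
    ch: (encoded_length(ch, "latin-1"), encoded_length(ch, "utf-8"))
    for ch in samples
}

for ch, (latin1_len, utf8_len) in byte_counts.items():
    print(f"U+{ord(ch):04X}  latin-1: {latin1_len}  utf-8: {utf8_len}")
```

A `None` in the latin-1 column means the character simply cannot be stored in that encoding, and the four-byte emoji is the kind of character that needs utf8mb4 rather than utf8mb3.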

Furthermore, many string operations (such as taking substrings and collation-dependent comparisons) are faster with single-byte encodings.

In any case, latin1 is not a serious contender if you care about internationalization at all. It can be an appropriate choice when you will be storing known safe values (such as percent-encoded URLs).

Arissa answered 16/9, 2012 at 18:14 Comment(6)

Does it also support other Unicode languages? Hebrew in particular? – Dav 16/9, 2012 at 18:17

It doesn't support Hebrew, @qwertymk. See en.wikipedia.org/wiki/ISO/IEC_8859-1 for a list of scripts, and indeed individual characters, it does support. – Sarracenia 16/9, 2012 at 18:18

@qwertymk: Obviously not, it's called the Western European charset. – Arissa 16/9, 2012 at 18:19

If you never use characters that require multiple bytes, then UTF-8 is as efficient as latin1. I know that sounds redundant, but it makes it clear that if you only plan to use English text data, you won't incur any storage penalty, but you have the option to store text from any language. – Mohamed 16/9, 2012 at 18:49

According to dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html, MySQL does not support 4-byte UTF-8 – Hartzel 15/1, 2015 at 17:32

@RossSmithII: It does from 5.5.3 onwards, with the utf8mb4 character set. I will agree it was not one of their best moments. – Arissa 16/1, 2015 at 9:46

H

23

UTF8 Advantages:

Supports most languages, including RTL languages such as Hebrew.
No translation needed when importing/exporting data to UTF8 aware components (JavaScript, Java, etc).

UTF8 Disadvantages:

Non-ASCII characters will take more time to encode and decode, due to their more complex encoding scheme.
Non-ASCII characters (anything outside the 128 characters of the ASCII character set) will take more space, as they may be stored using more than 1 byte. A CHAR(10) or VARCHAR(10) field may need up to 30 bytes to store some UTF8 characters.
Collations other than utf8_bin will be slower (as the sort order will not directly map to the character encoding order), and will require translation in some stored procedures (as variables default to utf8_general_ci collation).
If you need to JOIN UTF8 and non-UTF8 fields, MySQL will impose a SEVERE performance hit. What would be sub-second queries could potentially take minutes if the fields joined are different character sets/collations.
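The "up to 30 bytes" figure for a CHAR(10) or VARCHAR(10) field is just the declared character count multiplied by the maximum bytes per character. Here is a rough Python sketch of that arithmetic. The helper names are hypothetical, the per-character maxima are the ones quoted from the manual, and the VARCHAR length prefix and any row overhead are deliberately ignored.

```python
# Maximum bytes per character; the utf8mb3/utf8mb4 figures are those quoted from the manual.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}

def worst_case_bytes(declared_chars, charset):
    """Upper bound on character data bytes for CHAR(n)/VARCHAR(n), ignoring length prefixes."""
    return declared_chars * MAX_BYTES_PER_CHAR[charset]

def actual_utf8_bytes(value):
    """Bytes a concrete value occupies when encoded as UTF-8."""
    return len(value.encode("utf-8"))

print(worst_case_bytes(10, "latin1"))    # ten single-byte characters
print(worst_case_bytes(10, "utf8mb3"))   # ten three-byte characters
print(actual_utf8_bytes("a" * 10))       # ASCII-only text costs one byte per character
print(actual_utf8_bytes("\u20ac" * 10))  # ten euro signs, three bytes each
```

The column still holds ten characters in every case; only the number of bytes behind them changes.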

Bottom line:

If you don't need to support non-Latin1 languages, want to achieve maximum performance, or already have tables using latin1, choose latin1.

Otherwise, choose UTF8.

Hartzel answered 16/9, 2012 at 18:24 Comment(3)

The statement "You may need to increase your CHAR field's lengths to allow for the additional space, as a VARCHAR(10) may only be able to store five, or less, characters of UTF8 data." (in Disadvantage 1) is incorrect. The column size reflects the maximum number of characters allowed, not the storage size (see dev.mysql.com/doc/refman/5.6/en/storage-requirements.html). – Teakwood 5/1, 2015 at 16:38

meden: You're absolutely right. I've updated my answer to reflect this fact. Sorry for the mistake. – Hartzel 15/1, 2015 at 17:33

what about ASCII? instead of Latin – Alleris 25/10, 2016 at 11:58

D

7

@Ross Smith II, Point 4 is worth gold, meaning inconsistency between columns can be dangerous.

To add value to the already good answers, here is a small performance test about the difference between charsets:

Setup: a modern (2013) server and a real-world table with 20000 rows, with no index on the column concerned.

SELECT 4 FROM subscribers WHERE 1 ORDER BY time_utc_str; (4 is cache buster)

varchar(20) CHARACTER SET latin1 COLLATE latin1_bin: 15ms
varbinary(20): 17ms
utf8_bin: 20ms
utf8_general_ci: 23ms

For simple strings such as numeric dates, when performance is a concern, I would choose utf8_bin (CHARACTER SET utf8 COLLATE utf8_bin). This avoids adverse effects with other code that expects database charsets to be utf8, while still comparing in an essentially binary way.
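To show why a binary collation has less work to do than a case-insensitive one, here is a small Python sketch. It does not reproduce the timings above and is not how MySQL implements collations. Sorting by raw UTF-8 bytes stands in for a `_bin` collation, and `str.casefold` is only a rough approximation of utf8_general_ci. The word list is made up for illustration.

```python
# Made-up sample values, not the data from the benchmark.
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Binary-style ordering: compare the encoded bytes directly, no transformation.
binary_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Case-insensitive ordering: every value is transformed before it is compared,
# which is the extra step a non-binary collation has to pay for.
case_insensitive_order = sorted(words, key=str.casefold)

print(binary_order)
print(case_insensitive_order)
```

In the binary ordering all uppercase-initial words sort before lowercase ones, because the comparison follows byte values. For strings like numeric dates, both orderings give the same result, which is why the cheaper one is attractive.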

Disbelief answered 22/7, 2014 at 15:20 Comment(0)

O

1

Fixed-length encodings such as latin1 are generally more efficient in terms of CPU consumption.

If the set of tokens in some fixed-length character set is known to be sufficient for your purpose at hand, and your purpose involves heavy and intensive string processing, with lots of LENGTH() and SUBSTR() stuff, then that could be a good reason for not using encodings such as UTF-8.
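The cost difference for SUBSTR()-style operations comes from locating the Nth character. In a single-byte encoding the byte offset equals the character index, while in UTF-8 the data has to be scanned. The Python sketch below illustrates the idea only. `utf8_byte_offset` is a hypothetical helper, not a MySQL function, and a real database engine may use smarter strategies.

```python
def utf8_byte_offset(data: bytes, char_index: int) -> int:
    """Return the byte offset where character number char_index starts in UTF-8 data."""
    seen = 0
    for offset, byte in enumerate(data):
        # Bytes of the form 10xxxxxx are continuation bytes; any other byte starts a character.
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("char_index out of range")

text = "na\u00efve caf\u00e9"  # 10 characters, two of them non-ASCII
utf8_data = text.encode("utf-8")
latin1_data = text.encode("latin-1")

# latin-1: character 3 ('v') lives at byte 3, found by direct indexing.
print(latin1_data[3:4])
# UTF-8: the same character has to be found by scanning from the start.
print(utf8_byte_offset(utf8_data, 3))
print(len(latin1_data), len(utf8_data))
```

With latin-1 the lookup is constant time; with UTF-8 it is proportional to the position being sought, which adds up in heavy string processing.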

Oh, and BTW: do not confuse a character set with an encoding of it, as you seem to do. A character set is a defined set of characters. The same character set can have multiple distinct encodings. Each version of the Unicode standard constitutes a character set, and each can be encoded as UTF-8, UTF-16, or UTF-32 (which uses a full four bytes for every character). The latter two each come in a big-endian (high-order byte first) and a little-endian (high-order byte last) flavour.
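The distinction between a character set and its encodings is easy to see with one character. This short Python sketch uses the standard codecs to encode a single code point several ways. The character chosen (U+00E9) is just an example.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
code_point = ord(ch)  # the character's number in the Unicode character set

# One character, several byte representations depending on the encoding.
codec_names = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le", "latin-1")
encodings = {name: ch.encode(name) for name in codec_names}

print(hex(code_point))
for name, raw in encodings.items():
    print(f"{name:10} {raw.hex(' ')}")
```

The code point stays the same throughout; only the bytes differ, and the `-be`/`-le` pairs differ only in byte order.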

Obstruction answered 16/9, 2012 at 18:13 Comment(0)
