Does using ASCII/Latin Charset speed up the database?
Asked Answered
D

2

9

It would seem that using the ASCII charset for most fields and then specify UTF-8 only for the fields that need it would reduce the amount of I/O the database must perform by 100%.

Anyone know if this is true?

Update: The above was not really my question. I should have said: use Latin for the default character set and then only specify utf8mb4 only for the fields that need it. The thinking being that: using 1 byte vs 2 bytes should improve I/O by 100%. Sorry for the confusion.

Discombobulate answered 23/7, 2018 at 23:24 Comment(4)
FWIW: UTF-8 takes the same "space" as ASCII (characters 0-127). The UTF-8 encoding only takes more space for "extended" ASCII (127 > x >= 255, aka 'ANSI') or higher code-points, but "extended" ASCII doesn't support code-points > 255 and has a flexible mapping definition (which code page?), so it's not even Apples to Oranges.Warehouse
tldr; for English-alphabet "compatible" languages there is no difference in storage requirements between ASCII and UTF-8 encodings. For non-English languages, using ASCII is problematic for other reasons.Warehouse
English text is not even compatible with the "English Alphabet". Check any number of English-language websites—including this one.Dorthadorthea
my mistake, i thought that Ascii was 8 bit not 7 bit. So for 8 bit i should use "CHARACTER SET utf8 COLLATE utf8_bin"Discombobulate
I
8

@RickJames is right, you should not worry about saving space by choosing ASCII or utf8 over utf8mb4.

utf8 and utf8mb4 are variable-length character encodings. This table from wikipedia illustrates how characters automatically take 1, 2, 3, or 4 bytes each, depending on the value encoded. If the high bit of a byte is set, then the character uses an additional byte, up to 4 bytes.

enter image description here The wikipedia article explains it clearly:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

You don't have to do anything to choose single-byte versus multi-byte mode. This is just the way the encoding works. Each character automatically uses the number of bytes it needs, and no more.

So there is no advantage to using utf8 over utf8mb4, and no advantage of using ASCII over either, unless you need to restrict the characters allowed in a string.

For what it's worth, the character set MySQL calls "utf8" is an alias for utf8mb3, an implementation of just the first three bytes of the UTF8 encoding. The MySQL server team blog (https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/) says that utf8mb4 is faster, at least given performance improvements in MySQL 8.0, and utf8mb3 should be considered deprecated. MySQL 8.0.11 release notes say that utf8 will be redefined as an alias for utf8mb4 in some future version of MySQL.

Inness answered 24/7, 2018 at 16:20 Comment(1)
Great answer bill. ThanksDiscombobulate
N
13

Short Answer

Not worth worrying about.

Long Answer

Two issues:

Speed

Comparing two encodings with the corresponding _bin (ascii_bin or utf8_bin) COLLATION is as simple as comparing the bytes -- so no significant difference. Other collations can differ, with ascii being faster. But the difference is insignificant compared to the effort of fetching rows, etc.

Space

ascii is a subset of utf8. utf8 stores only 1 byte for each ASCII character, just as ascii does. So, no space difference. (Accented letters in Western Europe need either 1-byte latin1 or 2-byte utf8; hence incompatible and different in size). Space leads to caching, which leads to a slight difference in performance.

For English text, 0% savings. For European, latin1 would save only a few percent; For most the rest of the world, utf8 are the only viable solution. For Chinese and Emoji, utf8mb4 is a must.

Temp tables

In certain situations, the space consumed by a string expands to the potential max. country_code CHAR(2) CHARACTER SET ... will take 2 bytes for ascii; 6 bytes for utf8.

Bottom Line:

Use ascii for country codes, hex, postal codes, UUIDs, MD5s etc. If you are going international, and/or need Emoji, then make your "strings" utf8mb4. But do it because it is 'right', not because you will get magically marvelously much more speed; you won't. And do it whenever you create a table; it's the pits to change it later.

Nic answered 23/7, 2018 at 23:51 Comment(5)
Before my database was using utf8mb4 everwhere. So to goto utf-8 and utf8-bin and only use utf8mb4 on the fields that need it should be a 100% reduction in I/O?Discombobulate
@Discombobulate - CHARACTER SET utf8 does not reduce I/O any, relative to utf8mb4. Stay with utf8mb4, except for a few things that are clearly ascii. The "100%" is bogus.Nic
Rick, sorry for being so thick about this. The manual says that utf8mb4 is 4-byte UTF-8 Encoding. So if I write "aaaaaaaaaa" to a varchar(10) field will allocate 40 bytes on disk? It that same field was encoded with ASCII the same varchar(10) would allocated 10 bytes on disk? if I have a varchar(10) if I encode it with utf8mb4 it will take 20 bytes on disk?Discombobulate
@Discombobulate - 10 bytes, not 40. The English characters (eg a) are a subset of utf8 and utf8mb4. Each English character (in a VARCHAR or TEXT takes only 1 byte. That is, English takes the same space and is encoded the same in almost all character sets. Accented western European characters take 2 bytes. 10 Chinese characters take 3 or 4 bytes each.Nic
@Discombobulate - More details on character sizesNic
I
8

@RickJames is right, you should not worry about saving space by choosing ASCII or utf8 over utf8mb4.

utf8 and utf8mb4 are variable-length character encodings. This table from wikipedia illustrates how characters automatically take 1, 2, 3, or 4 bytes each, depending on the value encoded. If the high bit of a byte is set, then the character uses an additional byte, up to 4 bytes.

enter image description here The wikipedia article explains it clearly:

The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

You don't have to do anything to choose single-byte versus multi-byte mode. This is just the way the encoding works. Each character automatically uses the number of bytes it needs, and no more.

So there is no advantage to using utf8 over utf8mb4, and no advantage of using ASCII over either, unless you need to restrict the characters allowed in a string.

For what it's worth, the character set MySQL calls "utf8" is an alias for utf8mb3, an implementation of just the first three bytes of the UTF8 encoding. The MySQL server team blog (https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/) says that utf8mb4 is faster, at least given performance improvements in MySQL 8.0, and utf8mb3 should be considered deprecated. MySQL 8.0.11 release notes say that utf8 will be redefined as an alias for utf8mb4 in some future version of MySQL.

Inness answered 24/7, 2018 at 16:20 Comment(1)
Great answer bill. ThanksDiscombobulate

© 2022 - 2024 — McMap. All rights reserved.