Differences between utf8 and latin1

2

170

What is the difference between utf8 and latin1?

Maser answered 25/4, 2010 at 16:38 Comment(3)
They are different encodings. The ASCII characters map to the same bytes in both; the accented letters of Latin1 have the same code points in Unicode but take two bytes in UTF-8. UTF-8 is one encoding of Unicode with all its code points; Latin1 encodes at most 256 characters. - Ictus
There is also latin9, which is available in Linux locales and could have been mentioned in the question: en.wikipedia.org/wiki/ISO/IEC_8859-15 - Callihan
Does this answer your question? What is the difference between UTF-8 and ISO-8859-1? - Whatever
187

UTF-8 is prepared for world domination, Latin1 isn't.

If you try to store non-Latin characters such as Chinese, Japanese, Hebrew, Russian, etc. using the Latin1 encoding, they will end up as mojibake. You may find the introductory text of this article useful (even more so if you know a bit of Java).
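To make the mojibake point concrete, here is a minimal Python sketch. It does not involve MySQL at all; it only uses Python's built-in `latin-1` and `utf-8` codecs, and the helper name `fits_latin1` and the sample strings are made up for illustration.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Western European text fits; other scripts do not.
for sample in ["café", "日本語", "עברית", "Русский"]:
    print(repr(sample), "fits Latin-1:", fits_latin1(sample))

# UTF-8 bytes that are misread as Latin-1 decode without an error,
# but every byte becomes its own (wrong) character: mojibake.
utf8_bytes = "Русский".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))

# As long as the bytes were not altered, the damage can be undone
# by reversing the wrong step and decoding correctly.
restored = mojibake.encode("latin-1").decode("utf-8")
print(restored)
```

The round trip at the end only works because Python's `latin-1` codec maps every byte value to a character; once a lossy conversion has replaced characters with `?`, the original text is gone.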

Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, MySQL's utf8 went up to only 3 bytes per character, not 4. So it supported only the BMP (Basic Multilingual Plane) and not, e.g., the supplementary planes where most emoji live. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS like PostgreSQL. In MySQL 5.5+ the 4-byte character set is called utf8mb4.
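The 3-byte versus 4-byte distinction can be checked without a database. The following Python sketch only measures how many bytes each character needs in UTF-8; the helper names `utf8_len` and `fits_3_byte_utf8` are hypothetical, and the check mirrors the limit described above rather than querying MySQL itself.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_3_byte_utf8(text: str) -> bool:
    """True if no character in text needs more than 3 UTF-8 bytes,
    i.e. every character lies in the BMP (U+0000 through U+FFFF)."""
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "LATIN CAPITAL LETTER A": "A",                   # U+0041
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",     # U+00E9
    "EURO SIGN": "\u20ac",                           # U+20AC
    "LINEAR B SYLLABLE B008 A": "\U00010000",        # U+10000
    "GRINNING FACE": "\U0001F600",                   # U+1F600
}

for name, ch in samples.items():
    print(f"U+{ord(ch):04X} {name}: {utf8_len(ch)} byte(s)")
```

Anything for which `fits_3_byte_utf8` returns False is the kind of text that needs a 4-byte-capable character set.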

Sikora answered 25/4, 2010 at 16:54 Comment(12)
MySQL 5.1 supports 3-byte UTF-8; MySQL 5.5, however, does support 4-byte UTF-8 as utf8mb4. - Deathless
True that, but MySQL 5.5 wasn't GA at the moment this answer was posted. It was released December 2010. - Sikora
@Sikora Can you elaborate on how UTF-8 isn't fully supported? Does it mean that MySQL 5.1 can't store all Unicode characters? - Francis
@Pacerier: it only supports 3 bytes per character, so only the BMP (the first 65,536 code points, U+0000 through U+FFFF) is supported and the rest is not. For all characters, see en.wikipedia.org/wiki/Plane_(Unicode) - Sikora
@Sikora So how do we store the Unicode character LINEAR B SYLLABLE B008 A? fileformat.info/info/unicode/char/10000/index.htm - Francis
@Pacerier: Upgrade to MySQL 5.5. - Sikora
@Sikora What are the alternatives for people who use 5.1.63 and don't have the privileges to upgrade the web server's MySQL version? - Francis
@Pacerier: You could save as VARBINARY instead of VARCHAR and decode/encode in the business tier yourself, but this is hacky. Consider asking a new question, maybe there are better ways. - Sikora
Good answer! Sorry to nitpick: Chinese, Japanese, and Hebrew are languages with their own characters, but Cyrillic is a writing system (used by several languages). - Sisyphus
@HoldOffHunger: Right, answer has been adjusted. - Sikora
@Ali "Before that version, it only goes up to 3 bytes, not 4 bytes per character." And there's nothing specific to "MySQL 5.1". The change was in MySQL 5.5. - Sikora
You didn't answer the question. Stack Overflow requires people to respond technically, not to say which one is most used. Your answer is typically off-topic. - Quadruplex
63

In latin1 each character is exactly one byte long. In utf8 a character can consist of more than one byte. Consequently, utf8 can represent more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte or byte sequence).
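A quick way to see this is to encode the same characters both ways. This is a small Python sketch using the built-in codecs; the function name `compare` is just an illustrative helper.

```python
def compare(ch: str) -> tuple[bytes, bytes]:
    """Return (latin1_bytes, utf8_bytes) for one character
    that exists in both encodings."""
    return ch.encode("latin-1"), ch.encode("utf-8")


for ch in ["A", "é", "ÿ"]:
    latin1, utf8 = compare(ch)
    same = "same bytes" if latin1 == utf8 else "different bytes"
    print(f"{ch!r}: latin1={latin1.hex()} utf8={utf8.hex()} ({same})")
```

Only the ASCII range (bytes 0 through 127) comes out identical; every other latin1 character is one byte in latin1 and two bytes in utf8.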

Slinkman answered 25/4, 2010 at 16:42 Comment(3)
What about ascii and bin? - Sisk
@YoushaAleayoub ASCII is a single-byte encoding which uses the code points 0 through 127, so it can encode half as many characters as latin1. It's a strict subset of both latin1 and utf8, meaning the bytes 0 through 127 in both latin1 and utf8 encode the same things as they do in ASCII. Bin isn't an encoding. It's usually an option that you can give when reading a file, telling the IO functions to not apply any encoding, but instead just read the file byte by byte. - Slinkman
Thanks, I meant the binary collation. Which one is better for English/numeric fields: ascii_general_ci or ascii_bin? - Sisk
