Differences between utf8 and latin1

Asked 25/4, 2010 at 16:38 Answered 25/4, 2010 at 16:54

170

What is the difference between utf8 and latin1?

Maser answered 25/4, 2010 at 16:38 Comment(3)

They are different encodings (with some characters mapped to the same byte sequences, namely the ASCII characters; accented letters share code points but not bytes). UTF-8 is one encoding of Unicode with all its code points; Latin1 encodes fewer than 256 characters. – Ictus 25/4, 2010 at 16:45

There is also latin9 which is available in Linux locales and could have been mentioned in the question: en.wikipedia.org/wiki/ISO/IEC_8859-15 – Callihan 6/4, 2020 at 17:19

Does this answer your question? What is the difference between UTF-8 and ISO-8859-1? – Whatever 5/8, 2022 at 2:57

187

UTF-8 is prepared for world domination, Latin1 isn't.

If you're trying to store non-Latin characters like Chinese, Japanese, Hebrew, Russian, etc. using Latin1 encoding, then they will end up as mojibake. You may find the introductory text of this article useful (and even more so if you know a bit of Java).
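As a rough illustration of what goes wrong, here is a minimal Python sketch (Python is not part of the original answer, and the sample strings are arbitrary). It shows that characters outside Latin1 either fail to encode or get replaced by `?`, and that UTF-8 bytes misread as Latin1 turn into mojibake.

```python
# Characters outside Latin-1 cannot be encoded at all in strict mode...
text = "日本語"
try:
    text.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# ...and with errors="replace" each of them silently becomes "?".
replaced = text.encode("latin-1", errors="replace")

# Classic mojibake: bytes written as UTF-8 but read back as Latin-1.
utf8_bytes = "é".encode("utf-8")         # two bytes: 0xC3 0xA9
mojibake = utf8_bytes.decode("latin-1")  # each byte becomes its own character

print(encodable)  # False
print(replaced)   # b'???'
print(mojibake)   # Ã©
```

Which of the two failure modes you actually see depends on how the client, the connection and the column are configured, so treat this only as a demonstration of the mechanism.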

Note that full 4-byte UTF-8 support was only introduced in MySQL 5.5. Before that version, MySQL's utf8 went up to only 3 bytes per character, not 4. So it supported only the BMP (Basic Multilingual Plane) and not the supplementary planes, which is where e.g. most emoji live. If you want full 4-byte UTF-8 support, upgrade MySQL to at least 5.5 or go for another RDBMS like PostgreSQL. In MySQL 5.5+ the 4-byte character set is called utf8mb4; plain utf8 there is still the 3-byte variant.
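To make the 3-byte versus 4-byte distinction concrete, here is a small Python sketch. The helper `fits_in_three_bytes` is a hypothetical name introduced only for illustration; it checks whether every character lies in the BMP, which is the range a 3-byte-limited character set can hold.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the UTF-8 encoding of a single character occupies."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """Hypothetical helper: True if every character is in the BMP (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


samples = ["A", "é", "€", "😀"]
lengths = [utf8_len(ch) for ch in samples]

print(lengths)                        # [1, 2, 3, 4]
print(fits_in_three_bytes("日本語"))   # True
print(fits_in_three_bytes("😀"))      # False
```

A check like this could be run on input before writing to a 3-byte utf8 column, assuming you cannot switch the column to utf8mb4.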

Sikora answered 25/4, 2010 at 16:54 Comment(12)

MySQL 5.1 supports 3-byte UTF-8; however, MySQL 5.5 does support 4-byte UTF-8 as utf8mb4. – Deathless 22/8, 2011 at 18:2

True that, but MySQL 5.5 wasn't GA at the moment this answer was posted. It was released December 2010. – Sikora 2/5, 2012 at 18:34

@Sikora Can you elaborate more on how UTF-8 isn't fully supported? Does it mean that MySQL 5.1 can't store all Unicode characters? – Francis 12/6, 2012 at 5:54

@Pacerier: it only supports 3 bytes per character, thus only the BMP (the first 65,536 code points, U+0000 through U+FFFF) is supported, and the rest is not. For all characters, see en.wikipedia.org/wiki/Plane_(Unicode) – Sikora 12/6, 2012 at 11:1

@Sikora So how do we store the Unicode character LINEAR B SYLLABLE B008 A? : fileformat.info/info/unicode/char/10000/index.htm – Francis 12/6, 2012 at 18:29

@Pacerier: Upgrade to MySQL 5.5. – Sikora 12/6, 2012 at 18:45

@Sikora For people who are using 5.1.63 and don't have the privilege to update the web server's MySQL version, what are the alternatives? – Francis 12/6, 2012 at 18:54

@Pacerier: You could save as VARBINARY instead of VARCHAR and decode/encode in the business tier yourself, but this is hacky. Consider asking a new question, maybe there are better ways. – Sikora 12/6, 2012 at 18:57
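A minimal sketch of that workaround, assuming the business tier is written in Python (the thread does not say which language is used). The function names `to_storage` and `from_storage` are hypothetical; the idea is only that the application encodes to UTF-8 bytes before writing to the VARBINARY column and decodes after reading, so the database never interprets the bytes.

```python
def to_storage(text: str) -> bytes:
    """Encode text to raw UTF-8 bytes for a binary column (hypothetical helper)."""
    return text.encode("utf-8")


def from_storage(blob: bytes) -> str:
    """Decode raw UTF-8 bytes read back from a binary column (hypothetical helper)."""
    return blob.decode("utf-8")


linear_b = "\U00010000"        # LINEAR B SYLLABLE B008 A, outside the BMP
blob = to_storage(linear_b)    # four bytes: F0 90 80 80
restored = from_storage(blob)

print(blob.hex())              # f0908080
print(restored == linear_b)    # True
```

The trade-off is that the database can no longer apply character-aware collation, case-insensitive comparison or length limits in characters to such a column.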

Good answer! Sorry to nitpick. Chinese, Japanese, Hebrew are languages and contain characters. But Cyrillic is a writing system (and is used by several languages). – Sisyphus 23/8, 2018 at 20:41

@HoldOffHunger: Right, answer has been adjusted. – Sikora 24/8, 2018 at 15:42

@Ali "Before that version, it only goes up to 3 bytes, not 4 bytes per character." And there's nothing specific to "MySQL 5.1". The change was in MySQL 5.5. – Sikora 1/2, 2019 at 11:53

You didn't answer the question. Stack Overflow requires people to respond technically, not to say which one is most used. Your answer is typically off-topic. – Quadruplex 15/5, 2020 at 5:33

In latin1, each character is exactly one byte long. In utf8, a character can consist of more than one byte. Consequently, utf8 has more characters than latin1 (and the characters they do have in common aren't necessarily represented by the same byte or byte sequence).
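A quick way to see this is to encode the same characters both ways. The following Python sketch is only an illustration (the sample characters are arbitrary): an ASCII letter gets the same single byte in both encodings, while an accented letter is one byte in latin1 and two different bytes in utf8.

```python
def both_encodings(ch: str) -> tuple[bytes, bytes]:
    """Return the (latin1, utf8) byte representations of a character."""
    return ch.encode("latin-1"), ch.encode("utf-8")


ascii_pair = both_encodings("A")
umlaut_pair = both_encodings("ü")

print(ascii_pair)   # (b'A', b'A')
print(umlaut_pair)  # (b'\xfc', b'\xc3\xbc')
```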

Slinkman answered 25/4, 2010 at 16:42 Comment(3)

What about ascii and bin? – Sisk 17/5, 2017 at 10:54

@YoushaAleayoub ASCII is a single-byte encoding which uses the code values 0 through 127, so it can encode half as many characters as latin1. It's a strict subset of both latin1 and utf8, meaning the bytes 0 through 127 in both latin1 and utf8 encode the same things as they do in ASCII. Bin isn't an encoding. It's usually an option that you can give when reading a file, telling the IO functions not to apply any encoding, but instead to just read the file byte by byte. – Slinkman 17/5, 2017 at 11:38

Thanks, I meant the binary collation...? And which one is better for English/numeric fields: ascii_general_ci or ascii_bin? – Sisk 17/5, 2017 at 12:29
