When to use utf-8 and when to use latin1 in MySQL?

8

12

I know that MySQL defaults to latin1 encoding, and apparently it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?

I am working on a site that I hope will be used globally. Do I absolutely need to have utf-8? Or will I be able to get away with using latin1?

Also, I tried to change some tables from latin1 to utf8 but I got this error: "Specified key was too long; max key length is 1000 bytes". Does anyone know the solution to this? And should I really solve that, or might latin1 be enough?

Thanks, Alex

Vishinsky answered 1/2, 2011 at 0:43 Comment(0)
13

it takes 1 byte to store a character in latin1 and 3 bytes to store a character in utf-8 - is that correct?

It takes 1 byte to store a latin1 character and 1 to 3 bytes to store a UTF8 character.

If you only use basic Latin characters and punctuation in your strings (code points 0 to 127 in Unicode), both charsets will occupy the same number of bytes.
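A quick way to see this from the client, assuming your connection character set can represent accented characters (LENGTH() counts bytes, CHAR_LENGTH() counts characters; the column aliases are just for illustration):

SELECT
    LENGTH(CONVERT('hello' USING latin1))  AS latin1_ascii_bytes,  -- 5
    LENGTH(CONVERT('hello' USING utf8mb4)) AS utf8_ascii_bytes,    -- 5, pure ASCII is the same size
    LENGTH(CONVERT('héllo' USING latin1))  AS latin1_accent_bytes, -- 5
    LENGTH(CONVERT('héllo' USING utf8mb4)) AS utf8_accent_bytes;   -- 6, é takes 2 bytes in UTF-8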

Also, I tried to change some tables from latin1 to utf8 but I got this error: "Specified key was too long; max key length is 1000 bytes" Does anyone know the solution to this? And should I really solve that, or might latin1 be enough?

If you have a column of VARCHAR(334) or longer, MyISAM won't let you create an index on it, since there is a remote possibility that the column will occupy more than 1000 bytes.

Note that keys of such length are rarely useful. You can create a prefixed index which will be almost as selective for any real-world data.
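For example, a rough sketch with placeholder names (mytable, and longcolumn as a utf8 VARCHAR(1000)): indexing the whole column fails, but a 200-character prefix stays well under the 1000-byte limit (at most 200 × 3 = 600 bytes).

-- Fails with "Specified key was too long; max key length is 1000 bytes":
-- CREATE INDEX ix_mytable_longcolumn ON mytable (longcolumn);

-- Works: index only the first 200 characters.
CREATE INDEX ix_mytable_longcolumn ON mytable (longcolumn(200));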

Dourine answered 1/2, 2011 at 0:54 Comment(7)
Assuming we need to index the whole column, what's the best workaround for indexing a column which exceeds 1000 bytes?Massey
@Pacerier: use a prefixed index: CREATE INDEX ix_mytable_longcolumn ON mytable (longcolumn(200))Dourine
But that doesn't index the whole column. For example, if we want a unique column of more than 1k bytes, we may use a prefixed index on the first 200 bytes. However, this prefixed index will force the first 200 bytes to be unique. That's not what I want: the first 200 bytes can be equal, as long as the entire value is unique within the table. What's the best solution for a unique column of more than 1k bytes?Massey
@Pacerier: you want index for searching or for uniqueness? If for the latter, just index the string's MD5 hash (or any other hash unique enough).Dourine
For uniqueness. Since the data is more than 1000 bytes (let's assume 30k bytes), there can be a hash collision, as the output is only 64 bytes. In other words, I consider the hash solution sub-standard, since we are risking a bug where data is rejected as a duplicate even though it doesn't already exist in the table. Is there a better alternative solution?Massey
@Pacerier: with any decent hash function, hash collision probability does not depend on the data size, only on the hash size and the number of entries. For any real-world data you can theoretically keep in a single table, it's safe to rely on, say, SHA-2. There is no alternative solution, MyISAM cannot ensure uniqueness of more than 1000 bytes.Dourine
While it is very unlikely you'll actually have a collision in practice, if it is of major concern you might also create a composite "fingerprint" of multiple hashes. For example, take the MD5, SHA-1, and CRC32 and concatenate them.Mojave
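A minimal sketch of the hash-based uniqueness approach discussed in this thread, assuming MySQL 5.7+ (for generated columns) and placeholder names (mytable, longcolumn); on older versions you would maintain the hash column from the application or a trigger instead:

-- Keep a SHA-256 hex digest of the long column and enforce uniqueness on it.
ALTER TABLE mytable
    ADD COLUMN longcolumn_sha256 CHAR(64) AS (SHA2(longcolumn, 256)) STORED,
    ADD UNIQUE KEY ux_mytable_longcolumn_sha256 (longcolumn_sha256);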
11

At a bare minimum I would suggest using UTF-8. Your data will be compatible with every other database out there nowadays since 90%+ of them are UTF-8.

If you go with LATIN1/ISO-8859-1 you risk the data not being stored properly, because it doesn't support international characters, so you might end up with garbled, incorrectly decoded text (mojibake).

If you go with UTF-8, you don't need to deal with these headaches.

Regarding your error, it sounds like you need to optimize your database. Consider this: http://bugs.mysql.com/bug.php?id=4541#c284415

It would help if you gave specifics on your table schema and column for that issue.

Harcourt answered 1/2, 2011 at 0:48 Comment(0)
4

If you allow users to post in their own languages, and if you want users from all countries to participate, you have to switch at least the tables containing those posts to UTF-8 - Latin1 covers only ASCII and western European characters. The same is true if you intend to use multiple languages for your UI. See this post for how to handle migration.
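A sketch of the conversion itself, with placeholder database and table names; utf8mb4 is shown because it is the safer target on modern MySQL (see the utf8mb4 answer further down), though plain utf8 was the usual choice in 2011:

-- Set the default for new tables, then convert an existing one.
ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;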

Semblance answered 1/2, 2011 at 0:49 Comment(0)
2

In my experience, if you plan to support Arabic, Russian, Asian languages or others, the investment in UTF-8 support upfront will pay off down the line. However, depending on your circumstances you may be able to get away with English for a while.

As for the error, you probably have a key or index on a column longer than 333 characters, the maximum that fits within the 1000-byte key limit when using MySQL's utf8 encoding. See this bug report.
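One way to spot the offending column, assuming your schema is named 'your_db' (a placeholder), is to list indexed character columns longer than 333 characters:

SELECT s.TABLE_NAME, s.INDEX_NAME, s.COLUMN_NAME, c.CHARACTER_MAXIMUM_LENGTH
FROM information_schema.STATISTICS AS s
JOIN information_schema.COLUMNS AS c
  ON  c.TABLE_SCHEMA = s.TABLE_SCHEMA
  AND c.TABLE_NAME   = s.TABLE_NAME
  AND c.COLUMN_NAME  = s.COLUMN_NAME
WHERE s.TABLE_SCHEMA = 'your_db'
  AND c.DATA_TYPE IN ('varchar', 'char')
  AND c.CHARACTER_MAXIMUM_LENGTH > 333;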

Cliff answered 1/2, 2011 at 0:50 Comment(5)
This 333 characters thing is confusing. :) Many fields can have more than 333 characters, right? Is it a number field that can not have more than 333 characters? What exactly is the problem usually? Thanks!Vishinsky
@Genadinik: why would you want to index the whole column? For any real-world string, first 20 characters or so are enough for the index still to be selective.Dourine
You likely currently have an index or key field that is defined as VARCHAR(1000) or similar. You basically shouldn't have an index or key on a field that large anyway, but when converting to UTF-8, the field grows from 1000 bytes to 3000 bytes. As stated by Quassnoi, MyISAM won't let you create an index on a column of more than 1000 bytes. You will need to look through your table definitions to find out which column it is.Cliff
Ok that raises maybe a silly question :) ...but some columns have to be over 1000 characters. Like maybe the user's bio or an event description. Or is this error only for an index that is varchar (1000) (which would be a typo somewhere most likely)?Vishinsky
To answer my own question - yes I made the mistake of having a key be varchar(1000) - changing that solved that particular error :) thanks everyone :)Vishinsky
2

Current best practice is to never use MySQL's utf8 character set. Use utf8mb4 instead, which is a proper implementation of the standard.

See Adam Hooper's Explanation for more detail.

Note that in utf8mb4, characters have a variable number of bytes. As the name implies, a character can take up to four bytes. ASCII characters (basic Latin letters, digits, and punctuation) still occupy only one byte when encoded as utf8mb4. Other characters, including accented letters, Kanji, and emoji, require two, three, or four bytes to store.

The "Specified key was too long; max key length is 1000 bytes" error occurs when an index contains columns in utf8mb4, because the index may exceed this limit. You'll need to shorten the length of some character columns, or shorten the index prefix on those columns using the syntax below, to ensure the index stays under the limit.

ALTER TABLE.. ADD INDEX `myIndex` ( column1(15), column2(200) );

Sangsanger answered 4/2, 2021 at 16:15 Comment(0)
1

Since the max length of a key is 1000 bytes, using utf8 will limit you to 333 characters.

However, MySQL handles character sets differently from Oracle. In Oracle you can't have a different character set per column, whereas in MySQL you can, so maybe you can set the key column to latin1 and the other columns to utf8.
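For instance, a hypothetical table with the indexed key kept in latin1 and the content column in UTF-8 (utf8mb4 shown here; all names are made up):

CREATE TABLE articles (
    slug VARCHAR(255) CHARACTER SET latin1 NOT NULL,  -- key stays at 1 byte per character
    body TEXT CHARACTER SET utf8mb4,                  -- international content
    PRIMARY KEY (slug)
);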

Finally, I believe only the defunct version 6.0alpha (ditched when Sun bought MySQL) could accommodate Unicode characters beyond the BMP (Basic Multilingual Plane). So basically, even with UTF-8, you won't have the whole Unicode character set. In practice this is only a problem for rare Chinese characters, if that really matters to you.

Westing answered 1/2, 2011 at 0:52 Comment(0)
1

We built an application using latin1 because it was the default, but later on we had to change everything to UTF-8 because of Spanish characters. It wasn't incredibly difficult, but there's no point in having to change things unnecessarily.

So the short answer is: just go with UTF-8 from the beginning; it will save you trouble later on.

Sandbag answered 1/2, 2011 at 0:53 Comment(0)
1

I am not an expert, but I always understood that UTF-8 actually uses up to 4 bytes per character, not 3. And as I understand it, the MySQL implementation behind utf8_unicode_ci only handles up to 3 bytes per character...

If you want the full 4-byte UTF-8 character encoding, you need to use the utf8mb4 character set (for example with the utf8mb4_unicode_ci collation) for your MySQL database/tables.
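For example, with placeholder names, creating a database and table that default to utf8mb4:

CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

CREATE TABLE myapp.comments (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;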

Epithelium answered 30/5, 2017 at 19:44 Comment(0)
