Converting iso-8859-1 data to UTF-8 in UTF8 and Latin1 tables
Asked Answered
R

3

5

Problem Summary:

While trying to convert a site with mysql database from latin1 to utf8, some special characters are not displaying correctly despite ensuring charsets are all utf8 system wide.

Problem Details:

This is a common problem. But I seem to have an added complexity.

Years ago, a oblivious developer (me), put a site together with MySQL. Some tables were setup with latin1_swedish_ci and utf8_general_ci. All input/display was done via pages with iso-8859-1 charset.

Now, I have the task of turning all this data into utf-8 and thus finally uniforming the encoding. However, i'm having issues with a number of special characters in both instances (ie: ü). The characters don't seem to display correctly on a UTF-8 page. They display as �.Instead When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.

I've tried a number of things:

  1. Percona script: https://github.com/rlowe/mysql_convert_charset
  2. converting col to binary and then to utf8
  3. converting utf8 tables to latin and then repeat above process

Nothing seems to cure the data.

Dumping the entire database and important isn't really a viable option as it's a huge database now and downtime is restricted.

UPDATE (22-Oct-2013)

I've taken @deceze suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1. So, i've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following which displays the text correctly:

select convert(cast(convert(field1 using latin1) as binary) using utf8) from my table WHERE id = 1

This will convert Hahnemühle to Hahnemühle.

In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2 converts Hahnem�hle to Hahnem�hle. I've gone through all the charsets on http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html replacing latin1 but none seem to spit out the data correctly.

Rosewood answered 21/10, 2013 at 14:20 Comment(16)
What exactly are those "problems" you're having? Are those characters stored correctly in their respective columns? Do they look as they should when viewing them in a decent administrative interface, or are they garbled already? Unless you're trying to store non-Latin1 characters in a Latin1 column, it doesn't really matter what setting these columns were in so far. A latin1 column simply cannot store more than the 256 characters defined by Latin-1, other than that it doesn't matter in daily use. If you've inserted data using the correct connection encoding, there should be no problem.Harmonie
@Harmonie I've added the problem (sorry). The characters aren't being displayed correctly on a utf-8 page. Instead I get a diamond question mark in Google Chrome. When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.Rosewood
Have you updated the connection encoding so data coming from the database is actually UTF-8 encoded? Or are you still retrieving data in Latin-1?Harmonie
@Harmonie it was set as latin1_swedish_ci and i've changed it to utf8_general_ci as recommended in the Percona script.Rosewood
Nope, not the column collation, the connection encoding. Please read Handling Unicode Front To Back In A Web App for a rundown of the concept.Harmonie
@Harmonie character_set_connection = utf8 and collation_connection = utf8_general_ciRosewood
Well, if the browser is displaying a �, that means it's trying to interpret whatever you're sending it as UTF-8, but you're not actually sending it UTF-8. If you set the connection encoding to the database to utf8, you should get UTF-8 encoded data, however it is stored in your columns. Somewhere along the chain database → PHP → browser the data is converted to something that's not UTF-8. Can't tell you where with the information given.Harmonie
@Harmonie This is why I think it's something to do with the data.Rosewood
It shouldn't. If you tell MySQL you want utf8, it'll give you valid UTF-8. The characters may be garbage because you've mistreated them in the past, but it should be valid UTF-8 encoded garbage.Harmonie
@Harmonie If I spit out the problematic text on a php script and wrap it in htmlentities() it's all good. But use htmlsepcialchars() (or nothing) and it's no good.Rosewood
Please show the bin2hex($string) value of the string in question and the text that should show.Harmonie
@Harmonie Hahnemühle = 4861686e656dfc686c65Rosewood
Yup, that string is Latin-1 encoded!Harmonie
wtf! Everything is utf8 nowRosewood
Well, it's not. Somewhere your chain is broken. Make sure you understand everything in the aforelinked article and double check everything in your own chain.Harmonie
@Harmonie Thanks for the advice. I've gone through the link and there were some inconsistencies which have been corrected now. However, I'm still having issues. Please see the update posted above.Rosewood
H
5

Setting a column to latin1 and others to utf8 is perfectly fine in MySQL. There's no problem to be solved here as such. This charset parameter just influences how the data is stored internally. Which of course also means that you cannot store, for example, "漢字" in a latin1 column. But assuming you're just storing "Latin-1 characters" in there, that's fine.

MySQL has something commonly called the connection encoding. It tells MySQL what encoding text is in that you send to it from PHP (or elsewhere), and what encoding you'd like back when retrieving data from MySQL. The column charset, the "input connection encoding" and "output connection encoding" can all be different things, MySQL will convert encodings on the fly accordingly as needed.

So, assuming you've used the correct connection encodings so far and data is stored properly in your database and you've not tried to store non-Latin-1 characters in Latin-1 columns, all you need to do to update your column charsets to UTF-8 is:

ALTER TABLE table MODIFY column TEXT [...] CHARACTER SET utf8;
Harmonie answered 21/10, 2013 at 14:32 Comment(0)
P
5

You can try mysqldump to convert from ISO-8859-1 to utf-8:

mysqldump --user=username --password=password --default-character-set=latin1 --skip-set-charset dbname > dump.sql
chgrep latin1 utf8 dump.sql (or when you prefer  sed -i "" 's/latin1/utf8/g' dump.sql) 
mysql --user=username --password=password --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
mysql --user=username --password=password --default-character-set=utf8 dbname < dump.sql
Pleuropneumonia answered 22/10, 2013 at 10:47 Comment(1)
+1'ed for the sed tidbit, and the alternate chgrep method as a bonus.Readable
A
3

You may get rid of the "glyph" characters (�) by applying utf8_encode to the string before displaying it in your page.

Aguie answered 16/12, 2013 at 22:17 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.