Problem Summary:
While trying to convert a site with mysql database from latin1 to utf8, some special characters are not displaying correctly despite ensuring charsets are all utf8 system wide.
Problem Details:
This is a common problem. But I seem to have an added complexity.
Years ago, a oblivious developer (me), put a site together with MySQL. Some tables were setup with latin1_swedish_ci and utf8_general_ci. All input/display was done via pages with iso-8859-1 charset.
Now, I have the task of turning all this data into utf-8 and thus finally uniforming the encoding. However, i'm having issues with a number of special characters in both instances (ie: ü). The characters don't seem to display correctly on a UTF-8 page. They display as �.Instead When viewing the data in a utf8 table in mysql query browser, a correctly entered utf8'd 'u' displays as some special characters, while an incorrectly latin1 'u' displays as it should appear on page. But it doesn't.
I've tried a number of things:
- Percona script: https://github.com/rlowe/mysql_convert_charset
- converting col to binary and then to utf8
- converting utf8 tables to latin and then repeat above process
Nothing seems to cure the data.
Dumping the entire database and important isn't really a viable option as it's a huge database now and downtime is restricted.
UPDATE (22-Oct-2013)
I've taken @deceze suggestions and reviewed all my content encoding areas as per http://kunststube.net/frontback/. I did find a few places in which I was still passing/encoding data in latin1. So, i've now changed it all over to UTF-8. However, the data is still displaying incorrectly in a particular field. In a table which is in utf8 (no columns have implicit encoding), field1 is in latin1. I can confirm this by running the following which displays the text correctly:
select convert(cast(convert(field1 using latin1) as binary) using utf8) from my table WHERE id = 1
This will convert Hahnemühle to Hahnemühle.
In field2, it appears the data is in a different (unknown) encoding. The query above, when used on field2 converts Hahnem�hle to Hahnem�hle. I've gone through all the charsets on http://dev.mysql.com/doc/refman/5.5/en/charset-charsets.html replacing latin1 but none seem to spit out the data correctly.
latin1
column simply cannot store more than the 256 characters defined by Latin-1, other than that it doesn't matter in daily use. If you've inserted data using the correct connection encoding, there should be no problem. – Harmonieutf8
, you should get UTF-8 encoded data, however it is stored in your columns. Somewhere along the chain database → PHP → browser the data is converted to something that's not UTF-8. Can't tell you where with the information given. – Harmonieutf8
, it'll give you valid UTF-8. The characters may be garbage because you've mistreated them in the past, but it should be valid UTF-8 encoded garbage. – Harmoniebin2hex($string)
value of the string in question and the text that should show. – Harmonie