Detecting utf8 broken characters in MySQL
Asked Answered
C

17

51

I've got a database with a bunch of broken utf8 characters scattered across several tables. The list of characters isn't very extensive AFAIK (áéíúóÁÉÍÓÚÑñ)

Fixing a given table is very straightforward

update orderItem set itemName=replace(itemName,'á','á');

But I can't get a way of detecting the broken characters. If I do something like

SELECT * FROM TABLE WHERE field LIKE "%Ã%";

I get nearly all the fields because of the collation (Ã=a). All broken characters so far start with an "Ã". The database is in spanish so this particular character isn't used

The list of broken chars I've got so far is

á = á
é = é
í- = í
ó = ó
ñ = ñ
á = Á

Any idea of how to make this SELECT to work as intended? (a binary search or something like that)

Cockroach answered 25/9, 2009 at 9:32 Comment(3)
Are you sure you’re using UTF-8? Those “broken characters” seem as you’re interpreting UTF-8 encoded data with ISO 8859-1. Because the character á is encoded with UTF-8 as 0xC3A1 and that represents the characters à and ¡ in ISO 8859-1.Sacristan
Don't look at me. My work is to fix this db. I didn't broke it.Cockroach
This is called "Mojibake" and is discussed hereIlan
I
10

How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary, then to utf-8 and then to iso-8859-1 or whatever else you're using. See the manual for the details.

Intellectualize answered 25/9, 2009 at 12:36 Comment(4)
the idea is to end up whit a utf-8 encoded db. right now the encoding a collation is utf-8 general. but apparently the application that use the db was interpreting as ISO8859. If I convert it back and forth, I will end with the same data...Cockroach
well, only converting back and forth doesn't do magic, the characters stills broken. BUT in binary I can make a select looking for the à character. So now I have a mechanism to detect the broken chars. Thanks.Cockroach
Okay, I remain convinced that there must be a way to use the conversion mechanism in a more general way but it might be more complex than first stated. Happy you found a solution that worked.Intellectualize
If you mess up when altering column type, you may lose data. Instead with UPDATE queries, you can run UPDATE queries inside a transaction, check the result, and commit only if this is ok or rollback if not.Afterworld
N
89

I fixed with

UPDATE wp_zcs9ck_posts_copy SET post_title = 
    CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8);

Complete solution: http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/

Neurovascular answered 9/8, 2013 at 1:36 Comment(5)
Backup your database first... this deleted everything but the first half sentence in my wp_posts -> post_content field!Debroahdebs
Worked for me as well. Consider using SELECT name, CONVERT(BINARY CONVERT(name USING latin1) USING utf8) AS conv FROM table WHERE id IN (SELECT id FROM table WHERE LENGTH(name) != CHAR_LENGTH(name)); to find broken records and see the result before using UPDATE.Sarasvati
I made unnecessary subquery in the comment above. Use this SELECT name, CONVERT(BINARY CONVERT(name USING latin1) USING utf8) AS conv FROM table WHERE LENGTH(name) != CHAR_LENGTH(name) insteadSarasvati
This answer applies to double-encoding, not simple Mojibake.Ilan
I have found that checking if CONVERT(BINARY CONVERT(bundle_display_name USING latin1) USING utf8) is not NULL before doing the update helps preventing already correctly encoded values getting broken (the test with length/char_length doesn't work well as "é" will be 3/1 but it's correct value "é" is still 2/1)Herat
T
41
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ó','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ú','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ñ','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'–','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'’','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'…','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'–','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'“','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‘','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'•','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
Tb answered 25/8, 2010 at 16:29 Comment(3)
What about processing the .sql file with a scripting language like Perl and replace all the "wrong" characters? In Perl it can be easily done with something like this: $content =~ s/á/á/g; Repeat the same sentence for the rest of the characters. $content is a string storing the whole file.Courbet
This worked for me, though I had a totally different character set, so had to do every character manually using Adam Lynch's trick(https://mcmap.net/q/344309/-detecting-utf8-broken-characters-in-mysql), Only had to run queries 409 times, 'So I got that goin for me which is nice'.Brena
I've been trying to update a number of tables from other datasources for weeks using more comprehensive methods but this is by far the easiest. thanks! Was assembling this info myself today.Rhinelandpalatinate
V
24

Thanks for your answers!!

I fixed my tables with this, and wanted to share the full list of changes. Note that it also includes fixing html decoded characters, besides latin ones, it was really a mess:

(If you need more conversions, look them up at https://www.utf8-chartable.de/unicode-utf8-table.pl)

update `table` set `field` = replace(`field`, 'â€', '"');
update `table` set `field` = replace(`field`, '–', '–');
update `table` set `field` = replace(`field`, '•', '-');
update `table` set `field` = replace(`field`, '“', '"');

update `table` set `field` = replace(`field`, '¡', '¡');
update `table` set `field` = replace(`field`, '¢', '¢');
update `table` set `field` = replace(`field`, '£', '£');
update `table` set `field` = replace(`field`, '¤', '¤');
update `table` set `field` = replace(`field`, 'Â¥', '¥');
update `table` set `field` = replace(`field`, '¦', '¦');
update `table` set `field` = replace(`field`, '§', '§');
update `table` set `field` = replace(`field`, '¨', '¨');
update `table` set `field` = replace(`field`, '©', '©');
update `table` set `field` = replace(`field`, 'ª', 'ª');
update `table` set `field` = replace(`field`, '«', '«');
update `table` set `field` = replace(`field`, '¬', '¬');
# This one looks like it's missing a character, but it's there. 0xad
update `table` set `field` = replace(`field`, '­', '­');
update `table` set `field` = replace(`field`, '®', '®');
update `table` set `field` = replace(`field`, '¯', '¯');
update `table` set `field` = replace(`field`, '°', '°');
update `table` set `field` = replace(`field`, '±', '±');
update `table` set `field` = replace(`field`, '²', '²');
update `table` set `field` = replace(`field`, '³', '³');
update `table` set `field` = replace(`field`, '´', '´');
update `table` set `field` = replace(`field`, 'µ', 'µ');
update `table` set `field` = replace(`field`, '¶', '¶');
update `table` set `field` = replace(`field`, '·', '·');
update `table` set `field` = replace(`field`, '¸', '¸');
update `table` set `field` = replace(`field`, '¹', '¹');
update `table` set `field` = replace(`field`, 'º', 'º');
update `table` set `field` = replace(`field`, '»', '»');
update `table` set `field` = replace(`field`, '¼', '¼');
update `table` set `field` = replace(`field`, '½', '½');
update `table` set `field` = replace(`field`, '¾', '¾');
update `table` set `field` = replace(`field`, '¿', '¿');

update `table` set `field` = replace(`field`, 'À', 'À');
# This one looks like it's missing a character, but it's there. 0x81
update `table` set `field` = replace(`field`, 'Ã', 'Á');
update `table` set `field` = replace(`field`, 'Â', 'Â');
update `table` set `field` = replace(`field`, 'Ã', 'Ã');
update `table` set `field` = replace(`field`, 'Ä', 'Ä');
update `table` set `field` = replace(`field`, 'Ã…', 'Å');
update `table` set `field` = replace(`field`, 'Æ', 'Æ');
update `table` set `field` = replace(`field`, 'Ç', 'Ç');
update `table` set `field` = replace(`field`, 'È', 'È');
update `table` set `field` = replace(`field`, 'É', 'É');
update `table` set `field` = replace(`field`, 'Ê', 'Ê');
update `table` set `field` = replace(`field`, 'Ë', 'Ë');
update `table` set `field` = replace(`field`, 'ÃŒ', 'Ì');
# This one looks like it's missing a character, but it's there. 0x8d
update `table` set `field` = replace(`field`, 'Ã', 'Í');
update `table` set `field` = replace(`field`, 'ÃŽ', 'Î');
# This one looks like it's missing a character, but it's there. 0x8f
update `table` set `field` = replace(`field`, 'Ã', 'Ï');
# This one looks like it's missing a character, but it's there. 0x90
update `table` set `field` = replace(`field`, 'Ã', 'Ð');
update `table` set `field` = replace(`field`, 'Ñ', 'Ñ');
update `table` set `field` = replace(`field`, 'Ã’', 'Ò');
update `table` set `field` = replace(`field`, 'Ó', 'Ó');
update `table` set `field` = replace(`field`, 'Ô', 'Ô');
update `table` set `field` = replace(`field`, 'Õ', 'Õ');
update `table` set `field` = replace(`field`, 'Ö', 'Ö');
update `table` set `field` = replace(`field`, '×', '×');
update `table` set `field` = replace(`field`, 'Ø', 'Ø');
update `table` set `field` = replace(`field`, 'Ù', 'Ù');
update `table` set `field` = replace(`field`, 'Ú', 'Ú');
update `table` set `field` = replace(`field`, 'Û', 'Û');
update `table` set `field` = replace(`field`, 'Ãœ', 'Ü');
# This one looks like it's missing a character, but it's there. 0x9d
update `table` set `field` = replace(`field`, 'Ã', 'Ý');
update `table` set `field` = replace(`field`, 'Þ', 'Þ');
update `table` set `field` = replace(`field`, 'ß', 'ß');
update `table` set `field` = replace(`field`, 'à', 'à');
update `table` set `field` = replace(`field`, 'á', 'á');
update `table` set `field` = replace(`field`, 'â', 'â');
update `table` set `field` = replace(`field`, 'ã', 'ã');
update `table` set `field` = replace(`field`, 'ä', 'ä');
update `table` set `field` = replace(`field`, 'Ã¥', 'å');
update `table` set `field` = replace(`field`, 'æ', 'æ');
update `table` set `field` = replace(`field`, 'ç', 'ç');
update `table` set `field` = replace(`field`, 'è', 'è');
update `table` set `field` = replace(`field`, 'é', 'é');
update `table` set `field` = replace(`field`, 'ê', 'ê');
update `table` set `field` = replace(`field`, 'ë', 'ë');
update `table` set `field` = replace(`field`, 'ì', 'ì');
# This one looks like it's missing a character, but it's there. 0xad
update `table` set `field` = replace(`field`, '­­Ã', 'í');
update `table` set `field` = replace(`field`, 'î', 'î');
update `table` set `field` = replace(`field`, 'ï', 'ï');
update `table` set `field` = replace(`field`, 'ð', 'ð');
update `table` set `field` = replace(`field`, 'ñ', 'ñ');
update `table` set `field` = replace(`field`, 'ò', 'ò');
update `table` set `field` = replace(`field`, 'ó', 'ó');
update `table` set `field` = replace(`field`, 'ô', 'ô');
update `table` set `field` = replace(`field`, 'õ', 'õ');
update `table` set `field` = replace(`field`, 'ö', 'ö');
update `table` set `field` = replace(`field`, '÷', '÷');
update `table` set `field` = replace(`field`, 'ø', 'ø');
update `table` set `field` = replace(`field`, 'ù', 'ù');
update `table` set `field` = replace(`field`, 'ú', 'ú');
update `table` set `field` = replace(`field`, 'û', 'û');
update `table` set `field` = replace(`field`, 'ü', 'ü');
update `table` set `field` = replace(`field`, 'ý', 'ý');
update `table` set `field` = replace(`field`, 'þ', 'þ');
update `table` set `field` = replace(`field`, 'ÿ', 'ÿ');

update `table` set `field` = replace(`field` ,'ç','ç');
update `table` set `field` = replace(`field` ,'ã','ã');
update `table` set `field` = replace(`field` ,'á','á');
update `table` set `field` = replace(`field` ,'â','â');
update `table` set `field` = replace(`field` ,'é','é');
update `table` set `field` = replace(`field` ,'í','í');
update `table` set `field` = replace(`field` ,'õ','õ');
update `table` set `field` = replace(`field` ,'ú','ú');
update `table` set `field` = replace(`field` ,'ç','ç');
update `table` set `field` = replace(`field` ,'Á','Á');
update `table` set `field` = replace(`field` ,'Â','Â');
update `table` set `field` = replace(`field` ,'É','É');
update `table` set `field` = replace(`field` ,'Í','Í');
update `table` set `field` = replace(`field` ,'Õ','Õ');
update `table` set `field` = replace(`field` ,'Ú','Ú');
update `table` set `field` = replace(`field` ,'Ç','Ç');
update `table` set `field` = replace(`field` ,'Ã','Ã');
update `table` set `field` = replace(`field` ,'À','À');
update `table` set `field` = replace(`field` ,'Ê','Ê');
update `table` set `field` = replace(`field` ,'Ó','Ó');
update `table` set `field` = replace(`field` ,'Ô','Ô');
update `table` set `field` = replace(`field` ,'Ü','Ü');
update `table` set `field` = replace(`field` ,'ã','ã');
update `table` set `field` = replace(`field` ,'à','à');
update `table` set `field` = replace(`field` ,'ê','ê');
update `table` set `field` = replace(`field` ,'ó','ó');
update `table` set `field` = replace(`field` ,'ô','ô');
update `table` set `field` = replace(`field` ,'ü','ü');
update `table` set `field` = replace(`field` ,'&','&');
update `table` set `field` = replace(`field` ,'>','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&circ;','ˆ');
update `table` set `field` = replace(`field` ,'&tilde;','˜');
update `table` set `field` = replace(`field` ,'&uml;','¨');
update `table` set `field` = replace(`field` ,'&cute;','´');
update `table` set `field` = replace(`field` ,'&cedil;','¸');
update `table` set `field` = replace(`field` ,'&quot;','"');
update `table` set `field` = replace(`field` ,'&ldquo;','“');
update `table` set `field` = replace(`field` ,'&rdquo;','”');
update `table` set `field` = replace(`field` ,'&lsquo;','‘');
update `table` set `field` = replace(`field` ,'&rsquo;','’');
update `table` set `field` = replace(`field` ,'&lsaquo;','‹');
update `table` set `field` = replace(`field` ,'&rsaquo;','›');
update `table` set `field` = replace(`field` ,'&laquo;','«');
update `table` set `field` = replace(`field` ,'&raquo;','»');
update `table` set `field` = replace(`field` ,'&ordm;','º');
update `table` set `field` = replace(`field` ,'&ordf;','ª');
update `table` set `field` = replace(`field` ,'&ndash;','–');
update `table` set `field` = replace(`field` ,'&mdash;','—');
update `table` set `field` = replace(`field` ,'&macr;','¯');
update `table` set `field` = replace(`field` ,'&hellip;','…');
update `table` set `field` = replace(`field` ,'&brvbar;','¦');
update `table` set `field` = replace(`field` ,'&bull;','•');
update `table` set `field` = replace(`field` ,'&para;','¶');
update `table` set `field` = replace(`field` ,'&sect;','§');
update `table` set `field` = replace(`field` ,'&sup1;','¹');
update `table` set `field` = replace(`field` ,'&sup2;','²');
update `table` set `field` = replace(`field` ,'&sup3;','³');
update `table` set `field` = replace(`field` ,'&frac12;','½');
update `table` set `field` = replace(`field` ,'&frac14;','¼');
update `table` set `field` = replace(`field` ,'&frac34;','¾');
update `table` set `field` = replace(`field` ,'&#8539;','⅛');
update `table` set `field` = replace(`field` ,'&#8540;','⅜');
update `table` set `field` = replace(`field` ,'&#8541;','⅝');
update `table` set `field` = replace(`field` ,'&#8542;','⅞');
update `table` set `field` = replace(`field` ,'&gt;','>');
update `table` set `field` = replace(`field` ,'&lt;','<');
update `table` set `field` = replace(`field` ,'&plusmn;','±');
update `table` set `field` = replace(`field` ,'&minus;','−');
update `table` set `field` = replace(`field` ,'&times;','×');
update `table` set `field` = replace(`field` ,'&divide;','÷');
update `table` set `field` = replace(`field` ,'&lowast;','∗');
update `table` set `field` = replace(`field` ,'&frasl;','⁄');
update `table` set `field` = replace(`field` ,'&permil;','‰');
update `table` set `field` = replace(`field` ,'&int;','∫');
update `table` set `field` = replace(`field` ,'&sum;','∑');
update `table` set `field` = replace(`field` ,'&prod;','∏');
update `table` set `field` = replace(`field` ,'&radic;','√');
update `table` set `field` = replace(`field` ,'&infin;','∞');
update `table` set `field` = replace(`field` ,'&asymp;','≈');
update `table` set `field` = replace(`field` ,'&cong;','≅');
update `table` set `field` = replace(`field` ,'&prop;','∝');
update `table` set `field` = replace(`field` ,'&equiv;','≡');
update `table` set `field` = replace(`field` ,'&ne;','≠');
update `table` set `field` = replace(`field` ,'&le;','≤');
update `table` set `field` = replace(`field` ,'&ge;','≥');
update `table` set `field` = replace(`field` ,'&there4;','∴');
update `table` set `field` = replace(`field` ,'&sdot;','⋅');
update `table` set `field` = replace(`field` ,'&middot;','·');
update `table` set `field` = replace(`field` ,'&part;','∂');
update `table` set `field` = replace(`field` ,'&image;','ℑ');
update `table` set `field` = replace(`field` ,'&real;','ℜ');
update `table` set `field` = replace(`field` ,'&prime;','′');
update `table` set `field` = replace(`field` ,'&Prime;','″');
update `table` set `field` = replace(`field` ,'&deg;','°');
update `table` set `field` = replace(`field` ,'&ang;','∠');
update `table` set `field` = replace(`field` ,'&perp;','⊥');
update `table` set `field` = replace(`field` ,'&nabla;','∇');
update `table` set `field` = replace(`field` ,'&oplus;','⊕');
update `table` set `field` = replace(`field` ,'&otimes;','⊗');
update `table` set `field` = replace(`field` ,'&alefsym;','ℵ');
update `table` set `field` = replace(`field` ,'&oslash;','ø');
update `table` set `field` = replace(`field` ,'&Oslash;','Ø');
update `table` set `field` = replace(`field` ,'&isin;','∈');
update `table` set `field` = replace(`field` ,'&notin;','∉');
update `table` set `field` = replace(`field` ,'&cap;','∩');
update `table` set `field` = replace(`field` ,'&cup;','∪');
update `table` set `field` = replace(`field` ,'&sub;','⊂');
update `table` set `field` = replace(`field` ,'&sup;','⊃');
update `table` set `field` = replace(`field` ,'&sube;','⊆');
update `table` set `field` = replace(`field` ,'&supe;','⊇');
update `table` set `field` = replace(`field` ,'&exist;','∃');
update `table` set `field` = replace(`field` ,'&forall;','∀');
update `table` set `field` = replace(`field` ,'&empty;','∅');
update `table` set `field` = replace(`field` ,'&not;','¬');
update `table` set `field` = replace(`field` ,'&and;','∧');
update `table` set `field` = replace(`field` ,'&or;','∨');
update `table` set `field` = replace(`field` ,'&crarr;','↵');
Virtues answered 4/9, 2011 at 22:16 Comment(2)
That was the only solution that worked to me! The binary thing didnt work, even in a SELECT. Thanx!Fullerton
This is a great list but the ordering is problematic because by the time the search for "•" occurs for example, the †part at the beginning will have been replaced.Nescience
T
17

No text replacement is a universal solutions because you can forget some character. A more suitable fix for double converted characters is:

  1. convert back to latin1
  2. convert to binary
  3. convert to utf8

Like this:

alter table descriptions modify name VARCHAR(2000) character set latin1;
alter table descriptions modify name blob;
alter table descriptions modify name VARCHAR(2000) character set utf8;
Tamatave answered 6/2, 2012 at 12:8 Comment(5)
Though now I detected that it didn't work completely. Text that had a ' character was cut off at that character after the 3 alter table commands. :-(Maggoty
This answer worked well for me. I added a simple script to my class collection, which will do this for all tables in a database. Instruction can be found here: os-cms.net/blog/view/34/Fix-utf8-broken-characters-in-MySQL Just if someone would find it useful.Knurly
This truncated columns data where non UTF 8 characters present and chinese characters replaced with question marks - Not a good solution for ones don't want to lose data.Tove
Shorter to do it in 2 steps: ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL; ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;Ilan
Wow. This is awesome. Save me a lot of headache. Would be nice to have a script that does this to all varchar columns in all tables inside a db.Daffodil
D
16

The SELECT statement you need is the following:

SELECT * FROM TABLE WHERE LENGTH(name) != CHAR_LENGTH(name);

This returns all rows which contain multi-byte characters.

name is assumed to be a field / the field where weird characters would be found. *

Debidebilitate answered 24/5, 2011 at 14:41 Comment(0)
E
16

This saved my life

UPDATE ohp_posts SET post_content = CONVERT(CAST(CONVERT(post_content USING latin1) AS BINARY) USING utf8)

I've found it here http://stanis.net/2014/04/replacing-latin-1-with-utf-8-characters-in-mysql/

Elative answered 12/11, 2014 at 17:38 Comment(0)
I
10

How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary, then to utf-8 and then to iso-8859-1 or whatever else you're using. See the manual for the details.

Intellectualize answered 25/9, 2009 at 12:36 Comment(4)
the idea is to end up whit a utf-8 encoded db. right now the encoding a collation is utf-8 general. but apparently the application that use the db was interpreting as ISO8859. If I convert it back and forth, I will end with the same data...Cockroach
well, only converting back and forth doesn't do magic, the characters stills broken. BUT in binary I can make a select looking for the à character. So now I have a mechanism to detect the broken chars. Thanks.Cockroach
Okay, I remain convinced that there must be a way to use the conversion mechanism in a more general way but it might be more complex than first stated. Happy you found a solution that worked.Intellectualize
If you mess up when altering column type, you may lose data. Instead with UPDATE queries, you can run UPDATE queries inside a transaction, check the result, and commit only if this is ok or rollback if not.Afterworld
B
7

To convert all latin characters to correct accents try this on MySQL :

UPDATE your_table SET your_column = CONVERT(CAST(CONVERT(your_column USING latin1) AS BINARY) USING utf8)
Beardless answered 15/10, 2020 at 9:9 Comment(0)
W
3

I had this same problem but didn't like the replace() solution because there's always the possibility of missing some characters. I was working against a column with mixed data (some had been utf8_encode()d and some not) with 4 million or so rows, about 250k records with mis-encoded data (with É/etc characters), covering about 15 international languages, including mainly European languages but also Russian, Japanese and Chinese.

I started by duplicating the column, since I didn't want to lose any data:

ALTER TABLE images ADD COLUMN reptitle TEXT;

Copied all the data with multibyte characters (thanks Adam for the tip)

UPDATE images SET reptitle = title WHERE LENGTH(title) != CHAR_LENGTH(title)

Since reptitle was created with the table's default character set it was already utf8, but contained the corrupted data since images table used to be a latin source. Column reptitle now contains some data which is correctly encoded, and some corrupted (all values with multibyte characters, some had been correctly utf8_encode()d. So then with David's tip...

ALTER TABLE images MODIFY reptitle TEXT character set latin1;
ALTER TABLE images MODIFY reptitle BLOB;
ALTER TABLE images MODIFY reptitle TEXT character set utf8;

The middle step may not have been necessary since TEXT and BLOB (I think) are the same. This had the effect of correcting all incorrectly encoded data ('étudiantes' became 'étudiantes', etc) but data which was previously correct was truncated at the first multibyte character ('Lapin de Pâques' became 'Lapin de P'). I don't know why the truncation, but it's in a disposable column so I didn't care. The truncated data gives CHAR_LENGTH and LENGTH of the same values because there are no multi-byte characters remaining so easy query...

UPDATE images SET title = reptitle WHERE LENGTH(reptitle)!=CHAR_LENGTH(reptitle)

Then of course just drop the spare column

ALTER TABLE images DROP COLUMN reptitle

Also make sure (since I use PHP and this had tripped me up a couple of times so I thought I'd mention it here) all your script files are UTF8 (without BOM) and you are using:

mysql_set_charset('utf8', $connection);

Et voilà... perfectly repaired data, all languages :)

Wheelock answered 2/6, 2012 at 2:36 Comment(1)
Thanks for this - the method of adding a column and then copying them across is a good one, and laying it out like this is helpful.Ending
D
2

In addition to Raúl Avila Solano and acseven's answer if you want to update all the broken characters in one query you can do:

update `table` set field = replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(field,'&uuml;','ü'),'&ocirc;','ô'),'&oacute;','ó'),'&ecirc;','ê'),'&agrave;','à'),'&atilde;','ã'),'&Uuml;','Ü'),'&Ocirc;','Ô'),'&Oacute;','Ó'),'&Ecirc;','Ê'),'&Agrave;','À'),'&Atilde;','Ã'),'&Ccedil;','Ç'),'&Uacute;','Ú'),'&Otilde;','Õ'),'&Iacute;','Í'),'&Iacute;','Í'),'&Eacute;','É'),'&Acirc;','Â'),'&Aacute;','Á'),'&ccedil;','ç'),'&uacute;','ú'),'&otilde;','õ'),'&iacute;','í'),'&eacute;','é'),'&acirc;','â'),'&aacute;','á'),'&atilde;','ã'),'&ccedil;','ç'),'à ','à'),'à ','à'),'º','º'),'ª','ª'),'ç','ç'),'–','–'),'ó','ó'),'é','é'),'á','á'),'ê','ê'),'ã','ã'),'â','â'),'í','í'),'õ','õ'),'Ø','Ø'),'•','-'),'ú','ú'),'à ','À'),'Ã','Ã'),'Ç','Ç'),'â€','"'),'“','"'),'É','É');
Distressed answered 18/12, 2013 at 0:9 Comment(0)
A
2

this also solved my problem on some italian chars

UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'á','á');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ä','ä');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'é','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í©','é');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ó','ó');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íº','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ú','ú');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'ñ','ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í‘','Ñ');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Ã','í');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'–','–');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'’','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'…','...');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'–','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'“','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'â€','"');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‘','\'');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'•','-');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name`,'‡','c');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'Â','');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í ','à');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¨','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'íˆ','È');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'€','€');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'eÌ€','è');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í²','ò');
UPDATE `table_name` SET `column_name` = REPLACE(`column_name` ,'í¹','ù');
Amphitrite answered 23/10, 2015 at 8:0 Comment(0)
A
2

You might have rows with properly encoded UTF8 and with wrongly encoded characters. In this case "CONVERT(BINARY CONVERT(post_title USING latin1) USING utf8)" will trim some fields.

I ended up doing it this way

update `table` set `name` = replace(`name` ,CONVERT(BINARY "ä" USING latin1),'ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ö" USING latin1),'ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ü" USING latin1),'ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ä" USING latin1),'Ä');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ö" USING latin1),'Ö');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "Ü" USING latin1),'Ü');
update `table` set `name` = replace(`name` ,CONVERT(BINARY "ß" USING latin1),'ß');
Acidify answered 3/11, 2017 at 23:51 Comment(1)
This helped me so much! :3 Thank you!Arezzini
R
1

This is an extension of @Thales Ceolin's answer in order to modify every table in the db:

select concat(
    "update ", 
    a.TABLE_NAME, 
    " set ", b.COLUMN_NAME, 
    " = CONVERT(BINARY CONVERT(", 
    b.COLUMN_NAME, 
    " USING latin1) USING utf8) where ",
    b.COLUMN_NAME, 
    " is not null;") query
from INFORMATION_SCHEMA.TABLES a
left join INFORMATION_SCHEMA.COLUMNS b on a.TABLE_NAME = b.TABLE_NAME
where a.table_schema = 'db_name'
and a.TABLE_TYPE = 'BASE TABLE'
and b.data_type in ('text', 'varchar')
and a.TABLE_NAME = 'table_name';

This will result in:

update table_name set idn = CONVERT(BINARY CONVERT(idn USING latin1) USING utf8) where idn is not null;
update table_nameset name = CONVERT(BINARY CONVERT(name USING latin1) USING utf8) where name is not null;
update table_name set primary_last_name = CONVERT(BINARY CONVERT(primary_last_name USING latin1) USING utf8) where primary_last_name is not null;
Resound answered 19/4, 2018 at 22:58 Comment(0)
M
1

Based on data in this post https://www.i18nqa.com/debug/utf8-debug.html I'd suggest this is a good query of identifying dodgy entries and possible correct values:

SELECT my_field,CONVERT(BINARY CONVERT(my_field USING latin1) USING utf8mb4) AS new_field_value FROM my_table WHERE my_field REGEXP '[âÆËÅÂÃ]';

Be very careful because we had a bad encoding of a file name, but an OK encoding of the path, and in that case some of the solutions above would have caused a world of pain. If some of your data is already correctly encoded in UTF8 you'll likely find you lost a chunk of it.

Muffle answered 26/11, 2018 at 12:31 Comment(0)
W
0

The middle step may not have been necessary since TEXT and BLOB are the same.

This had the effect of correcting all incorrectly encoded data but data which was previously correct was truncated at the first multibyte character.

Woodard answered 20/9, 2012 at 3:46 Comment(0)
A
0

As the main question was on detecting broken chars my solution: (To prevent normal charset double encoding)

  1. Detection ( latin1 to utf8 )
SELECT name FROM %table% 
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
  1. Update( latin1 to utf8 )
UPDATE %table% SET name = convert(cast(convert(name using latin1 ) as binary) using utf8 )
 WHERE 
CONVERT(CONVERT(name USING BINARY) USING utf8 ) != CONVERT(CONVERT(CONVERT(CONVERT(name USING BINARY) USING latin1) USING BINARY) USING utf8);
Avisavitaminosis answered 11/1, 2019 at 11:24 Comment(0)
C
0

This query helped me identify rows that had bad characters in them. Basically you identify where the field isn't null, then convert to UTF8 and check for null after conversion.

select ach.*
from ach_warehouse ach
where addendum is not null and convert(addendum using utf8) is null;
Cufic answered 13/10, 2020 at 23:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.