MySQL Convert latin1 data to UTF8
Asked Answered
D

7

9

I imported some data using LOAD DATA INFILE into a MySQL Database. The table itself and the columns are using the UTF8 character set, but the default character set of the database is latin 1. Because the default character type of the database is latin1, and I used LOAD DATA INFILE without specifying a character set, it interpreted the file as latin1, even though the data in the file was UTF8. Now I have a bunch of badly encoded data in my UTF8 colum. I found this article which seems to address a similar problem, which is "UTF8 inserted in cp1251", but my problem is "Latin1 inserted in UTF8". I've tried editing the queries there to convert the latin1 data to UTF8, but can't get it to work. Either the data comes out the same, or even more mangled than before. Just as an example, the word Québec is showing as Québec.

[ADDITIONAL INFO]

When Selecting the data wrapped in HEX(), Québec has the value 5175C383C2A9626563.

The Create Table (shortened) of this table is.

CREATE TABLE MyDBName.`MyTableName`
(
`ID` INT NOT NULL AUTO_INCREMENT, 
.......
`City` CHAR(32) NULL, 
.......
`)) ENGINE InnoDB CHARACTER SET utf8;
Durkin answered 17/9, 2009 at 19:20 Comment(1)
please post the CREATE TABLE statement for the table in question, along with a couple of the broken rows, but wrap the broken column in hex(), like this: SELECT HEX(name) FROM cities LIMIT 5. with this information, i can help you figure out the correct way to fix it according to that article. (BTW: i love that article! it's saved my butt quite a few times.)Kayak
H
12

I've had cases like this in old wordpress installations with the problem being that the data itself was already in UTF-8 within a Latin1 database (due to WP default charset). This means there was no real need for conversion of the data but the ddbb and table formats. In my experience things get messed up when doing the dump as I understand MySQL will use the client's default character set which in many cases is now UTF-8. Therefore making sure that exporting with the same coding of the data is very important. In case of Latin1 DDBB with UTF-8 coding:

$ mysqldump –default-character-set=latin1 –databases wordpress > m.sql

Then replace the Latin1 references within the exported dump before reimporting to a new database in UTF-8. Sort of:

$ replace "CHARSET=latin1" "CHARSET=utf8" \
    "SET NAMES latin1" "SET NAMES utf8" < m.sql > m2.sql

In my case this link was of great help. Commented here in spanish.

Hypsography answered 21/12, 2009 at 12:42 Comment(3)
I've seen MySQL dumps where this replace command wasn't sufficient because some columns were explicitly set to latin1. I did replace "latin1" "utf8mb4" <dump.latin1.sql >dump.utf8.sql to make everything in that table use UTF-8. NOTE however that "latin1" did not occur anywhere else in the dump (field contents) and, just to make sure, I checked the diff before importing it.Diazomethane
This answer basically solved it for me. I'm importing to MySQL v8 so character encoding is utf8mb4 instead of utf8. I also faced an ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes when importing. I can't change to InnoDB, so the solution here was to ensure varchar fields were <= 250 because utf8mb4 can be up to 4 bytes per character, ie: 4 * 250 = 1000. So I ran another search and replace, eg (using sed to replace m2.sql dump): sed -i "s/varchar(254)/varchar(250)/g" m2.sql. You might need to handle other values greater than 250, eg: 255, 256, etcIncite
Additionally I had to use -r instead of >. Credit to this commentIncite
L
7

Though it is hardly still actual for the OP, I happen to have found a solution in MySQL documentation for ALTER TABLE. I post it here just for future reference:

Warning

The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;

The reason this works is that there is no conversion when you convert to or from BLOB columns.

Laurin answered 17/1, 2012 at 16:52 Comment(1)
With mixed content this does not work: Code: 1366 SQL State: HY000 --- Incorrect string value: '\xE4chste...' for column 'kommentar' at row 1Encomiast
K
2

LOAD DATA INFILE allows you to set an encoding file is supposed to be in:

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Kathlyn answered 17/9, 2009 at 19:46 Comment(2)
Yeah, I wish I would have realized this before hand, but now the data is already mangled. I wanted to know if I could fix it without reimporting it.Durkin
Yes! This is the option: [CHARACTER SET charset_name]Kurtis
I
2

I wrote that http://code.google.com/p/mysqlutf8convertor/ for Latin Database to UTF-8 Database. All tables and field to change UTF-8.

Ious answered 27/12, 2009 at 4:16 Comment(1)
You might want to import into github. Google Code is read-only nowDichromatic
R
1

Converting latin1 to UTF8 is not what you want to do, you kind of need the opposite.

If what really happened was this:

  1. UTF-8 strings were interpreted as Latin-1 and transcoded to UTF-8, mangling them.
  2. You are now, or could be, reading UTF-8 strings with no further interpretation

What you must do now is:

  1. Read the "UTF-8" with no transcode.
  2. Convert it to Latin-1. Now you should actually have the original UTF-8.
  3. Now put it in your "UTF-8" column with no further conversion.
Radiolocation answered 17/9, 2009 at 19:35 Comment(0)
R
1

I recently completed a shell script that automates the conversion process. It is also configurable to write custom filters for any text you wish to replace or remove. For example : stripping HTML characters etc. Table whitelists and blacklists are also possible. You can download it at sourceforge: https://sourceforge.net/projects/mysqltr/

Ragsdale answered 20/3, 2013 at 21:36 Comment(0)
S
0

Try this:

1) Dump your DB

mysqldump --default-character-set=latin1 -u username -p databasename < dump.sql

2) Open dump.sql in text editor and replace all occurences of "SET NAMES latin1" by "SET NAMES utf8"

3) Create a new database and restore your dumpfile

cat dump.sql | mysql -u root -p newdbname
Slating answered 16/5, 2012 at 15:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.