mysqldump exporting data in a bad character set
Asked Answered
W

2

6

Yesterday for the first time I exported my Mysql database and I found some very strange characters in the dump such as:

INSERT INTO `piwik_archive_blob_2013_01` VALUES (15,'Actions_actions_url_6',1,'2013-01-17','2013-01-17',1,'2013-01-20 07:36:53','xuNM0ý/œ#&ÝÕ³\ZõNYpÊÀì#!üw7Hж}°ÀAáZoN*šgµ\'GWª[Yûðe¯57 ÃÁÆ7|Ÿ\'Ü%µDh©-EÛ^ËL±ÕÞtªk@(,b±ßZ.ÒÃ6b²aiÓÍ)87[­ïÎœ,æya¥uÒ<|+íª7MNuïÝ¿8ñ%1Ʊ>Ú­X');

The version of my server MySQL is: 5.1.66-0+squeeze1 (Debian). This database was created automatically by the Piwik setup script.

Here is what I tried to fix this problem:

#1 First I checked the database charset.

> show table status;

The 26 tables has the collation utf8_general_ci which sounded quite normal. I guessed mysqldump exported in a different charset (latin1?) So I tried:

mysqldump -u user -p**** --all-databases --default-character-set=utf8 | gzip -9 > dump.sql.gz

Result = I still had the same strange characters.

Note) later I learned that the default charset for mysqldump is utf8, regardless of server default charset. So --default-character-set=utf8 was useless.

#2 Then I thought I could solve the problem by updating the mysql conf. The original conf was:

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

So I updated /var/lib/mysql/my.cnf and I added:

[mysqld]
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
character-set-server=utf8
collation-server=utf8_general_ci
default-character-set=utf8
default-collation=utf8_general_ci

[mysqldump]
default-character-set=utf8

Then

/etc/init.d/mysql restart
mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+

Result = same strange characters.

#3 I changed character_set_database and collation_database:

mysql> ALTER DATABASE piwik default character SET utf8 collate utf8_general_ci;

mysql> show variables like "%character%";show variables like "%collation%";

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+

Result = same strange characters.

#4 I understand now I should have changed the default character set in MySQL (latin1) to utf8 before creating the database.

The collation utf8_general_ci (#1) means data is stored in utf8. However, would it be possible that mysqldump thinks data is stored in latin1 and encodes data in utf8? It would mean at the end data is double utf8 encoded (sigh). In this case how could I fix the problem?

Thanks for your help.

ps) I wonder why Piwik doesn't require to change the database default charset to utf8.

Wonderwork answered 30/3, 2013 at 12:24 Comment(4)
What should be where you're seeing those characters? How exactly are you viewing these characters?Baten
I see these characters when I open the sql file with ZendStudio.Wonderwork
And what is the data you expect to be there?Baten
The solution here will work in this case as well: https://mcmap.net/q/1919921/-restore-mysql-db-with-unicode-characters-arabic-and-kurdish-in-gz-formatBolzano
A
0

According to the table name "piwik_archive_blob_2013_01", I guess the column containing the strange characters is of type BLOB.

BLOB columns contains binary data. That's why it contains these strange characters. This is expected.

Don't worry about it, I'm pretty sure MysqlDump knows how to dump this data.

Cheers, Eric.

Abdicate answered 22/3, 2014 at 22:30 Comment(2)
You'll be surprised at what mysqldump does and doesn't know... it's extremely unintelligent, and will happily produce borked files requiring careful un-borking to import.Tetragon
--hex-blob will generally work for this with binary data. The character set solutions won't work for blobs, as this answer saysGlendaglenden
L
0

It could be the operating system is changing the charset during the export and ignoring default-character-set parameter.

To make sure that the export is not using operating system charset use the parameter result-file

Lepido answered 18/3, 2017 at 12:44 Comment(1)
as of today, link seems brokenPrototherian

© 2022 - 2024 — McMap. All rights reserved.