How to detect UTF-8 characters in a Latin1 encoded column - MySQL

I am about to undertake the tedious and gotcha-laden task of converting a database from Latin1 to UTF-8.

At this point I simply want to check what sort of data I have stored in my tables, as that will determine what approach I should use to convert the data.

Specifically, I want to check whether I have UTF-8 encoded characters stored in the Latin1 columns. What would be the best way to do this? If only a few rows are affected, I can just fix them manually.

Option 1. Perform a MySQL dump and use Perl to search for UTF-8 characters?

Option 2. Use MySQL CHAR_LENGTH to find rows with multi-byte characters? e.g. SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name); Is this enough?
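The same idea can be checked outside MySQL. Here is a small Python sketch (hypothetical data, not tied to any particular table) that classifies a raw byte string the way Option 2 implies: pure ASCII is safe under both encodings, high bytes that form valid UTF-8 are suspicious, and high bytes that don't are presumably genuine Latin1:

```python
# Client-side analogue of the LENGTH() != CHAR_LENGTH() check: flag values
# whose raw bytes contain anything above 0x7F, then see whether those bytes
# happen to form valid UTF-8 (a strong hint of double encoding).

def classify(raw: bytes) -> str:
    """Classify a column value fetched as raw bytes."""
    if all(b < 0x80 for b in raw):
        return "ascii"          # identical under latin1 and UTF-8
    try:
        raw.decode("utf-8")
        return "valid-utf8"     # likely UTF-8 bytes stored in a latin1 column
    except UnicodeDecodeError:
        return "latin1"         # high bytes that are not valid UTF-8

# Example: the UTF-8 bytes of "Björn" vs the latin1 bytes of "Björn"
print(classify("Björn".encode("utf-8")))    # valid-utf8
print(classify("Björn".encode("latin-1")))  # latin1
print(classify(b"Bjorn"))                   # ascii
```

Note the caveat raised in the comments below: some latin1 two-character sequences are also valid UTF-8, so "valid-utf8" is a strong hint, not proof.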

At the moment I have switched my MySQL client encoding to UTF-8.

Scathing answered 16/2, 2012 at 2:3 Comment(9)
By definition, you cannot store UTF-8 data in Latin1 columns. Care to give some more context for your question?Thanos
UTF-8 multibyte characters are all greater than number 128. But there is actually no way to determine what the intent of the character was: I think what you meant to ask was "can I detect non-ASCII characters in a Latin1 encoded column". Obviously, since byte sequence 0xF0 0x53 means two different things in UTF-8 and in Latin1, you won't know which it is even after finding it...Thoma
@Thanos You can inadvertently store UTF-8 data in a LATIN1 column because LATIN1 is an 8-bit character set. It just ends up looking like a mess in the wrong encoding.Jughead
@Jughead You mean the common mistake I described in kunststube.net/frontback? That doesn't mean you're storing UTF-8 data though, it means you're storing data which the database thinks is Latin1 but the client interprets as UTF-8. Whether that's what's meant here or not is hard to tell though, hence the question to the OP.Thanos
That's what I'm talking about. The database is convinced the data is LATIN1, but the client interprets it as UTF-8 anyway. You can put UTF-8 into a LATIN1 database by accident, but you can't put LATIN1 into UTF-8 because invalid UTF-8 characters will get truncated. There's no such thing as an invalid LATIN1 character.Jughead
@Thoma 0xF0 0x53 is not a valid UTF8 sequence, which is why this is quite doable, up to a point.Owsley
@triplee Curses, should have chosen one of the 30,000+ ambiguous two-byte sequences instead! But I think you understand my point. There are some characters which cannot be UTF-8, but many latin1 two-character sequences are also a valid UTF-8 two-byte character.Thoma
@Thoma Of course there are corner cases, but in the real world, this is quite doable. It's not hard to see that BjÃ¶rn is an erroneous UTF8 encoding of Björn. The ambiguities within the Latin-1 space begin with an accented uppercase A followed by a nonalphabetic, which is quite unlikely to occur in real-world textual data (unless your parents read XKCD and decided to name you Möjibake :-)Owsley
@Owsley And the byte sequence 0xC2 0xA0, which is a nonbreaking space in UTF-8 and a Â followed by a nonbreaking space in latin1? The wonderful disappearing character!Thoma

Character encoding, like time zones, is a constant source of problems.

What you can do is look for any "high-ASCII" characters as these are either LATIN1 accented characters or symbols, or the first of a UTF-8 multi-byte character. Telling the difference isn't going to be easy unless you cheat a bit.

To figure out what encoding is correct, you just SELECT two different versions and compare visually. Here's an example:

SELECT CONVERT(CONVERT(name USING BINARY) USING latin1) AS latin1, 
       CONVERT(CONVERT(name USING BINARY) USING utf8) AS utf8 
FROM users 
WHERE CONVERT(name USING BINARY) RLIKE CONCAT('[', UNHEX('80'), '-', UNHEX('FF'), ']')

This is made unusually complicated because the MySQL regexp engine seems to ignore escape sequences like \x80, which makes it necessary to use the UNHEX() function instead.

This produces results like this:

latin1                utf8
----------------------------------------
BjÃ¶rn                Björn
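The SQL above compares two interpretations of the same stored bytes. The same comparison can be sketched in Python, assuming you have fetched the raw column bytes (the hex value below is simply the UTF-8 encoding of "Björn"):

```python
# One byte string, two readings: how a latin1 column presents the bytes,
# versus how they read if they were really UTF-8 all along.

raw = bytes.fromhex("426ac3b6726e")  # UTF-8 "Björn" as it sits on disk

as_latin1 = raw.decode("latin-1")  # the latin1 interpretation
as_utf8   = raw.decode("utf-8")    # the UTF-8 interpretation

print(as_latin1)  # BjÃ¶rn  (mojibake: the bytes were not latin1 text)
print(as_utf8)    # Björn
```

As in the SQL version, the human eye does the final judging: whichever interpretation looks like plausible text is almost certainly the intended encoding.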
Jughead answered 16/2, 2012 at 3:52 Comment(5)
Apologies for a late response and vague initial question. Awarded this answer because it more or less helped me in detecting characters where the likely intention was a UTF8 character. Upvoted deceze's answer because it contained situations I have elsewhere in the databaseScathing
Awesome -- this little nugget helped me fix an issue where utf8 encoded data was inserted to a utf8 table but interpreted as latin1 because I entered it over the mysql CLI... funny though because since the system was set to UTF8 it looked fine when inputting and selecting (just not when being decoded and rendered on the associated website).Peregrination
Sometimes if you read and write data from two connections with precisely the same misconfiguration, it magically works. Two wrongs, on occasion, do make a right.Jughead
The first byte of a UTF-8 encoded codepoint above the ASCII range is in the range 0xC2-0xF4 (U+0080 starts with byte 0xC2; U+10FFFF starts with 0xF4), so the range in this answer could be made more restrictive to reduce false positives.Deodorize
This also gets false-positive results: the latin1 letter "é" (UTF-8 lead byte: 195, latin1 byte: 233) matches the range even when it is legitimate latin1. Is there a workaround for this?Gardiner

Since your question is not completely clear, let's assume some scenarios:

  1. Hitherto wrong connection: You've been connecting to your database incorrectly using the latin1 encoding, but have stored UTF-8 data in the database (the encoding of the column is irrelevant in this case). This is the case I described here. In this case, it's easy to fix: Dump the database contents to a file through a latin1 connection. This will translate the incorrectly stored data into correctly stored UTF-8, the way it has worked so far (read the aforelinked article for the gory details). You can then reimport the data into the database through a correctly set utf8 connection, and it will be stored as it should be.
  2. Hitherto wrong column encoding: UTF-8 data was inserted into a latin1 column through a utf8 connection. In that case, forget it; the data is gone. Any non-latin1 character will have been replaced by a ?.
  3. Hitherto everything fine, henceforth added support for UTF-8: You have Latin-1 data correctly stored in a latin1 column, inserted through a latin1 connection, but want to expand that to also allow UTF-8 data. In that case just change the column encoding to utf8. MySQL will convert the existing data for you. Then just make sure your database connection is set to utf8 when you insert UTF-8 data.
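Scenario 1 can be illustrated as a byte-level round trip. This is a sketch using "Björn" as stand-in data; the key point is that a latin1 dump hands the stored bytes back untouched, so reimporting them over a utf8 connection restores the intended text:

```python
# Scenario 1: UTF-8 bytes were written over a latin1 connection, so the
# database holds raw UTF-8 bytes that it *believes* are latin1.

original = "Björn"
stored   = original.encode("utf-8")  # the bytes actually sitting in the column

# Dumping through a latin1 connection passes the bytes through unchanged
# (latin1 maps every byte 0x00-0xFF to a codepoint, so this is an identity):
dumped = stored.decode("latin-1").encode("latin-1")

# Reimporting over a utf8 connection then reads them as the UTF-8 they were:
recovered = dumped.decode("utf-8")

assert recovered == original
```

This is exactly why the dump/reimport fix works: no byte is ever altered, only the label attached to the bytes changes.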
Thanos answered 16/2, 2012 at 5:51 Comment(3)
If multiple clients have been adding data, and some of them assumed they should submit utf8, you would get an unholy mix which basically needs to be sorted out manually. That doesn't mean you cannot automate parts of the process, and in fact the majority of cases can probably be decided without human intervention.Owsley
True, but then you're really completely foobar'd. Before even attempting to answer that scenario, the OP would need to give a lot more information about the actual problem at hand.Thanos
For case 1, the command that worked for me was mysqldump --default-character-set=latin1 -u user -p database. I then had to go into the dump file and change SET NAMES latin1 to utf8. Then reimport the dump file and all fixed.Avantgarde

There is a script on GitHub (mysql-convert-latin1-to-utf8) to help with this sort of thing.

Scrouge answered 28/1, 2013 at 19:29 Comment(4)
This script worked very nicely for me, and I contributed some improvements to make it much faster and more flexible. I've also got a branch which converts to MySQL's 'proper' utf8mb4 charset.Mukerji
This script worked, though I still don't understand how; I need to go through it sometime. Moved from latin1 to utf8 almost painlessly; after that I had to add mysql_set_charset("utf8"); for PHP to use it correctly.Bonbon
The OP asked how to detect UTF-8 characters in Latin1 columns. AFAICT, the mysql-convert-latin1-to-utf8 script, at present, does not actually help with that. Instead, it has a user-modifiable $collationMap array specifying a set of key-value pairs of collations. For each column whose collation matches one of the keys, the script will blindly assume its contents are encoded with the charset corresponding to the value's collation. The script changes the column's collation (and, implicitly, character set) to the latter, while preserving the binary value of the contents.Vasoconstrictor
Well, detection is kinda impossible. Someone jammed a square peg into a round hole and asked why it's not square when it comes back out... You can kinda tell because the data that comes out doesn't quite look right, but you have to look at it to know (or identify common character sequences that, when interpreted as UTF-8, are more likely than the corresponding sequence of latin1 characters).Scrouge

I would create a dump of the database and grep for all valid UTF8 sequences. Where to take it from there depends on what you get. There are multiple questions on SO about identifying invalid UTF8; you can basically just reverse the logic.

Edit: So basically, any field consisting entirely of 7-bit ASCII is safe, and any field containing an invalid UTF-8 sequence can be assumed to be Latin-1. The remaining data should be inspected - if you are lucky, a handful of obvious substitutions will fix the absolute majority (replace Ã¶ with Latin-1 ö, etc).
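The "grep for valid UTF-8 sequences" idea can be sketched as a byte regex over the dump file. This is a heuristic, not a full validator (the ranges below exclude the common overlong encodings but still slightly overmatch near the surrogate and U+10FFFF boundaries), and `suspect_lines` and its dump path are hypothetical names:

```python
import re

# Lead byte 0xC2-0xF4 followed by the right number of 0x80-0xBF continuation
# bytes: any match is an *apparent* UTF-8 multi-byte sequence.
UTF8_MULTIBYTE = re.compile(
    rb"[\xC2-\xDF][\x80-\xBF]"          # 2-byte sequences
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"     # 3-byte, excluding overlong forms
    rb"|[\xE1-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"  # 4-byte, excluding overlong forms
    rb"|[\xF1-\xF4][\x80-\xBF]{3}"
)

def suspect_lines(dump_path):
    """Yield (line number, line) for dump lines with apparent UTF-8 sequences."""
    with open(dump_path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            if UTF8_MULTIBYTE.search(line):
                yield lineno, line
```

Lines with high bytes that never match this pattern are the ones that can be assumed to be genuine Latin-1; the matches are the candidates for the substitution pass described above.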

Owsley answered 16/2, 2012 at 5:34 Comment(1)
This answer contains a fairly long list of the likely bad combinations.Mukerji

© 2022 - 2024 — McMap. All rights reserved.