normalizing accented characters in MySQL queries
Asked Answered
W

5

20

I'd like to be able to do queries that normalize accented characters, so that for example:

é, è, and ê

are all treated as 'e', in queries using '=' and 'like'. I have a row with username field set to 'rené', and I'd like to be able to match on it with both 'rene' and 'rené'.

I'm attempting to do this with the 'collate' clause in MySQL 5.0.8. I get the following error:

mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

FWIW, my table was created with:

CREATE TABLE `User` (
  `id` bigint(19) NOT NULL auto_increment,
  `username` varchar(32) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `uniqueUsername` (`username`)
) ENGINE=InnoDB AUTO_INCREMENT=56790 DEFAULT CHARSET=utf8
Wylen answered 20/2, 2010 at 16:24 Comment(3)
What happens if you change the table character set to utf8?Gillis
friedo- isn't that what the DEFAULT CHARSET=utf8 does? (I didn't actually create this table; I reverse-engineered it via 'show create table User').Wylen
Yes, that's what it's supposed to do. I don't know why MySQL seems to think the chartset is latin1 if you have it set to utf8, though.Gillis
B
13

The reason for the error is not the table but the characterset of your input, i.e. the 'rené' in your query. The behaviour depends on the character_set_connection variable:

The character set used for literals that do not have a character set introducer and for number-to-string conversion.

Using the MySQL Client, change it using SET NAMES:

A SET NAMES 'charset_name' statement is equivalent to these three statements:

SET character_set_client = charset_name;
SET character_set_results = charset_name;
SET character_set_connection = charset_name;

(from http://dev.mysql.com/doc/refman/5.5/en/charset-connection.html)

Example output:

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from User where username = 'rené' collate utf8_general_ci;
ERROR 1253 (42000): COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'latin1'

mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from User where username = 'rené' collate utf8_general_ci;
Empty set (0.00 sec)

Altenatively, use can explicitly set the character set using a 'character set introducer':

mysql> set names latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from User where username = _utf8'rené' collate utf8_general_ci;
Empty set (0.00 sec)

I know this question is pretty old but since Google led me here for a related question, I though it still deserves an answer :)

Bluet answered 3/6, 2013 at 13:52 Comment(0)
M
7

I'd suggest that you save the normalized versions to your table in addition with the real username. Changing the encoding on the fly can be expensive, and you have to do the conversion again for every row on every search.

If you're using PHP, you can use iconv() to handle the conversion:

$username = 'rené';
$normalized = iconv('UTF-8', 'ASCII//TRANSLIT', $string);

Then you'd just save both versions and use the normalized version for searching and normal username for display. Comparing and selecting will be alot faster from the normalized column, provided that you normalize the search string also:

$search = mysql_real_escape_string(iconv('UTF-8', 'ASCII//TRANSLIT', $_GET['search']));
mysql_query("SELECT * FROM User WHERE normalized LIKE '%".$search."%'");

Of course this method might not be viable if you have several columns that need normalizations, but in your specific case this might work allright.

Militant answered 20/2, 2010 at 16:34 Comment(2)
Hmm, I'm a bit leery of keeping data in multiple places (DRY), unless it proves to be a bottleneck. In this case, it would involve 3 existing fields- username, firstName, and lastName (I vastly simplified my table structure for the purpose of asking a simple question).Wylen
I using PHP not work ... u need more 1 next line str_replace("?", "%", $normalized); .. enjoy!Mythomania
L
6

I have implemented a strtr php function/tr unix command in MySQL you can get the source here

You can use as:

SELECT tr(name, 'áäèëî', 'aaeei') FROM persons

or to strip some characters

SELECT tr(name, 'áäèëî', null) FROM persons
Linette answered 22/10, 2012 at 23:58 Comment(0)
T
4
$normalized = iconv('UTF-8', 'ASCII//TRANSLIT', $string);

is a perfect php solution, but in mysql? CONVERT?

in mysql

SELECT 'Álvaro José' as accented, (CONVERT ('Álvaro José' USING ascii)) as notaccented

Produce:

Álvaro José     ?lvaro Jos?

The accented words is not converted to no accented words, it is not equivalent a translit of iconv.

RegExp don't work with UTF-8.

Not any solution.

Talkative answered 20/5, 2011 at 15:21 Comment(1)
This is not an answer but a comment with useful informationDelgado
S
1

Does a search using English characters return results with foreign characters? I wrote the following script to compare collations in MySQL 5.7 (Should also work for MariaDB 10.2+):

        $db->query('CREATE TABLE IF NOT EXISTS test (name varchar(20))
         Engine=InnoDB character set utf8mb4 collate utf8mb4_unicode_520_ci');

        $db->query('CREATE TABLE IF NOT EXISTS test2 (name varchar(20))
         Engine=InnoDB character set utf8mb4 collate utf8mb4_unicode_ci');

        $db->query("insert into test values('Łove 520')");
        $db->query("insert into test2 values('Łove 520')");

        $types = ['utf8mb4_unicode_520_ci', 'utf8mb4_unicode_ci'];
        $tables = ['test' => 'utf8mb4_unicode_520_ci', 'test2' => 'utf8mb4_unicode_ci'];
        foreach($types as $n)
        {
            foreach($tables as $ta => $tc)
            {
                $db->query("SET NAMES 'utf8mb4' COLLATE '$n'");
                $res = $db->query("Select * from $ta where name like 'Love%'"); // Ł equal
                echo "\ntable $ta($tc), names($n): ".$res->fetchColumn(0);
            }
        }

Here are the results:

table test(utf8mb4_unicode_520_ci), names(utf8mb4_unicode_520_ci): Łove 520
table test2(utf8mb4_unicode_ci), names(utf8mb4_unicode_520_ci):
table test(utf8mb4_unicode_520_ci), names(utf8mb4_unicode_ci): Łove 520
table test2(utf8mb4_unicode_ci), names(utf8mb4_unicode_ci):

(Note: I ran the script from the command line, so it appears as ┼üove 520 instead of Łove 520)

It appears that L == Ł when the table collation is utf8mb4_unicode_520_ci, regardless of the connection collation. However, it is not equivalent if you only use utf8mb4_unicode_ci.

Shue answered 18/1, 2020 at 22:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.