How do I convert a column to ASCII on the fly without saving to check for matches with an external ASCII string?
Asked Answered
L

4

5

I have a member search function where you can give parts of names and the return should be all members having at least one of username, firstname or lastname matching that input. The problem here is that some names have 'weird' characters like the é in Renée and the user doesn't wanna type the weird character but the normal ASCII substitute e.

In PHP I convert the input string to ASCII using iconv (just in case someone types weird characters). In the database however I should also convert the weird chars to ASCII (obviously) for the strings to match.

I tried the following:

SELECT
  CONVERT(_latin1'Renée' USING ascii) t1, 
  CAST(_latin1'Renée' AS CHAR CHARACTER SET ASCII) t2;

(That's two tries.) Both don't work. Both have Ren?e as output. The question mark should be an e. It's alright if it outputs Ren?ee since I can just remove all question marks after the convert.

As you can imagine, the columns I want to query are encoded Latin1.

Thanks.

Lack answered 20/11, 2010 at 17:28 Comment(2)
Is this possible in some other way? Too much freedom is fine, I'll just filter the results with PHP strictly after the filter in MySQL. Eg. I'm good with getting "Renee", "Renée" and "Renäe" and "Renõe" as MySQL results when inputting "Renee". Any way??Lack
The best way would be to store the matches. Conversion on the fly is extremely slow for huge data because it isn't index-smart.Enounce
W
7

You don't need to convert anything. Your requirement is to compare two strings and ask if they're equal, ignoring accents; the database server can use a collation to do that for you:

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.

mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'a' = 'A', 'a' = 'À', 'a' = 'á';
+-----------+-----------+-----------+
| 'a' = 'A' | 'a' = 'À' | 'a' = 'á' |
+-----------+-----------+-----------+
|         1 |         1 |         1 |
+-----------+-----------+-----------+
1 row in set (0.06 sec)
Wino answered 24/12, 2010 at 11:33 Comment(3)
the above is not true, the data storage is latin1, and OP might not apply UTF-8 in page encodingIrremovable
Unluckily(?) that doesn't work. I try select id from members where lastname like 'test6e%' (with a value for lastname of 'test6ë'). No records returned. The database, table and column are UTF8. When I try select 'Reneé' = 'Renee', 'Renëe' = 'Renee'; they do return true. Weird?Lack
@Rudie: Works fine in my environment, but look at my answer for explanations/precisions.Lamppost
C
6

First off, it should work this way:

SELECT * FROM `test` WHERE `name` COLLATE utf8_general_ci LIKE '%renee%';

Where the test table is:

+-----+--------+
| id  | name   |
+-----+--------+
|  1  | Renée  |
|  2  | Renêe  |
|  3  | Renee  |
+-----+--------+

What is your MySQL version, and how do you try to match things?


One of the other possible solutions is transliteration.

Related: PHP Transliteration

Transliterating the input should not be a problem, but transliterating the values from the permanent storage (e.g. db) real-time during the search may not be feasible. So you can add three more fields like: username_slug, firstname_slug and lastname_slug. When inserting/modifying a record, set the slug values appropriately. And when searching, search the transliterated input against that slug fields.

+------+----------+---------------+----------+---------------+ ...
| id   | username | username_slug | lastname | lastname_slug | ...
+------+----------+---------------+----------+---------------+ ...
|    1 | Renée    |    renee      | La Niña  | la-nina       | ...
|    2 | Renêe    |    renee      | ...      | ...           | ...
|    3 | Renee    |    renee      | ...      | ...           | ...
+------+----------+---------------+----------+---------------+ ...

A search for "renee" or "renèe" would match all of the records.

As a side effect, you may be able to use that fields for generating SEF (search engine friendly) links, hence they are named ,..._slug, e.g. example.com/users/renee. Of course, in that case you should check for the uniqueness of the slug field.

Cidevant answered 26/12, 2010 at 14:11 Comment(2)
It could be a good idea at the beginning, but this type of setup is an update nightmare... unless you have a solid framework and no one will ever update directly the database manually. I would really prefer a real-time transliteration instead of keeping it into the database.Lamppost
Normally yeah, but de-normalization is usually on the way if performance starts to bother :)Gleesome
C
3

The CAST() operator in the context of character encodings translates from one method of character storage to another — it does not change the actual characters, which is what you are after. An é character is what it is in any character set, it is not an e. You need to convert accented characters to non-accented characters, which is a different issue and has been asked a number of times previously (normalizing accented characters in MySQL queries).

I am unsure if there is a way to do this directly in MySQL, short of having a translation table and going through letter by letter. It would most likely be easier to write a PHP script to go through the database and make the translations.

Crumb answered 20/11, 2010 at 17:51 Comment(2)
I don't want to save the 'translations'. The translations are for searching only. The display results should be as they are: with accents and other exotic chars. If PHP has a function for it (iconv), why not MySQL? I don't (want to) believe that!Lack
@Lack PHP has a great many functions MySQL does not have. SQL languages tend to be very lightweight in terms of their standard library. It would be possible to write a function to achieve what you desire, though the performance might not be fantastic unless you wrote it as a UDF, or native function (though with the latter, you could just call the iconv library).Crumb
L
3

@vincebowdren answer above works, I'm just adding this as an answer for formatting purposes:

CREATE TABLE `members` (
  `id` int(11) DEFAULT NULL,
  `lastname` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL
);
insert into members values (1, 'test6ë');
select id from members where lastname like 'test6e%';

Yields

+------+
| id   |
+------+
|    1 |
+------+

And using Latin1,

set names latin1;
CREATE TABLE `members2` (
  `id` int(11) DEFAULT NULL,
  `lastname` varchar(20) CHARACTER SET latin1 DEFAULT NULL
);
insert into members2 values (1, 'Renée');
select id from members2 where lastname like '%Renee%';

will yield:

+------+
| id   |
+------+
|    1 |
+------+

Of course, the OP should have the same charset in the application (PHP), connection (MySQL on Linux used to default to latin1 in 5.0, but defaults to UTF8 in 5.1), and in the field datatype to have less unknowns. Collations take care of the rest.

EDIT: I wrote should to have a better control over everything, but the following also works:

set names latin1;
select id from members where lastname like 'test6ë%';

Because, once the connection charset is set, MySQL does the conversion internally. In this case, it will convert somehow convert and compare the UTF8 string (from DB) to the latin1 (from query).

EDIT 2: Some skepticism requires me to provide an even more convincing example:

Given the statements above, here what I did more. Make sure the terminal is in UTF8.

set names utf8;
insert into members values (5, 'Renée'), (6, 'Renêe'), (7, 'Renèe');
select members.id, members.lastname, members2.id, members2.lastname
from members inner join members2 using (lastname);

Remember that members is in utf8 and members2 is in latin1.

+------+----------+------+----------+
| id   | lastname | id   | lastname |
+------+----------+------+----------+
|    5 | Renée    |    1 | Renée    |
|    6 | Renêe    |    1 | Renée    |
|    7 | Renèe    |    1 | Renée    |
+------+----------+------+----------+

which proves with the correct settings, the collation does the work for you.

Lamppost answered 25/12, 2010 at 6:31 Comment(10)
@Lamppost - is not true, what you are compare on Renée is an UTF8, not a latin1 (iso-8859-*)Irremovable
@ajreal: It is translated. I tested it before posting. You need to configure your environment accordingly (terminal, connection, and collation). Have you tested it before saying I'm wrong?Lamppost
@Lamppost - obviously you get it wrong...your method is assuming characters are all in utf-8, which is not true. my hints to you, use char_lengthIrremovable
@ajreal: You are reading it wrong. One table is in utf8 and the other is in latin1. Please read carefully before saying I assume anything. Have you tried before saying I'm wrong? Use a terminal not PHP. Set your environment properly for it to work.Lamppost
@Danasaure - sigh,the user input can be anything, it does not limited to just utf-8, who telling you is utf-8? Obviously, is a iso-8859-*. And more importantly, it does not matter how it store in the database, is about how input compare with databaseIrremovable
@ajreal: Obviously you don't want to understand. My example use collation to prove an utf8 string compares properly to a latin1 string. Obviously a latin1 string does not need any conversion to do what the OP needs to do if the connection charset is latin1. Perhaps you don't understand how collations work to argue this much. I'm not continuing this pointless discussion if you keep focusing on the charset because you are not proving it does not work. You are wrong because storage charset matters.Lamppost
@Lamppost I'm sure your tests work, but in my environment it doesn't. I've tried SET NAMES 'utf8' COLLATE 'utf8_general_ci'; and SET NAMES 'utf8'; before quering the members table. The input string is ASCII (PHP makes sure of that). All database (columns/tables/database) is UTF8. I don't know about the 'connection' (how would I know?). The query select id, lastname from members where lastname like 'test6e%'; yields no results.Lack
I've ran a few more tests and something strange happens when saving the lastnames (and other colums I assume) into the database. When I exec update members set lastname = 'test6é' where id = 226948; on a UTF8 page and then query lastname with that ID, it returns 'test6é'. When I save it in PhpMyAdmin (also on a UTF8 page), it saves and returns the right value: 'test6é'. What happens?? Both pages use the same MySQLi connection and have a <meta charset=utf8> header (in HTML and HTTP). (The values are shown correctly in my app tho!)Lack
@Rudie: There must be a connection parameter for specifying the charset. Sorry I am not PHP fluent, only MySQL. Just wondering, why in OP you used latin1 and your testings in UTF8?Lamppost
@Rudie: if you want to set your php's mysqli connection to use a specific encoding (ideally utf-8), you can use the mysqli::set_charset method. See php.net/manual/en/mysqli.set-charset.php for details. I use this in my php, after discovering that without it I had been mistakenly saving badly encoded data into my database.Wino

© 2022 - 2024 — McMap. All rights reserved.