Comparing strings in PHP the same way MySQL does

Asked 22/1, 2009 at 22:5 Answered 19/9, 2013 at 7:50

Solved php mysql utf-8 collation mysql-error-1062

I'm storing a varchar in a utf8 MySQL table and using utf8_general_ci collation. I have a unique index on the varchar. I'd like to do a string comparison in PHP that is equivalent to what MySQL will do on the index.

A specific example is that I'd like to be able to detect that 'a' is considered equivalent to 'À' in PHP before this happens:

mysql> insert UniTest (str) values ('a');                                   
Query OK, 1 row affected (0.00 sec)

mysql> insert UniTest (str) values ('À');                                   
ERROR 1062 (23000): Duplicate entry 'À' for key 1

Sternick answered 22/1, 2009 at 22:5 Comment(0)

The collation has nothing to do with the storage. You need to set the charset to determine the storage encoding. The collation governs how comparison and sorting should happen. The collation must be charset aware, but otherwise it has nothing to do with the charset.

To answer your question, you can use iconv to translitter the text, and then compare it. For example:

function compare($s1, $s2) {
  return strcmp(
    iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $s1),
    iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $s2));
}

This is basically what MySql will do for you, although it's probably faster and it may have a slightly different collation-table than ISO-8859-1//TRANSLIT. Not entirely sure about that.

Would probably be easier to use the database though, as others have already suggested.

Frankfort answered 23/1, 2009 at 9:25 Comment(7)

I edited the question to accurately reflect the collation vs charset for storage. Can you give me an example of using mb_string to implement this? I don't see a mb_strcmp in the docs. – Sternick 23/1, 2009 at 16:12

I made a mistake - You need to use iconv - not mb_string. I have edited my answer. – Frankfort 24/1, 2009 at 12:35

This is getting closer. the //TRANSLIT trick is good, but as you mention, ISO-8859-1 isn't what I need. MySQL is using a collation that looks like this, which I would need to emulate somehow: collation-charts.org/mysql60/… – Sternick 24/1, 2009 at 19:31

If the rather limited functionality of Iconv doesn't suit you, you can try with this extension: derickrethans.nl/translit.php or this (plain php): sitepoint.com/blogs/2006/03/03/… – Frankfort 24/1, 2009 at 21:18

Both allows you to use a userdefined translitteration database. You could copy MySqls if you need an exact match. But it'd probably be easier to just use the database in the first place ;) – Frankfort 24/1, 2009 at 21:19

Also look into unicode normalization form c (iirc). I recently did this for something in python. It'll break the character into its composite pieces, then you can strip out non-ascci. "<accent A>" => "A<unicode accent>" => "A" – Mister 4/2, 2009 at 18:12

The sitepoint link posted by @Frankfort is broken by now. Here is the replacement: sitepoint.com/us-ascii-transliterations-of-unicode-text – Unburden 9/10, 2012 at 9:6

Would it be reasonable just to let MySQL do the work, by submitting a query to MySQL like:

SELECT CASE WHEN '$a' = '$b' THEN 1 ELSE 0 END

EDIT post clarification:

You could one-time iterate through the entire character set of interest cartesian joined to itself and build a standard php associative array of equivalence sets.

    for each $char1 in $charset {  
        for each $char2 in $charset {  
            $charmatch[$char1][$char2] = mysqlTestMatch($char1, $char2));  
        }  
    }

Then you'd need to test each string character by character, to see if a) they are the same, or if not, b) they are equivalent.

Rochelle answered 22/1, 2009 at 22:10 Comment(1)

I'm doing this with very large numbers of strings for lots of different reasons, so I'd like to avoid the database. – Sternick 28/1, 2009 at 23:46

Why don’t you just let MySQL decide whether there already is a record with the same key?

You could run a SELECT query to ask if there is already a record with this attribute:

SELECT 1
FROM UniTest
WHERE str = "À"

Or you just give it a try inserting the new record and use the functions mysql_error() and mysql_errno() to see if an error occured.

Coss answered 22/1, 2009 at 22:14 Comment(0)

Use intl's Collator or Transliterator.

$s1 = 'a';
$s2 = 'À';

var_dump(
    is_same_string($s1, $s2),
    $s1 === transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $s2)
);

function is_same_string($str, $str2, $locale = 'en_US')
{
    $coll = collator_create($locale);
    collator_set_strength($coll, Collator::PRIMARY);  
    return 0 === collator_compare($coll, $str, $str2);
}

Yeung answered 19/9, 2013 at 7:50 Comment(0)

So, if I get it correctly, you want to do a similar compare in PHP as you would get in a check against a UTF-8 General index check in MySQL?

The easiest thing would be to create a helper function that would convert a string according to utf8_general_ci rules used by MySSQL, which is mainly to convert certain letters to a base letter.

The rules for that MySQL collation is listed here:

http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html

For example, if you scroll down just a bit to the "gold A" on the left, you'll see all the characters that get converted to that A.

Given a helper function, called for example utf8g_to_ascii(), you could write a function:

function utf8_compare($s1, $s2) {
   $a = utf8g_to_ascii($s1);
   $b = utf8g_to_ascii($s2);
   return strcmp( $a, $b );
}

I would model my code after:

http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php

Jehiel answered 4/2, 2009 at 20:33 Comment(0)

Recommended topics

Hot tags