Why doesn't Normalizer::normalize (PHP) work?

I'm trying to normalize strings with characters like 'áéíóú' to 'aeiou' to simplify searches.

According to the answer to this question, I should use the Normalizer class for this.

The problem is that the normalize function appears to do nothing. For example, this code:

<?php echo 'Pérez, NFC: ' . normalizer_normalize('Pérez', Normalizer::NFC) 
    . ' NFD: ' .normalizer_normalize('Pérez', Normalizer::NFD)
    . ' NFKC: ' .normalizer_normalize('Pérez', Normalizer::NFKC) 
    . ' NFKD: ' .normalizer_normalize('Pérez', Normalizer::NFKD)?>
<br/>
<?php echo 'aáàä, êëéè,' 
    . ' FORM_C: ' . normalizer_normalize('aáàä, êëéè', Normalizer::FORM_C )
    . ' FORM_D: ' .normalizer_normalize('aáàä, êëéè', Normalizer::FORM_D)
    . ' FORM_KC: ' .normalizer_normalize('aáàä, êëéè', Normalizer::FORM_KC)
    . ' FORM_KD: ' .normalizer_normalize('aáàä, êëéè', Normalizer::FORM_KD)?>

shows:

Pérez, NFC: Pérez NFD: Pérez NFKC: Pérez NFKD: Pérez
aáàä, êëéè, FORM_C: aáàä, êëéè FORM_D: aáàä, êëéè FORM_KC: aáàä, êëéè FORM_KD: aáàä, êëéè 

What is normalize supposed to do?

---EDITED---

It gets stranger. When I copy and paste the result from the web browser, I see this in my editor and on the original page:

FORM_D: aáàä, êëéè

but on the Stack Overflow question page I see this (only in Code Sample mode):

FORM_D: aáàä, êëéè
Mcclurg answered 30/8, 2013 at 7:51 Comment(0)
M
10

Found on this page (the linked document now has different wording; the old version no longer exists):

Unicode and internationalization is a large topic, but you should know at least one more important thing. For historical reasons, Unicode allows alternative representations of some characters. For example, á can be written either as one precomposed character á with the Unicode code point U+00E1 or as a decomposed sequence of the letter a (U+0061) combined with the accent ´ (U+0301). For purposes of comparison and sorting, two such representations should be taken as equal. To solve this, the intl library provides the Normalizer class. This class in turn provides the normalize() method, which you can use to convert a string to a normalized composed or decomposed form. Your application should consistently transform all strings to one or the other form before performing comparisons.

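// "\u{...}" escapes require PHP 7.0+; they make the code points explicit.
echo Normalizer::normalize("a\u{0301}", Normalizer::FORM_C); // á (single code point U+00E1)
echo Normalizer::normalize("\u{00E1}", Normalizer::FORM_D); // a followed by combining acute U+0301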

So eliminating accents (and similar) is not the purpose of Normalizer.
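
This also explains why the output in the question looks unchanged: the composed and decomposed forms render identically but consist of different bytes. A minimal sketch to make the difference visible, assuming the intl extension is loaded and PHP 7.0+ for the `\u{...}` escape (the variable names are illustrative):

```php
<?php
// "P\u{00E9}rez" is "Pérez" with a precomposed é (U+00E9).
$nfc = Normalizer::normalize("P\u{00E9}rez", Normalizer::FORM_C);
$nfd = Normalizer::normalize("P\u{00E9}rez", Normalizer::FORM_D);

// Both strings look the same on screen, but they are not byte-identical.
var_dump($nfc === $nfd);  // bool(false)

// NFC keeps é as one code point (2 bytes in UTF-8): 6 bytes in total.
echo strlen($nfc), "\n";  // 6

// NFD splits é into "e" (1 byte) plus combining acute U+0301 (2 bytes): 7 bytes.
echo strlen($nfd), "\n";  // 7

// The hex dump shows the combining mark as the bytes cc 81 after the "e" (65).
echo bin2hex($nfd), "\n"; // 5065cc8172657a
```

Comparing `strlen()` or `bin2hex()` output is a quick way to check which form a string is in when a browser or editor draws both the same way.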

Mcclurg answered 30/8, 2013 at 11:44 Comment(0)
N
13

Normalizer with FORM_D can split the diacritics out from the base characters, then preg_replace can eliminate the diacritics:

$string = 'áéíóú';
echo preg_replace('/[\x{0300}-\x{036f}]/u', "", Normalizer::normalize($string , Normalizer::FORM_D));
//aeiou
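
The same idea can be wrapped in a small reusable function. This is only a sketch: the name `strip_diacritics` and the fallback behaviour are assumptions, not part of the original answer, and it requires the intl extension.

```php
<?php
// Hypothetical helper: decompose the string, then drop the combining marks.
function strip_diacritics(string $text): string
{
    // FORM_D splits each accented letter into base letter + combining mark(s).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // normalize() returns false on failure (for example, invalid UTF-8);
        // returning the input unchanged is an illustrative policy choice.
        return $text;
    }

    // Remove the Combining Diacritical Marks block (U+0300 to U+036F),
    // the same range used in the answer above.
    $stripped = preg_replace('/[\x{0300}-\x{036f}]/u', '', $decomposed);

    // preg_replace() returns null on error; fall back to the input in that case.
    return $stripped ?? $text;
}

echo strip_diacritics('áéíóú'), "\n"; // aeiou
echo strip_diacritics('Pérez'), "\n"; // Perez
```

Letters that have no canonical decomposition, such as ł or ø, are not affected by this approach and pass through unchanged.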
Nafis answered 7/4, 2019 at 6:42 Comment(1)
Very nice, I had done it before in JavaScript using this approach, and it worked for PHP too. – Guilbert
A
3

What you are looking for is iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text).

http://php.net/manual/function.iconv.php

Be careful with LC_* settings! Depending on the setting the transliteration might change.
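
A sketch of how the locale dependence might be handled, under several assumptions: the helper name `translit_to_ascii` is hypothetical, the locale names `en_US.UTF-8` and `C.UTF-8` may not be installed on a given system, and the result of `//TRANSLIT` differs between iconv implementations, so no particular output is promised for accented input.

```php
<?php
// Hypothetical helper: transliterate to ASCII with an explicit LC_CTYPE locale.
function translit_to_ascii(string $text): string
{
    // Passing "0" queries the current LC_CTYPE setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // Try UTF-8 locales in order; setlocale() uses the first one that exists.
    // Assumption: at least one of these is installed on the system.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

    // @ suppresses the notice iconv raises when a character cannot be converted.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // Restore the previous locale so the rest of the script is unaffected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // iconv() returns false on failure; fall back to the original text.
    return $result === false ? $text : $result;
}

// The exact result depends on the iconv implementation and available locales.
echo translit_to_ascii('jakaś'), "\n";
```

If the output contains question marks in place of accented letters, the active LC_CTYPE locale is a likely place to look first.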

Armorer answered 15/6, 2017 at 13:39 Comment(2)
To remove accents, it is better to use: iconv("UTF-8", "ASCII//TRANSLIT", $text); – Accrue
@StefanoColetta for some reason I can't get it to work. Trying it on the sample string jakaś gives me jaka? – Halfdan
C
1

For a function that actually removes the accents, the best I have found so far is remove_accents($string) in the WordPress core: https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L1127

(Note: I have filed a bug against it so that they take an updated version I provided, which documents each character and how it is translated, so it may change in the future.)

Coronado answered 14/11, 2015 at 1:2 Comment(1)
One could just copy and paste the function, no matter what framework it is from. WordPress took care to cover most of the possible multi-language cases, so this should be considered a valid answer to the issue. – Abomasum
