What is the most efficient way to remove accents from a string, e.g. ÈâuÑ becomes Eaun?
Is there a simple, built-in way that I'm missing, or a regular expression?
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
iconv is a library for converting between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and less error-prone than trying to roll your own solution. (Did you know that there's a "Latin letter N with a curl"? Me neither.)
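As a rough sketch of how the call might be wrapped in practice: the helper below sets a UTF-8 locale first (the result of //TRANSLIT depends on the locale and on the iconv implementation PHP was built against) and reports failure instead of echoing `false`. The function name `to_ascii_iconv` and the default locale `en_US.UTF-8` are assumptions for illustration, not part of any library.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII with iconv.
// Returns null when iconv cannot perform the conversion on this platform.
function to_ascii_iconv(string $utf8, string $locale = 'en_US.UTF-8'): ?string
{
    // Remember the current LC_CTYPE so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // Assumption: this locale is installed. If it is not, setlocale()
    // returns false and the current locale simply stays in effect.
    setlocale(LC_CTYPE, $locale);

    // '@' silences the notice/warning iconv raises on failure;
    // the false return value is checked explicitly instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $ascii === false ? null : $ascii;
}

var_dump(to_ascii_iconv('ÈâuÑ'));
```

The exact output is platform-dependent (some builds return the plain letters, others insert characters such as ^ or ~, others fail), so it is worth testing on the server you deploy to.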
ASCII//TRANSLIT//IGNORE should help) and as so often, if encountering problems, the User Contributed Notes are a good read. php.net/manual/en/function.iconv.php – Lepper
setlocale(LC_ALL, "en_US.utf8"); $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string); – Desman
I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.
EDIT2: I discovered that some characters are not covered by the transliteration I originally posted. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)
I assume that by Latin they mean the Latin alphabet here and not one of the Latin character sets. But anyway, in my opinion Latin-ASCII should then transliterate it to something ASCII ...
EDIT3: Sorry for another change here. I had to take the range down to \u0080 instead of \u0100 to get only ASCII characters as output. The test above is updated.
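For reuse, the same rule string can be wrapped in a small function built on the object-oriented Transliterator API from the intl extension. This is only a sketch: the name `to_ascii_intl` is made up for illustration, and it returns null when intl is not loaded or the rules cannot be compiled.

```php
<?php
// Hypothetical helper around the intl Transliterator class.
function to_ascii_intl(string $text): ?string
{
    // The intl extension is optional; bail out if it is missing.
    if (!class_exists('Transliterator')) {
        return null;
    }

    // Same compound ID as in the answer: convert any script to Latin,
    // fold Latin to ASCII, then drop whatever is left in U+0080..U+7FFF.
    $rules = 'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove';

    $transliterator = Transliterator::create($rules);
    if ($transliterator === null) {
        return null; // the ID could not be parsed
    }

    $result = $transliterator->transliterate($text);

    return $result === false ? null : $result;
}

var_dump(to_ascii_intl('Übérmensch'));
```

Creating the Transliterator once and reusing it is probably worthwhile if you process many strings, since the rule string then only has to be parsed once.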
php_intl.dll extension enabled – Farnese
Reposting this on request of @palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
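function toASCII( $str )
{
    // Works byte-wise on ISO-8859-1: utf8_decode() turns every character that
    // does not exist in that encoding into '?', so such characters cannot be mapped.
    // Note: utf8_decode() is deprecated as of PHP 8.2.
    return strtr(utf8_decode($str),
        utf8_decode(
            'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}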
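The three-argument form of strtr() maps single bytes, which is why the function above has to go through ISO-8859-1. If you need characters outside that encoding, one possible workaround is the two-argument form with an array, which replaces whole substrings and therefore works directly on UTF-8. A minimal sketch follows; the name `to_ascii_map` and the deliberately tiny table are assumptions for illustration, and you would have to extend the table for your own data.

```php
<?php
// Hypothetical helper: UTF-8 safe replacement via strtr() with an array.
function to_ascii_map(string $utf8): string
{
    // Illustrative table only; it is far from complete.
    // Keys are UTF-8 sequences, values may be longer than one character.
    static $map = [
        'È' => 'E', 'â' => 'a', 'Ñ' => 'N',
        'Ł' => 'L', 'ł' => 'l', 'Ż' => 'Z', 'ż' => 'z',
        'ő' => 'o', 'Ő' => 'O', 'ű' => 'u', 'Ű' => 'U',
        'ß' => 'ss', 'Æ' => 'AE', 'æ' => 'ae',
    ];

    // With an array, strtr() looks for the keys as substrings, so the
    // multi-byte keys are matched as a whole and nothing becomes '?'.
    return strtr($utf8, $map);
}

var_dump(to_ascii_map('ÈâuÑ'));   // string(4) "EauN"
var_dump(to_ascii_map('Straße')); // string(7) "Strasse"
```

Characters that are not in the table are passed through unchanged, so this only helps when you know which characters to expect. The source file has to be saved as UTF-8 for the keys to match.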
ő, Ő, ű, Ű. Thanks. :) – Colorblind
ŻŹĆŃĄŚŁĘÓżźćńąśłęó. Try var_dump(strtr(utf8_decode('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq'), utf8_decode('ŻŹĆŃĄŚŁĘÓżźćńąśłęó'),'ZZCNASLEOzzcnasleo')); I got string(25) "qqqqeeeeeeeeOeeeeeeeeoqqq". Iconv is more reliable: var_dump(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq')); and I get string(25) "qqqqZZCNASLEOzzcnasleoqqq" – Detrusion
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
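One thing to keep in mind with the Normalizer approach: letters with a stroke such as Ł, ł, Ø, ø, Đ and đ have no Unicode decomposition, so NFKD leaves them as they are. Here is a sketch that combines the normalization step with a small manual fallback for those letters. The function name `strip_marks` and the fallback table are assumptions for illustration; the table is not exhaustive.

```php
<?php
// Hypothetical helper: strip combining marks, then patch letters
// that Unicode normalization does not decompose.
function strip_marks(string $text): ?string
{
    // Normalizer comes from the optional intl extension.
    if (!class_exists('Normalizer')) {
        return null;
    }

    // NFKD splits e.g. "é" into "e" followed by a combining acute accent.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return null; // input was not valid UTF-8
    }

    // \p{Mn} matches non-spacing marks, i.e. the detached accents.
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
    if ($stripped === null) {
        return null;
    }

    // Illustrative fallback for letters without a decomposition.
    return strtr($stripped, [
        'Ł' => 'L', 'ł' => 'l',
        'Ø' => 'O', 'ø' => 'o',
        'Đ' => 'D', 'đ' => 'd',
    ]);
}

var_dump(strip_marks('Łódź'));
```

With intl loaded this should print "Lodz"; without it the function returns null, so the caller can fall back to another method.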
ISO-8859-1? Are you sure? Won't this leave at least ÄÖÜ in place (as their 8859-1 counterparts)? – Lepper
ÈâuÑ using iconv gives `E^au~N. That's why the following cleanup is used. – Seena
Normalizer solution is quite good but in my case two characters Ł and ł are left untouched. My code: var_dump(preg_replace('/\p{Mn}/u', '',Normalizer::normalize('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq', Normalizer::FORM_KD))); and I get back: string(27) "qqqqZZCNASŁEOzzcnasłeoqqq". iconv works best for me. – Detrusion
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.