How to remove accents and turn letters into "plain" ASCII characters? [duplicate]

What is the most efficient way to remove accents from a string, e.g. so that ÈâuÑ becomes EauN?

Is there a simple, built-in way that I'm missing, or a regular expression?

Vitia answered 22/8, 2010 at 18:21 Comment(2)
@Peeps: telling users to search Google is against Stack Overflow's etiquette. If the question doesn't exist on the website, it's better for everyone if it is asked, even if the OP already knows the answer, since it will increase our number of non-duplicate questions. So maybe next time someone searches for it with Google, they will find this very question, and we will have one more user. - Siobhansion
@Andreas good point. However, this is most certainly a SO duplicate, so Peeps kind of has a small point :) I'm too lazy to search for it right now, though. - Lepper

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
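
As a rough sketch of how the call above could be wrapped, here is a small helper that sets a UTF-8 locale before calling iconv() and restores it afterwards. The function name iconv_to_ascii() and the default locale 'en_US.UTF-8' are assumptions for illustration only; the locale has to exist on your system (check with "locale -a"), and the exact output still depends on the iconv implementation PHP was built against.

```php
<?php
// Sketch only: iconv_to_ascii() and the default locale are illustrative assumptions.
function iconv_to_ascii(string $text, string $locale = 'en_US.UTF-8'): ?string
{
    if (!function_exists('iconv')) {
        return null;                           // iconv extension not available
    }

    // On glibc-based systems the transliteration result depends on LC_CTYPE.
    $previous = setlocale(LC_CTYPE, '0');      // '0' only reads the current setting
    setlocale(LC_CTYPE, $locale);              // returns false if the locale is missing

    // @ suppresses the notice iconv() raises for characters it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);        // restore the caller's locale
    }

    return $ascii === false ? null : $ascii;
}

var_dump(iconv_to_ascii('ÈâuÑ'));
```

The helper returns null instead of false on failure so the caller can tell "could not convert" apart from an empty string.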

Herby answered 22/8, 2010 at 18:27 Comment(9)
+1 Beat me to it. This should work best. However, note that this tends to fail if there are invalid characters in the input (using ASCII//TRANSLIT//IGNORE should help) and, as so often, if you encounter problems, the User Contributed Notes are a good read: php.net/manual/en/function.iconv.php - Lepper
For some reason, sometimes I can't get this to work. See codepad.viper-7.com/SUufA4. But on another machine, I got "`E^au~N". Not what was desired, though. - Espinal
Nice, simple and small and works... for me. - Vitia
This iconv has some conflicts, so I will ask a similar question. - Vitia
This did not work for me at first. Accented characters just became ? characters. As per a comment on iconv() on the PHP manual page, I first ran: setlocale(LC_ALL,'en_CA.utf8'); and then everything worked perfectly. The 'en_CA.utf8' was the default locale on my system. Try "locale -a" to see a list of available locales. - Etymon
This iconv() solution works for many characters, but not all. For example, "Colbjørnsensgade" becomes "Colbj?rnsensgade". That's why the transliterator_transliterate() solution by SimonSimCity is usually a better choice (but requires the right libraries installed to work). - Infiltration
This doesn't work for all Russian characters. - Terbecki
This fixed the question marks for me: setlocale(LC_ALL, "en_US.utf8"); $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string); - Desman
Just another upvote for these comments here. I spent a few hours today trying to debug why my ASCII//TRANSLIT//IGNORE code wasn't working on a German a-umlaut. My development platform worked fine. The live server failed. After trying a thousand things, the setlocale call fixed it; I added it to both. - Accordant

I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.

EDIT2: I discovered that some characters are not covered by the transliteration I posted originally. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)

I assume that by Latin they mean the Latin alphabet here, not one of the Latin character sets. But anyway, in my opinion Latin-ASCII should then transliterate it to something in ASCII ...

EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100 to get only ASCII characters as output. The test above is updated.
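
If you want to reuse the rule string from the example, a small wrapper could look like the following. This is only a sketch: the function name intl_to_ascii() is made up for illustration, and it returns null when the intl extension is not loaded rather than failing with an undefined-function error.

```php
<?php
// Sketch only: intl_to_ascii() is a hypothetical helper name.
function intl_to_ascii(string $text): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null;                            // intl extension is not enabled
    }

    // Same rule chain as in the answer: transliterate to Latin, fold to ASCII,
    // then drop whatever is still outside the ASCII range.
    $rules = 'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove';
    $ascii = transliterator_transliterate($rules, $text);

    return $ascii === false ? null : $ascii;
}

var_dump(intl_to_ascii('ÈâuÑ'));                // string(4) "EauN" when intl is loaded
var_dump(intl_to_ascii('Colbjørnsensgade'));
```

Drop the last rule ('[\u0080-\u7fff] remove') if you would rather keep characters that have no ASCII equivalent.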

Problem answered 15/4, 2013 at 18:40 Comment(4)
Note: it needs the php_intl.dll extension enabled. - Farnese
I agree, this was the best function for me too! (and I tried many) - Merman
Really good solution, very easy to use and more useful than other solutions using str_replace. - Holman
It should be noted that this will not just transliterate the text (as the OP asked), but will remove some characters too, e.g. € (euro sign) will be removed. Just pass 'Any-Latin; Latin-ASCII;' as the first param to keep those. Optionally, you can then use iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str) to transform "€" to "EUR". - Underdrawers

Reposting this on request of @palantir ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...

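// Maps accented Latin-1 characters to ASCII look-alikes, one byte to one byte.
// utf8_decode() is deprecated as of PHP 8.2, and it turns every character
// outside ISO-8859-1 into "?" before strtr() runs.
function toASCII( $str )
{
    return strtr(utf8_decode($str),
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}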
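
A variant of the same strtr() idea, as a sketch rather than a drop-in replacement: strtr() also accepts an array of replacement pairs, which works directly on UTF-8 strings (no utf8_decode() needed) and allows multi-letter results such as Œ to OE. The function name strtr_to_ascii() and the deliberately tiny map are illustrative assumptions; a real map would need many more entries, and the source file and the input must both be UTF-8 with precomposed characters.

```php
<?php
// Sketch only: strtr_to_ascii() and its short map are illustrative assumptions.
function strtr_to_ascii(string $text): string
{
    // Keys are whole UTF-8 byte sequences, so multi-byte characters are safe.
    static $map = [
        'È' => 'E',  'â' => 'a',  'Ñ' => 'N',
        'Æ' => 'AE', 'æ' => 'ae',
        'Œ' => 'OE', 'œ' => 'oe',
        'ß' => 'ss',
        'Ø' => 'O',  'ø' => 'o',
        'Ł' => 'L',  'ł' => 'l',
    ];

    // Characters that are not in the map are left unchanged.
    return strtr($text, $map);
}

var_dump(strtr_to_ascii('ÈâuÑ'));    // string(4) "EauN"
var_dump(strtr_to_ascii('Œuvre'));   // string(6) "OEuvre"
```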
Innovation answered 28/7, 2011 at 10:51 Comment(4)
You should also put in the following letters: ő, Ő, ű, Ű. Thanks. :) - Colorblind
This is not a reliable method. It does not work for Polish accented chars like ŻŹĆŃĄŚŁĘÓżźćńąśłęó. Try var_dump(strtr(utf8_decode('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq'), utf8_decode('ŻŹĆŃĄŚŁĘÓżźćńąśłęó'),'ZZCNASLEOzzcnasleo')); I got string(25) "qqqqeeeeeeeeOeeeeeeeeoqqq". Iconv is more reliable: var_dump(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq')); and I get string(25) "qqqqZZCNASLEOzzcnasleoqqq". - Detrusion
It converts 'Горловка' to YYYYYYYY for me, not good. - Carrier
It's not the best in terms of performance, and it also produces an incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one. - Ehtelehud

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
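
For reference, the Normalizer one-liner could be wrapped roughly like this. It is a sketch under the assumption that the intl extension (which provides the Normalizer class) is loaded; the function name strip_marks() is made up for illustration. Keep in mind that this only removes combining marks, so letters without a Unicode decomposition, such as Ł or ø, pass through unchanged.

```php
<?php
// Sketch only: strip_marks() is a hypothetical helper name.
function strip_marks(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null;                            // intl extension is not enabled
    }

    // NFKD splits "È" into "E" plus a combining grave accent (U+0300), etc.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return null;                            // input was not valid UTF-8
    }

    // \p{Mn} matches non-spacing marks; the /u flag enables UTF-8 matching.
    return preg_replace('/\p{Mn}/u', '', $decomposed);
}

var_dump(strip_marks('ÈâuÑ'));   // "EauN" when intl is loaded
var_dump(strip_marks('Łódź'));   // "Łodz": Ł has no decomposition, so it stays
```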
Seena answered 22/8, 2010 at 18:28 Comment(11)
ISO-8859-1? Are you sure? Won't this leave at least ÄÖÜ in place (as their 8859-1 counterparts)? - Lepper
What’s the reason for the down vote? - Seena
Downvote isn't mine. However, the OP is not asking to remove non-alphabetic characters, is he? - Lepper
It was mine. Reverted now that you fixed it. - Espinal
@Pekka: The transliteration of ÈâuÑ using iconv gives `E^au~N. That’s why the following cleanup is used. - Seena
@Seena I see. I'm sorry, we have had this discussion in a duplicate somewhere already :) +1 for the most complete solution, then, that should be made the accepted one. Update: If I had any votes left. - Lepper
By the way, what you say and your code don't match once again. FORM_D makes more sense. - Espinal
@Artefacto: Thanks for the remark; fixed it. And take a look at figure 6 in unicode.org/reports/tr15/#Norm_Forms. - Seena
@Seena OK, I guess it's a matter of preference, though strictly that normalization won't take care only of the marks. See also the other question of the OP. I took some, erm, inspiration from you (basically only replaced the [a-z] regex you then had with \p{M} and left Normalizer::FORM_D). - Espinal
The normalize function works for me. - Eadith
@Seena your Normalizer solution is quite good, but in my case two characters, Ł and ł, are left untouched. My code: var_dump(preg_replace('/\p{Mn}/u', '',Normalizer::normalize('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq', Normalizer::FORM_KD))); and I get back: string(27) "qqqqZZCNASŁEOzzcnasłeoqqq". iconv works best for me. - Detrusion

Note: I'm reposting this from another similar question in the hope that it's helpful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.

Goosegog answered 1/5, 2012 at 22:47 Comment(2)
This class worked on all my test cases where all iconv-based solutions failed for me. Thanks! - Selfabasement
Thank you for this class. In 2017 the project is still alive and the class is working perfectly in PHP 7. - Striction
