How to remove accents and turn letters into "plain" ASCII characters? [duplicate]

Asked 22/8, 2010 at 18:21 Answered 15/4, 2013 at 18:40

What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?

Is there a simple, built-in way that I'm missing, or a regular expression?

Vitia answered 22/8, 2010 at 18:21 Comment(2)

@Peeps: telling users to search google is against Stack Overflow's etiquette. If the question doesn't exist on the website it's better for everyone if it is asked, even if the OP already knows the answer, since it will increase our number of non-duplicate questions. So maybe next time if someone searches it with google they will find this very question, and we will have one more user. – Siobhansion 22/8, 2010 at 18:30

@Andreas good point. However, this is most certainly a SO duplicate, so Peeps kind of has a small point :) I'm too lazy to search for it right now, though. – Lepper 22/8, 2010 at 18:33

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
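A slightly more defensive version of the call above might look like the sketch below. It assumes a glibc-style iconv, where transliteration depends on the LC_CTYPE locale. The helper name iconv_to_ascii and the locale names are placeholders, so check `locale -a` for what your system actually provides. The exact output still varies between iconv implementations.

```php
<?php
// Hypothetical helper; the name and the locale list are assumptions.
function iconv_to_ascii(string $utf8): string
{
    // Transliteration is locale-dependent on some systems, so try a few
    // common UTF-8 locale names. setlocale() uses the first one that exists.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // //TRANSLIT approximates characters, //IGNORE drops what cannot be mapped.
    // The @ hides the notice iconv() raises for characters it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);

    if ($ascii === false) {
        throw new RuntimeException('iconv() could not convert the input');
    }

    return $ascii;
}

echo iconv_to_ascii('ÈâuÑ'), PHP_EOL;
```

Throwing on failure keeps a false return value from silently ending up in later string operations.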

Herby answered 22/8, 2010 at 18:27 Comment(9)

+1 Beat me to it. This should work best. However, note that this tends to fail if there are invalid characters in the input (using ASCII//TRANSLIT//IGNORE should help) and as so often, if encountering problems, the User Contributed Notes are a good read. php.net/manual/en/function.iconv.php – Lepper 22/8, 2010 at 18:28

For some reason, I sometimes can't get this to work. See codepad.viper-7.com/SUufA4. On another machine, I got "`E^au~N". Not what was desired, though. – Espinal 22/8, 2010 at 18:38

Nice, simple and small and works...for me – Vitia 22/8, 2010 at 18:38

This iconv approach has some conflicts, so I will ask a similar question – Vitia 22/8, 2010 at 18:40

This did not work for me at first. Accent characters just became ? characters. As per a comment on iconv() on the PHP manual page, I first ran: setlocale(LC_ALL,'en_CA.utf8'); and then everything worked perfectly. The 'en_CA.utf8' was the default locale on my system. Try "locale -a" to see a list of available locales – Etymon 23/2, 2013 at 1:49

This iconv() solution works for many characters, but not all. For example, "Colbjørnsensgade" becomes "Colbj?rnsensgade". That's why the transliterator_transliterate() solution by SimonSimCity is usually a better choice (but it requires the right libraries to be installed). – Infiltration 6/6, 2015 at 5:47

This doesn't work for all russian characters – Terbecki 31/7, 2015 at 9:34

This fixed the question marks for me. setlocale(LC_ALL, "en_US.utf8"); $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string); – Desman 25/6, 2016 at 23:18

Just another upvote for these comments here. I spent a few hours today trying to debug why my ASCII//TRANSLIT//IGNORE code wasn't working on a German a-umlaut. My development platform worked fine. The live server failed. After trying a thousand things, the setlocale call fixed it, so I added it to both. – Accordant 6/6, 2017 at 14:6

I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):

var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
    "A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. ﬁ ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "

see: http://www.php.net/normalizer

EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.

EDIT2: I discovered that some characters are not covered by the transliteration I posted originally. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)

I suppose that by "Latin" they mean the Latin alphabet and not one of the Latin character sets. But anyway, in my opinion, Latin-ASCII should then transliterate it to something in ASCII ...

EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.
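For repeated calls, the same rule string can be compiled once with the intl extension's Transliterator class instead of being parsed on every call. This is a minimal sketch, assuming ext-intl is loaded; the function name translit_to_ascii is made up for illustration.

```php
<?php
// Hypothetical wrapper around the rule string used in the answer above.
function translit_to_ascii(string $text): string
{
    static $transliterator = null;

    if ($transliterator === null) {
        // Same rules as above: to Latin script, then to ASCII, then drop leftovers.
        $transliterator = Transliterator::create(
            'Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove'
        );
        if ($transliterator === null) {
            throw new RuntimeException('Could not create the transliterator');
        }
    }

    $result = $transliterator->transliterate($text);
    if ($result === false) {
        throw new RuntimeException('Transliteration failed');
    }

    return $result;
}

if (extension_loaded('intl')) {
    echo translit_to_ascii('Übérmensch på høyeste nivå'), PHP_EOL;
}
```

Dropping the last rule ('[\u0080-\u7fff] remove') keeps characters that have no ASCII equivalent instead of deleting them.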

Problem answered 15/4, 2013 at 18:40 Comment(4)

Note: it needs php_intl.dll extension enabled – Farnese 29/8, 2013 at 18:42

I agree, this was the best function for me too! (and I tried many) – Merman 7/1, 2014 at 0:3

Really good solution, very easy to use and more useful than the other solutions using str_replace. – Holman 18/8, 2014 at 12:22

It should be noted that this will not just transliterate the text (as the OP asked), but will remove some characters too, e.g. € (euro sign) will be removed. Just pass 'Any-Latin; Latin-ASCII;' as the first param to keep those. Optionally, you can then use iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str) to transform "€" to "EUR". – Underdrawers 4/2, 2015 at 15:33

Reposting this on request of @palantir ...

I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...

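function toASCII( $str )
{
    // utf8_decode() converts to ISO-8859-1 so that strtr() can map byte to byte.
    // Characters outside ISO-8859-1 become '?', and utf8_decode() is
    // deprecated as of PHP 8.2.
    return strtr(utf8_decode($str),
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}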
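A multibyte-safe variation on the same idea is to pass strtr() an array of replacement pairs instead of two byte strings. This is not the function above, just a sketch: the map is deliberately tiny and the name to_ascii_map is hypothetical. It assumes the source file is saved as UTF-8 with precomposed characters. Because whole UTF-8 sequences are matched, no utf8_decode() is needed, characters missing from the map pass through unchanged, and ligatures can expand to two letters.

```php
<?php
// Hypothetical helper; extend the map with whatever characters you need.
function to_ascii_map(string $str): string
{
    static $map = [
        'Æ' => 'AE', 'æ' => 'ae',
        'Œ' => 'OE', 'œ' => 'oe',
        'ß' => 'ss',
        'Ł' => 'L',  'ł' => 'l',
        'Ø' => 'O',  'ø' => 'o',
        'ó' => 'o',  'ź' => 'z',
        'È' => 'E',  'â' => 'a', 'Ñ' => 'N',
    ];

    // With an array argument, strtr() replaces whole substrings,
    // so multibyte characters are never split.
    return strtr($str, $map);
}

echo to_ascii_map('Œuvre, Æon, Łódź'), PHP_EOL;
```

The trade-off is the one the answer wanted to avoid: the array has to list every character explicitly.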

Innovation answered 28/7, 2011 at 10:51 Comment(4)

you should also put in the following letters: ő, Ő, ű, Ű. Thanks. :) – Colorblind 29/11, 2012 at 11:21

This is not a reliable method. It does not work for Polish accented characters like ŻŹĆŃĄŚŁĘÓżźćńąśłęó. Try

var_dump(strtr(utf8_decode('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq'), utf8_decode('ŻŹĆŃĄŚŁĘÓżźćńąśłęó'),'ZZCNASLEOzzcnasleo'));

I got string(25) "qqqqeeeeeeeeOeeeeeeeeoqqq". iconv is more reliable: with var_dump(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq')); I get string(25) "qqqqZZCNASLEOzzcnasleoqqq" – Detrusion 18/5, 2013 at 21:22

converts 'Горловка' for me to YYYYYYYY , not good – Carrier 28/10, 2016 at 7:24

It's not the best in terms of performance and it also produces an incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one. – Ehtelehud 22/5, 2019 at 8:35

You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:

preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))

Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:

preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
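Wrapped in a function with basic error handling, the Normalizer approach could look like this. It is a minimal sketch that assumes the intl extension is available; strip_marks is a hypothetical name. Letters such as Ł or ø have no Unicode decomposition, so they pass through unchanged and would need a separate replacement step.

```php
<?php
// Hypothetical helper: decompose, then delete the combining marks.
function strip_marks(string $text): string
{
    // NFKD splits "È" into "E" plus a combining grave accent (U+0300).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // \p{Mn} matches non-spacing marks, i.e. the detached accents.
    $stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
    if ($stripped === null) {
        throw new RuntimeException('preg_replace() failed');
    }

    return $stripped;
}

if (extension_loaded('intl')) {
    echo strip_marks('ÈâuÑ'), PHP_EOL; // EauN
}
```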

Seena answered 22/8, 2010 at 18:28 Comment(11)

ISO-8859-1? Are you sure? Won't this leave at least ÄÖÜ in place (as their 8859-1 counterparts)? – Lepper 22/8, 2010 at 18:32

What’s the reason for the down vote? – Seena 22/8, 2010 at 18:32

Downvote isn't mine. However, the OP is not asking to remove non-alphabetic characters, is he? – Lepper 22/8, 2010 at 18:34

It was mine. Reverted now that you fixed it. – Espinal 22/8, 2010 at 18:35

@Pekka: The transliteration of ÈâuÑ using iconv gives `E^au~N. That’s why the following cleanup is used. – Seena 22/8, 2010 at 18:39

@Seena I see. I'm sorry, we have had this discussion in a duplicate somewhere already :) +1 for the most complete solution, then, that should be made the accepted one. Update: If I had any votes left – Lepper 22/8, 2010 at 18:40

By the way, what you say and your code don't match once again. FORM_D makes more sense. – Espinal 22/8, 2010 at 18:47

@Artefacto: Thanks for the remark; fixed it. And take a look at figure 6 in unicode.org/reports/tr15/#Norm_Forms. – Seena 22/8, 2010 at 18:52

@Seena OK, I guess it's a matter of preference, though strictly that normalization won't take care only of the marks. See also the other question of the OP. I took some, erm, inspiration from you (basically only replaced the [a-z] regex you then had with \p{M} and left Normalizer::FORM_D). – Espinal 22/8, 2010 at 19:9

The normalize function works for me. – Eadith 17/12, 2012 at 15:7

@Seena your Normalizer solution is quite good but in my case two characters Ł and ł are left untouched. My code:

var_dump(preg_replace('/\p{Mn}/u', '',Normalizer::normalize('qqqqŻŹĆŃĄŚŁĘÓżźćńąśłęóqqq', Normalizer::FORM_KD)));

and I get back: string(27) "qqqqZZCNASŁEOzzcnasłeoqqq". iconv works best for me. – Detrusion 18/5, 2013 at 21:34

Note: I'm reposting this from another similar question in the hope that it's helpful to others.

I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:

https://github.com/jbroadway/urlify

Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.

Goosegog answered 1/5, 2012 at 22:47 Comment(2)

This class worked on all my testcases where all iconv-based solutions failed for me. Thanks! – Selfabasement 3/4, 2013 at 8:15

Thank you for this class. In 2017 the project is still alive and the class is working perfectly in PHP7 – Striction 23/8, 2017 at 16:44
