Transliterate any convertible utf8 char into ascii equivalent

Asked 28/11, 2012 at 21:19 Answered 14/3, 2021 at 23:18

Solved php utf-8 ascii iconv transliteration

Is there any good solution out there that does this transliteration in a good manner?

I've tried using iconv(), but it is very annoying and does not behave as one might expect.

Using //TRANSLIT will try to replace what it can, leaving everything nonconvertible as "?"
Using //IGNORE will not leave "?" in the text, but it will not transliterate either, and it raises E_NOTICE when a nonconvertible char is found, so you have to call iconv with the @ error suppressor
Using //IGNORE//TRANSLIT (as some people suggested in the PHP forum) is actually the same as //IGNORE (tried it myself on PHP versions 5.3.2 and 5.3.13)
Also, using //TRANSLIT//IGNORE is the same as //TRANSLIT

It also uses current locale settings to transliterate.

WARNING - a lot of text and code is following!

Here are some examples:

$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

Ok, that did convert č ć š ä ö ü ß é ĕ ě ė ë ȩ and æ, but why not đ and ø?

// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'sv_SE'); // so this will fix swedish ø?
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any german would expect since ä ö ü really is same as ae oe ue

Let's try with //IGNORE:

echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"

// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"

// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + � + @

Even using the PECL intl Normalizer class (which is not always available even with PHP > 5.3.0, since the ICU library that intl depends on may be missing, e.g. on certain hosting servers) produces the wrong result:

echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczsđ + aouß + eeeeee + æø€ + $ + ¶ + @

So is there any other way of doing this right, or is the only proper option to use preg_replace() or str_replace() with transliteration tables you define yourself?

// appendix: I found a 2008 debate on the ZF wiki about a proposal for Zend_Filter_Transliterate. The project was dropped because some languages (e.g. Chinese) cannot be converted, but IMO this option should still exist for any Latin- and Cyrillic-based language.

Curvature answered 28/11, 2012 at 21:19 Comment(9)

Why would you convert utf8 to ascii? utf8 is the greatest thing ever... – Serosa 28/11, 2012 at 21:27

@Serosa : convert to url, to html id attribute value, matching similar words – Hensley 28/11, 2012 at 21:29

Well, for urls, you could simply use a regex replacing everything except space, a-z, A-Z and numbers. Special characters are not a good thing for that usage. Same goes for html attributes. – Serosa 28/11, 2012 at 21:31

@Serosa čćžšđäöüøñæé.... are not special characters, and converting them to ascii makes a great difference compared with simply removing them from the url string – Hensley 28/11, 2012 at 21:36

I do not really understand what you want. If I say something like "重庆大学", this is UTF-8 but not convertible to ASCII. What are you calling ASCII? Are you speaking about a specific encoding or about byte-for-byte chars? -- OK, I've seen your comment above: you're looking for something to convert accented chars to standard chars. – Rubin 28/11, 2012 at 21:37

@Ninsuo this is true, some utf8 chars cannot be converted to ascii equivalents, like Chinese, Korean, Cambodian, etc., but for other Latin- and Cyrillic-based languages it is possible. – Hensley 28/11, 2012 at 21:39

@Serosa : One might be required to convert utf-8 to ascii if he is sending SMS on phones: This require you to send GSM compatible characters (lower ascii only). See en.wikipedia.org/wiki/GSM_03.38 for details. – Bolte 2/9, 2013 at 18:29

To the people asking why the OP would need this, here is a great example: on a music search site, you might search for "fur elise" and expect to find "Für Elise" (with the umlaut on the u). We need a way to strip the diacritic and turn ü into u, which is what we humans type when searching. I add a keyword field containing all the fields run through TRANSLIT so the user can search against that. – Scala 15/10, 2015 at 21:32

Of course Chinese can also be converted - just transliterated. – Assign 30/5, 2016 at 12:43

The toAscii() function of Patchwork\Utf8 does exactly this, see:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php

It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.

Kearse answered 14/11, 2013 at 15:48 Comment(6)

Done some initial testing and it seems ok. Nice job there! – Hensley 14/11, 2013 at 23:21

How to use this class? – Habsburg 5/11, 2014 at 16:46

@bornie see the documentation on his github repo at github.com/nicolas-grekas/Patchwork-UTF8 . IMHO, this lib should be built right into PHP, if it hasn't already been! The fact it has over 2.7 million installs should tell you, for crying out loud. – Cutoff 29/12, 2014 at 18:40

Broken link. Could someone fix? – Scala 15/10, 2015 at 21:33

patchwork/utf8 is archived. Any good alternatives in 2021? Particularly interested in the toAscii() method. – Horrified 2/3, 2021 at 5:4

Just found transliterator_transliterate('Any-Latin; Latin-ASCII;', $string) to be working for my use. – Horrified 2/3, 2021 at 5:12
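The intl call mentioned in the last comment can be wrapped so that anything it leaves untouched is still forced into ASCII. This is only a minimal sketch: the function name toAsciiSketch and the '?' placeholder are made up for illustration, and the output for any given character depends on the ICU version behind your intl extension, so no expected output is shown.

```php
<?php
// Hypothetical helper, not part of any library.
function toAsciiSketch(string $text, string $placeholder = '?'): string
{
    // The intl extension may be missing on some hosts, so check first.
    if (function_exists('transliterator_transliterate')) {
        // Any-Latin converts other scripts to Latin; Latin-ASCII then folds
        // accented letters and ligatures to ASCII where ICU has a rule.
        $converted = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if ($converted !== false) {
            $text = $converted;
        }
    }
    // Whatever is still outside ASCII gets the placeholder.
    // preg_replace() returns null on invalid UTF-8 input, hence the fallback.
    return preg_replace('/[^\x00-\x7F]/u', $placeholder, $text) ?? '';
}

echo toAsciiSketch('Regular ascii text + čćžšđ + äöüß + æø€ + ¶'), "\n";
```

Without intl the function still returns pure ASCII, but every non-ASCII character becomes the placeholder, so on such hosts a hand-written table is probably the better fallback.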

From this website, I found something that might help you :

function removeAccents($str)
{
  $a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'Ĳ', 'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
  $b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
  return str_replace($a, $b, $str);
}

Usage example :

$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo removeAccents($text);

Displays :

Regular ascii text + cczsd + aous + eeeeeȩ + aeo€ + $ + ¶ + @

You'll need to improve it, but you get the idea... If there is a direct way to do such a work, I don't know it.

Rubin answered 28/11, 2012 at 21:41 Comment(3)

yes, I already suggested this solution as a "last-solution", but it is not what I am hoping for. +1 for trying though – Hensley 28/11, 2012 at 21:53

This works perfect! iconv's TRANSLIT is missing many characters. – Dierdredieresis 15/4, 2016 at 12:16

I came up with a similar solution but with strtr instead of str_replace which is slightly faster in some cases eg longer strings. – Spondee 15/5, 2016 at 7:28
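The strtr() variant from the last comment could look roughly like this. It is a sketch only: the function name removeAccentsStrtr is made up, and the map is a deliberately short excerpt that you would extend with the full table from the answer.

```php
<?php
// Hypothetical helper; the map below is a short illustrative excerpt.
function removeAccentsStrtr(string $str): string
{
    static $map = [
        'č' => 'c', 'ć' => 'c', 'ž' => 'z', 'š' => 's', 'đ' => 'd',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
        'æ' => 'ae', 'ø' => 'o', 'Ł' => 'L', 'ł' => 'l',
    ];
    // With an array argument, strtr() makes a single pass over the input
    // and never re-replaces text it has already substituted.
    return strtr($str, $map);
}

echo removeAccentsStrtr('Für Elise + čćžšđ + æø'), "\n";
// prints "Fur Elise + cczsd + aeo"
```

Keeping each character next to its replacement in one associative array also avoids the two parallel arrays drifting out of alignment, which is easy to miss in a table of this size.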

I think setting the right locale is the way to go. Be aware that the specific locale must also be installed on the system; check with locale -a. If you only have de_DE.utf8, you have to call setlocale(LC_ALL, 'de_DE.utf8').
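The availability check can also be done from PHP itself, because setlocale() returns false when none of the requested locales is installed. Below is a minimal sketch, assuming a glibc-style iconv that takes the locale into account for //TRANSLIT. The function name translitWithLocale and the candidate locale names are illustrative only.

```php
<?php
// Hypothetical helper: tries candidate locales in order and reports which
// one (if any) was applied, alongside the transliterated text.
function translitWithLocale(string $text, array $candidates): array
{
    $previous = setlocale(LC_CTYPE, '0');          // '0' only reads the current setting
    $applied  = setlocale(LC_CTYPE, $candidates);  // first installed candidate, or false

    // iconv() returns false if the conversion fails outright.
    $result = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);            // restore the original locale
    }
    return ['locale' => $applied, 'text' => $result];
}

$out = translitWithLocale('äöüß', ['de_DE.UTF-8', 'de_DE.utf8', 'en_US.UTF-8']);
var_dump($out['locale']); // false means none of the candidates is installed
var_dump($out['text']);
```

If 'locale' comes back false, the transliteration ran under whatever locale was active before the call, which would explain results full of "?" like the first example in the question.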

Assign answered 30/5, 2016 at 12:43 Comment(0)

As none of the solutions above worked for me (I needed to transliterate many European character sets to ASCII), I finally found this old PECL package which just seemed to work http://derickrethans.nl/projects.html#translit . I had problems especially with cyrillic character sets, and this seems to handle them perfectly.

Plovdiv answered 6/8, 2014 at 11:53 Comment(2)

Please open a separate question if your answer is not specific to the initial question above. – Only 6/8, 2014 at 12:1

@Only This isn't a question but a suggestion – Arsenal 21/6, 2017 at 9:16

If I have understood you correctly, I may have an answer for you: I've written a basic PHP class that allows you to convert most characters into their ASCII equivalents.

Below is a screenshot of its output converting various composer names with accents in their name.

You can fork it from github here https://github.com/LukeMadhanga/transliterator.

NB: It is as of yet undocumented but it should be p*** easy to get to grips with.

Sutlej answered 11/9, 2015 at 15:49 Comment(0)

I wrote this https://github.com/marekkowalczyk/sanitize in Golang; works well out of the box and is easy to improve.

Exception answered 14/3, 2021 at 23:18 Comment(0)
