Transliterate any convertible utf8 char into ascii equivalent

Is there a good solution out there that does this transliteration properly?

I've tried using iconv(), but it is awkward to use and does not behave as one might expect.

  • Using //TRANSLIT will replace what it can and leave everything nonconvertible as "?"
  • Using //IGNORE will not leave "?" in the text, but it will not transliterate either, and it raises an E_NOTICE when a nonconvertible char is found, so you have to call iconv with the @ error suppressor
  • Using //IGNORE//TRANSLIT (as some people suggested in the PHP forum) is actually the same as //IGNORE (I tried it myself on PHP versions 5.3.2 and 5.3.13)
  • Likewise, using //TRANSLIT//IGNORE is the same as //TRANSLIT

iconv() also uses the current locale settings to transliterate.

WARNING - a lot of text and code follows!

Here are some examples:

$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

OK, that did convert č ć ž š ä ö ü ß é ĕ ě ė ë ȩ and æ, but why not đ and ø?

// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'sv_SE'); // so will a Scandinavian locale fix ø? (ø is Danish/Norwegian rather than Swedish)
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any German would expect, since ä ö ü really are the same as ae oe ue

Let's try with //IGNORE:

echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"

// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"

// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + � + @

Even using the PECL intl Normalizer class (which is not always available even if you have PHP > 5.3.0, since the ICU package that intl uses may not be available to PHP, e.g. on certain hosting servers) produces the wrong result:

echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczsđ + aouß + eeeeee + æø€ + $ + ¶ + @

So is there any other way of doing this right, or is the only proper thing to do to define the transliteration tables yourself and use preg_replace() or str_replace()?

// appendix: On the ZF wiki I found a debate from 2008 about a proposal for Zend_Filter_Transliterate, but the project was dropped because conversion is not possible in some languages (e.g. Chinese). Still, for any Latin- and Cyrillic-based language, IMO this option should exist.

Curvature answered 28/11, 2012 at 21:19 Comment(9)
Why would you convert utf8 to ascii? utf8 is the greatest thing ever...Serosa
@Serosa : convert to url, to html id attribute value, matching similar wordsHensley
Well, for urls, you could simply use a regex replacing everything except space, a-z, A-Z and numbers. Special characters are not a good thing for that usage. Same goes for html attributes.Serosa
@Serosa čćžšđäöüøñæé.... are not special characters, and there is a big difference between converting them to ascii and simply removing them from the url stringHensley
I do not really understand what you want. If I say something like "重庆大学", this is UTF-8 but not convertible to ASCII. What are you calling ASCII? Are you speaking about a specific encoding or about one-byte-per-character text? -- no, OK, I have seen your comment above: you're looking for something to convert accented chars to standard chars.Rubin
@Ninsuo this is true, some utf8 chars cannot be converted to ascii equivalents, like Chinese, Korean, Cambodian, etc., but for other Latin- and Cyrillic-based languages it is possible.Hensley
@Serosa : One might be required to convert utf-8 to ascii when sending SMS to phones: this requires you to send GSM compatible characters (lower ascii only). See en.wikipedia.org/wiki/GSM_03.38 for details.Bolte
To the people asking why the OP would need this - here is a great example - in a music searching site, you might search for "fur elise" and expect to find "Für Elise" (with the umlaut on the u). We need a way to translate the diacritics and turn ü into a u - which is what us humans type in to search - I add a keyword field with all the fields TRANSLIT so the user can search against that.Scala
Of course Chinese can also be converted - just transliterated.Assign
K
12

The toAscii() function of Patchwork\Utf8 does exactly this, see:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php

It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
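
The general idea of combining decomposition with an explicit table can be sketched roughly as follows. This is not the code of Patchwork\Utf8: toAsciiSketch() and its small $fallback table are hypothetical names made up for illustration, and the sketch assumes valid UTF-8 input. Letters such as đ, ø, æ and ß have no Unicode decomposition, which is why Normalizer alone leaves them untouched and a table is still needed.

```php
<?php
// Hypothetical helper for illustration; not part of Patchwork\Utf8 or any library.
function toAsciiSketch(string $text): string
{
    // Step 1: decompose accented letters (NFKD) and drop the combining marks.
    // Needs the intl extension; without it this step is simply skipped.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
        if ($decomposed !== false) {
            $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
            if ($stripped !== null) {
                $text = $stripped;
            }
        }
    }

    // Step 2: letters without a decomposition need an explicit table.
    // This table is illustrative only and far from complete.
    static $fallback = [
        'đ' => 'd', 'Đ' => 'D', 'ø' => 'o', 'Ø' => 'O',
        'æ' => 'ae', 'Æ' => 'AE', 'ß' => 'ss', '€' => 'EUR',
    ];
    $text = strtr($text, $fallback);

    // Step 3: replace whatever is still outside ASCII with "?".
    $ascii = preg_replace('/[^\x00-\x7F]/u', '?', $text);

    // preg_replace() returns null on invalid UTF-8; give back an empty string then.
    return $ascii === null ? '' : $ascii;
}

echo toAsciiSketch('čćžšđ + äöüß + éĕěėëȩ + æø€ + ¶'), "\n";
```

Whether step 1 does anything depends on intl being loaded, so the output differs between servers; the table in step 2 is the part you would have to extend for your own character sets.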

Kearse answered 14/11, 2013 at 15:48 Comment(6)
Done some initial testing and it seems ok. Nice job there!Hensley
How to use this class?Habsburg
@bornie see the documentation on his github repo at github.com/nicolas-grekas/Patchwork-UTF8 . IMHO, this lib should be built right into PHP, if it hasn't already been! The fact it has over 2.7 million installs should tell you, for crying out loud.Cutoff
Broken link. Could someone fix?Scala
patchwork/utf8 is archived. Any good alternatives in 2021? Particularly interested in the toAscii() method.Horrified
Just found transliterator_transliterate('Any-Latin; Latin-ASCII;', $string) to be working for my use.Horrified
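
A minimal sketch of the transliterator_transliterate() approach from the comment above, assuming the intl extension is loaded. The wrapper name icuToAscii() is made up for this example; the transform IDs Any-Latin and Latin-ASCII come from ICU. Characters for which ICU has no ASCII rule may still come back unchanged, so check the result if you need strict ASCII.

```php
<?php
// Hypothetical wrapper; only transliterator_transliterate() is a real intl function.
function icuToAscii(string $text): string
{
    if (!function_exists('transliterator_transliterate')) {
        return $text; // intl not loaded: return the input unchanged
    }

    // Any-Latin: convert other scripts (e.g. Cyrillic) to Latin letters.
    // Latin-ASCII: fold accented Latin letters to ASCII where ICU has a rule.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);

    // The function returns false on failure; fall back to the input in that case.
    return $result === false ? $text : $result;
}

echo icuToAscii('Für Elise'), "\n";
```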
R
6

From this website, I found something that might help you :

function removeAccents($str)
{
  $a = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'Ĳ', 'ĳ', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ŉ', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ');
  $b = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o');
  return str_replace($a, $b, $str);
}

Usage example :

$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo removeAccents($text);

Displays :

Regular ascii text + cczsd + aous + eeeeeȩ + aeo€ + $ + ¶ + @

You'll need to improve it, but you get the idea... If there is a direct way to do such a work, I don't know it.

Rubin answered 28/11, 2012 at 21:41 Comment(3)
yes, I already suggested this solution as a "last-solution", but it is not what I am hoping for. +1 for trying thoughHensley
This works perfect! iconv's TRANSLIT is missing many characters.Dierdredieresis
I came up with a similar solution but with strtr instead of str_replace which is slightly faster in some cases eg longer strings.Spondee
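
The strtr() variant mentioned in the comment above might look roughly like this. The function name removeAccentsStrtr() and the shortened table are assumptions for illustration; a real table would carry the full list from the answer. When given an array, strtr() tries the longest keys first and never rescans text it has already replaced, so a replacement such as 'ss' cannot be touched again by a later pair.

```php
<?php
// Hypothetical name and a deliberately short table; extend it as needed.
function removeAccentsStrtr(string $str): string
{
    static $map = [
        'č' => 'c', 'ć' => 'c', 'ž' => 'z', 'š' => 's', 'đ' => 'd',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
        'é' => 'e', 'æ' => 'ae', 'ø' => 'o', '€' => 'EUR',
    ];

    // One pass over the string; characters not in the table stay as they are.
    return strtr($str, $map);
}

echo removeAccentsStrtr('čćžšđ + äöüß + æø€'), "\n"; // prints "cczsd + aouss + aeoEUR"
```

An associative array also keeps each character next to its replacement, which is easier to maintain than two parallel arrays that must stay aligned.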
A
2

I think setting the right locale is the way to go. Be aware that the specific locale must also be available on the system; check it using locale -a. If you only have de_DE.utf8, you also have to use exactly that name: setlocale(LC_ALL, 'de_DE.utf8')
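
A small sketch of how to check this from PHP: setlocale() accepts an array of candidate names, tries them in order, and returns the name it actually set or false if none is installed. The candidate names below are assumptions; use whatever locale -a lists on your system. What iconv() prints afterwards depends on the iconv implementation and the installed locale data.

```php
<?php
// Candidate names are guesses; run `locale -a` to see what is installed.
$candidates = ['de_DE.UTF-8', 'de_DE.utf8', 'de_DE'];

// setlocale() tries each name in order and returns the one it set, or false.
$locale = setlocale(LC_ALL, $candidates);

if ($locale === false) {
    echo "No German locale installed; the previous locale stays active\n";
} else {
    echo "Using locale: $locale\n";
    // Result varies by system; with glibc and a de_DE locale the question reports "aeoeue".
    echo iconv('UTF-8', 'ASCII//TRANSLIT', 'äöü'), "\n";
}
```

Checking the return value matters because a failed setlocale() call is silent: iconv() then keeps transliterating with whatever locale was active before.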

Assign answered 30/5, 2016 at 12:43 Comment(0)
P
1

As none of the solutions above worked for me (I needed to transliterate many European character sets to ASCII), I finally found this old PECL package, which just seemed to work: http://derickrethans.nl/projects.html#translit. I had problems especially with Cyrillic character sets, and this seems to handle them perfectly.

Plovdiv answered 6/8, 2014 at 11:53 Comment(2)
Please open your own question if your answer is not specific to the initial question above.Only
@Only This isn't a question but a suggestionArsenal
S
1

If I have understood you correctly, I may have an answer for you: I've written a basic PHP class that allows you to convert most characters into their ASCII equivalents.

Below is a screenshot of its output converting various composer names with accents in their name.

You can fork it from github here https://github.com/LukeMadhanga/transliterator.

NB: It is as of yet undocumented but it should be p*** easy to get to grips with.

Sutlej answered 11/9, 2015 at 15:49 Comment(0)
E
-1

I wrote this https://github.com/marekkowalczyk/sanitize in Golang; works well out of the box and is easy to improve.

Exception answered 14/3, 2021 at 23:18 Comment(0)
