php iconv translit for removing accents: not working as expected?

Asked 6/2, 2011 at 0:14 Answered 1/8, 2016 at 22:44

Solved php string unicode utf-8 unicode-normalization

consider this simple code:

echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');

it prints

`e

instead of just

e

do you know what I am doing wrong?

nothing changed after adding setlocale

setlocale(LC_COLLATE, 'en_US.utf8');
echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');

Samy answered 6/2, 2011 at 0:14 Comment(3)

First, this is a fundamentally evil and wrong thing to want to do. Second, the only reasonable approach is to render your code into Unicode’s Normalization Form D formed by canonical decomposition and then remove those resulting code points with the Mark property. It won’t “fix” everything, of course: Tschüß – Cid 6/2, 2011 at 12:2

Ignore tchris, this is THE way to do it, I use it in practice. The only error you made is that the locale "subclass" is setlocale(LC_CTYPE, 'en_US.UTF-8'); -> LC_TYPE, not _COLLATE. Tschüss. – Excitor 19/12, 2013 at 16:0

I'm having this same problem - it is certainly not LC_TYPE... that generates an error (for me at least). I've tried LC_ALL (which is what everyone else says) - with no effect. I'm putting in the string

CŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

and getting CSOEZsoez"YyenuA'A^A~A"AAAECE'E^E"EI'I^I"ID~NO'O^O~O"OOU'U^U"U'Yssa'a^a~a"aaaece'e^e"ei'i^i"id~no'o^o~o"oou'u^u"u'y"y – Evaporimeter 15/10, 2015 at 21:59

I have this standard function to return valid url strings without the invalid url characters. The magic seems to be in the line after the //remove unwanted characters comment.

This is taken from the Symfony framework documentation: http://www.symfony-project.org/jobeet/1_4/Doctrine/en/08 which in turn is taken from http://php.vrana.cz/vytvoreni-pratelskeho-url.php but i don't speak Czech ;-)

function slugify($text)
{
  // replace non letter or digits by -
  $text = preg_replace('#[^\\pL\d]+#u', '-', $text);

  // trim
  $text = trim($text, '-');

  // transliterate
  if (function_exists('iconv'))
  {
    $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
  }

  // lowercase
  $text = strtolower($text);

  // remove unwanted characters
  $text = preg_replace('#[^-\w]+#', '', $text);

  if (empty($text))
  {
    return 'n-a';
  }

  return $text;
}

echo slugify('é'); // --> "e"

Torment answered 6/2, 2011 at 0:32 Comment(3)

I know I could do a preg_replace like that after the transliteration by iconv... I only wanted to know whether the behaviour described in my first post is standard, or whether iconv can transliterate "better" – Samy 6/2, 2011 at 9:57

Sorry, but why are there 2 backslashes in the preg_replace? Shouldn't it be just [^\pL\d]? – Samy 6/2, 2012 at 13:8
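On the double-backslash question: in a single-quoted PHP string, `\\` collapses to one backslash, while a backslash before any other character (such as `p` or `d`) is kept literally. So both spellings should hand the same pattern to PCRE. A minimal check, with variable names chosen only for illustration:

```php
<?php
// Both literals are single-quoted, so "\\" becomes "\" and "\p" / "\d" stay as written.
$doubled = '#[^\\pL\d]+#u';
$single  = '#[^\pL\d]+#u';

var_dump($doubled === $single); // bool(true)
var_dump(strlen('\\pL'));        // int(3): backslash, p, L
```

Doubling the backslash is simply the defensive spelling; it would matter inside a double-quoted string, where more escape sequences are interpreted.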

What about plƒtre francin string where f does not get converted? – Silberman 16/7, 2013 at 7:34

cf @tchrist, with INTL php extension

https://www.php.net/manual/en/book.intl.php

preg_replace('/\pM*/u','',normalizer_normalize( $mystring, Normalizer::FORM_D));

eéèêëiîïoöôuùûüaâäÅ Ἥ ŐǟǠ ǺƶƈƉųŪŧȬƀ␢ĦŁȽŦ ƀǖ becomes

eeeeeiiiooouuuuaaaA Η OaA AƶƈƉuUŧOƀ␢ĦŁȽŦ ƀu
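The one-liner above can be wrapped into a small function with error handling. This is only a sketch: `strip_marks()` is a hypothetical name, it assumes the intl extension is loaded, and it uses `\pM+` rather than `\pM*` to avoid empty matches.

```php
<?php
// Hypothetical helper (illustrative name); requires the intl extension.
function strip_marks(string $text): string
{
    // Canonical decomposition: "é" becomes "e" followed by U+0301.
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // normalize() returns false on invalid input; hand the text back untouched.
        return $text;
    }

    // Remove every code point that has the Mark property.
    $stripped = preg_replace('/\pM+/u', '', $decomposed);

    return $stripped ?? $text;
}

echo strip_marks('eéèêë'), "\n"; // eeeee
echo strip_marks('ÏÐĐ'), "\n";   // IÐĐ (Ð and Đ have no decomposition)
```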

As tchrist emphasises, not all unicode characters are considered decomposable:

extract from Unicode charts:

U0080.pdf

00CF Ï LATIN CAPITAL LETTER I WITH DIAERESIS

≡ 0049 I 0308 ¨

NB: the symbol « ≡ » indicates an available decomposition

00D0 Ð LATIN CAPITAL LETTER ETH

→ 00F0 ð latin small letter eth

→ 0110 Đ latin capital letter d with stroke

→ 0189 Ɖ latin capital letter african d

no decomposition available, IMHO strangely (we could consider ASCII letter D as an acceptable equivalent).

U0100.pdf

0110 Đ LATIN CAPITAL LETTER D WITH STROKE

→ 00D0 Ð latin capital letter eth

→ 0111 đ latin small letter d with stroke

→ 0189 Ɖ latin capital letter african d

Even stranger: this one is identified as LATIN CAPITAL LETTER D (with stroke), but is not decomposable as such! Perhaps a cooler solution would be to get the Unicode description of each char, compare it with the description of each ASCII char, and replace accordingly. Anyone? ;-]
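A rough sketch of that name-based idea, assuming PHP 7+ with the intl extension (`IntlChar::charName()`). The function name `fold_by_name()` and the name pattern are my own assumptions: the pattern only accepts names of the form "LATIN CAPITAL/SMALL LETTER X" or "... X WITH ...", so ETH, SHARP S and ligatures are left alone.

```php
<?php
// Hypothetical helper (illustrative name); requires the intl extension, PHP 7.0+.
function fold_by_name(string $text): string
{
    $result = preg_replace_callback('/[^\x00-\x7F]/u', function (array $m): string {
        $name = IntlChar::charName($m[0]);

        // Accept e.g. "LATIN CAPITAL LETTER D WITH STROKE",
        // reject e.g. "LATIN CAPITAL LETTER ETH" (no single base letter).
        $pattern = '/^LATIN (CAPITAL|SMALL) LETTER ([A-Z])(?: WITH .+)?$/';
        if (is_string($name) && preg_match($pattern, $name, $parts)) {
            return $parts[1] === 'SMALL' ? strtolower($parts[2]) : $parts[2];
        }

        return $m[0]; // leave everything else unchanged
    }, $text);

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8).
    return $result ?? $text;
}

echo fold_by_name('Đ'), "\n";    // D
echo fold_by_name('Łódź'), "\n"; // Lodz
echo fold_by_name('Ð'), "\n";    // Ð (name is "LATIN CAPITAL LETTER ETH")
```

Relying on character names is a heuristic, not something Unicode defines as an equivalence, so the results are worth reviewing before using them for anything other than slugs.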

cf http://unicode.org/Public/UNIDATA/UnicodeData.txt

Slavic answered 8/8, 2012 at 15:28 Comment(1)

This is the only one that worked for me, on vanilla PHP7.2. – Kanazawa 27/6, 2019 at 12:2

It happened to me with pure iconv, without PHP. The trick was to set the LANG environment variable to en_US.UTF-8 (it was hu_HU.UTF-8 before, in my case). After that it worked as expected.
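Inside PHP, the closest equivalent is probably to set LC_CTYPE with setlocale() before calling iconv(). The sketch below makes two assumptions: that the `en_US.UTF-8` locale is installed on the machine, and that the iconv implementation in use honours the locale at all. `translit_with_locale()` is a hypothetical name. It also reports whether setlocale() succeeded, since a failed call returns false and silently leaves the old locale in place.

```php
<?php
// Hypothetical helper (illustrative name). The default locale is an assumption;
// it has to be installed on the system (see `locale -a`).
function translit_with_locale(string $text, string $locale = 'en_US.UTF-8'): array
{
    // setlocale() returns false when the requested locale is unavailable.
    $localeSet = setlocale(LC_CTYPE, $locale) !== false;

    // iconv() returns false on failure; its output depends on the implementation.
    $result = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    return ['locale_set' => $localeSet, 'result' => $result];
}

var_dump(translit_with_locale('è'));
```

If `locale_set` comes back false, the setlocale() call had no effect, which would explain seeing "the same result as before".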

Mopup answered 1/7, 2013 at 13:22 Comment(0)

When doing transliteration, you have to make sure that your LC_COLLATE is properly set, otherwise the default POSIX will be used.

Look at https://www.php.net/manual/en/function.setlocale.php

Ellaelladine answered 6/2, 2011 at 0:22 Comment(1)

same result as before with setlocale, (see first post) – Samy 6/2, 2011 at 9:52

I'm tempted to say "nothing", although this is a little outside my expertise. PHP's iconv() is notorious, and the inspiration for many workarounds, including

dropping to the system's iconv utility (Unix & Linux)
crafting a lookup table
replacing all accented characters with an ASCII equivalent as kind of a preprocessing stage
setting LC_COLLATE (which doesn't seem to work for everyone)
using htmlentities() instead of iconv()

Read the comments for iconv() documentation for more inspiration. (Or commiseration. Too close to call.)
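As an illustration of the lookup-table workaround from the list above, here is a minimal sketch built on strtr(). The table is deliberately tiny and hand-written, and both `ASCII_FOLD_MAP` and `fold_with_table()` are hypothetical names; a real table would need many more entries.

```php
<?php
// Deliberately incomplete, hand-written table (an assumption for illustration only).
const ASCII_FOLD_MAP = [
    'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
    'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'ö' => 'o',
    'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
    'ç' => 'c', 'ñ' => 'n',
    'ß' => 'ss', 'æ' => 'ae', 'Ð' => 'D', 'Đ' => 'D',
];

// Hypothetical helper: replaces every key found in the table, leaves the rest alone.
function fold_with_table(string $text): string
{
    return strtr($text, ASCII_FOLD_MAP);
}

echo fold_with_table('è'), "\n";      // e
echo fold_with_table('Tschüß'), "\n"; // Tschuss
```

The advantage over iconv() is that the output does not depend on the locale or on the iconv implementation; the cost is maintaining the table.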

Akel answered 6/2, 2011 at 0:50 Comment(0)

It seems the standard way to handle this is with a "remove accents" function, which you can find in libraries like Flourish or CMSs like WordPress. iconv seems to be unable to translate accents (and rightly so), since this isn't a good idea for anything other than URL slugs.

Dawna answered 28/10, 2011 at 15:18 Comment(0)


It seems that it depends on the PHP version...

TestCase #1

php -version

PHP 7.0.0RC8 (cli) (built: Nov 25 2015 12:36:50) ( NTS )
Copyright (c) 1997-2015 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2015 Zend Technologies
    with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2015, by Zend Technologies

php -r "var_dump(iconv('UTF-8', 'ASCII//TRANSLIT', 'è'));"

string(2) "`e"

TestCase #2

php -version

PHP 7.0.8-1~dotdeb+8.1 (cli) ( NTS )
Copyright (c) 1997-2016 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
    with Zend OPcache v7.0.8-1~dotdeb+8.1, Copyright (c) 1999-2016, by Zend Technologies

php -r "var_dump(iconv('UTF-8', 'ASCII//TRANSLIT', 'è'));"

string(1) "e"
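One possible explanation, which the two test cases do not confirm, is that the builds are linked against different iconv implementations rather than the PHP version itself making the difference. The iconv extension exposes the `ICONV_IMPL` and `ICONV_VERSION` constants, so a small report like the following sketch (the function name `iconv_report()` is hypothetical) could be run on both machines to compare:

```php
<?php
// Hypothetical helper (illustrative name): collects the facts worth comparing
// between two machines that transliterate differently.
function iconv_report(): array
{
    return [
        'php_version'   => PHP_VERSION,
        'iconv_impl'    => ICONV_IMPL,     // name of the iconv implementation
        'iconv_version' => ICONV_VERSION,  // its version string
        'ctype_locale'  => setlocale(LC_CTYPE, '0'), // '0' queries without changing
        'result'        => iconv('UTF-8', 'ASCII//TRANSLIT', 'è'),
    ];
}

var_dump(iconv_report());
```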

Keyhole answered 1/8, 2016 at 22:44 Comment(0)
