PHP: Replace umlauts with closest 7-bit ASCII equivalent in a UTF-8 string
T

8

51

What I want to do is remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". I tried to utf8_decode the string and then use strtr on it, but since my source file is saved as a UTF-8 file, I can't enter the ISO-8859-15 characters for all umlauts: the editor inserts the UTF-8 characters instead.

Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?

echo strtr(utf8_decode($input), 
           'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');

UPDATE: Maybe I was a bit imprecise about what I am trying to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.

Teamster answered 1/10, 2008 at 15:32 Comment(3)
Keep in mind that the string you produce will not necessarily have the same meaning as the original string, as discussed in this similar question. It's a serviceable approach for cleaning file names, but probably not something you'd want to do if you are planning to display your new string as text.Megaphone
Thanks for the hint. However the resulting string will be used as a simplified version fallback for search if "binary search" fails. Even more simplifications will be applied after this one - to allow illiterates to still find what they are looking for :)Teamster
There actually is a valid reason to do it for displayed characters: generating HTML 4.01 compliant id attributes for navigation menus. For example, if I have <h3>Für Elise</h3> and I want to generate an id anchor above it, <a id="FurElise" /> is the best I can do and still be compliant with HTML 4.01, which may be necessary for some older browsers.Meathead
B
59
iconv("utf-8","ascii//TRANSLIT",$input);

Extended example
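
A minimal sketch of how the call above might be wrapped, taking the comments below into account. The helper name toAsciiIconv and the list of locale names are assumptions, not part of the original answer. The result still depends on the iconv implementation and the locales installed on the system, so no output is shown.

```php
<?php
// Hypothetical helper: transliterate a UTF-8 string to ASCII with iconv.
function toAsciiIconv(string $input): string
{
    // Remember the current LC_CTYPE locale so it can be restored afterwards.
    $previous = setlocale(LC_CTYPE, '0');

    // Only LC_CTYPE is changed (not LC_ALL), so number formatting is untouched.
    // Which of these locale names exists is system-dependent.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'en_US.utf8', 'C.UTF-8');

    // //TRANSLIT approximates characters, //IGNORE skips unconvertible ones.
    // The @ hides the "illegal character" notice; false is handled below.
    $result = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $input);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    // Fall back to the unchanged input if iconv gave up entirely.
    return $result === false ? $input : $result;
}

echo toAsciiIconv('lärm andré'), "\n";
```

Because setlocale() affects the whole process, this is probably only safe in single-threaded setups.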

Booth answered 1/10, 2008 at 15:38 Comment(9)
I had to add "setlocale(LC_ALL, 'en_US');" (sadly no locales for Germany seem to be available on my machine :( ), but then it works. Great! :)Teamster
Why does this solution return "o for ö on my machine, while in the examples in the PHP reference it returns oe?Avowed
This does not work for Cyrillic characters. They are converted to ? question marks instead.Capernaum
This bombs with a value of false and gives me a notice that illegal characters were encountered...Filiation
To spikey's comment: if you set your locale to de_*.UTF8 (de_DE.UTF8, de_CH.UTF8, etc.), then umlauts will be converted to *e (ü->ue). Set it to en_US.UTF8 to get the desired effect.Tarweed
I have the same problem as spikey, and the setlocale stuff doesn't help either.Crazed
setlocale() depends on your operating system, is not thread-safe and wreaks havoc if you do it wrong (such as treating commas as periods in conversions). Either be careful (using LC_CTYPE instead of LC_ALL in this case) or stay away from it unless you know exactly what you're doing.Bersagliere
Use "ascii//translit//ignore" to prevent "illegal characters encountered" error.Unasked
If iconv() with ASCII//TRANSLIT doesn't work for you with German umlauts (ä/ö/ü => ae/oe/ue), despite setting setlocale() to a German UTF-8 locale, this answer to another question was the solution for me, using transliterator_transliterate() with de-ASCII supplied via the transliterator build string.Rolan
V
33

A little trick that doesn't require setting locales or having huge translation tables:

function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }

    return $string;
}

The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
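
To see why this works, the same three steps can be run one at a time. This is only an illustrative sketch: the variable names are made up, and the pattern is copied from the function above.

```php
<?php
$input = 'lärm andré';

// Step 1: accented letters become named entities such as &auml; and &eacute;.
$entities = htmlentities($input, ENT_QUOTES, 'UTF-8');

// Step 2: keep only the base letter(s) captured at the start of each entity name.
$pattern  = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i';
$stripped = preg_replace($pattern, '$1', $entities);

// Step 3: decode any entities that did not match (quotes, ampersands, ...).
$output = html_entity_decode($stripped, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // l&auml;rm andr&eacute;
echo $output, "\n";   // larm andre
```

Note that ligature entities keep two letters (for example &aelig; becomes "ae"), and characters without a matching named entity are left unchanged.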

Voltcoulomb answered 10/5, 2011 at 13:14 Comment(1)
Works great for hungarianStaw
H
9

You can also try this:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);

However, you need to have the intl extension (http://php.net/manual/en/book.intl.php) available.

Horace answered 3/2, 2016 at 13:12 Comment(0)
T
1

Okay, found an obvious solution myself, but it's not the best concerning performance...

echo strtr(utf8_decode($input), 
           utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
           'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
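
A possible variant that avoids utf8_decode() altogether (it is deprecated as of PHP 8.2): the array form of strtr() compares byte sequences, so multi-byte UTF-8 keys work directly. This is only a sketch; the function name replaceWithMap is hypothetical and the map is deliberately incomplete.

```php
<?php
// Hypothetical helper: map each accented character to a single ASCII letter.
function replaceWithMap(string $input): string
{
    // Keys are UTF-8 strings, so this file must be saved as UTF-8.
    // Extend the map with whatever characters your data contains.
    static $map = [
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'é' => 'e', 'è' => 'e', 'ê' => 'e',
        'á' => 'a', 'à' => 'a', 'â' => 'a',
        'ß' => 's', 'Œ' => 'O', 'œ' => 'o',
    ];

    // With an array argument, strtr() replaces whole keys, longest first.
    return strtr($input, $map);
}

echo replaceWithMap('lärm andré'), "\n"; // larm andre
```
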
Teamster answered 1/10, 2008 at 15:33 Comment(3)
It's not the best in terms of performance and it also produces an incorrect result. Letters like Œ, Æ, etc. should decompose to two letters, not to one.Optimist
You have missed žščřďťňů, and that's just the ones I see on my keyboard. Whitelisting known characters is not the best solution.Case
@this.lau_ As mentioned in the question: I'm looking for the closest "one character ASCII", so no - two letter decomposition would not be correct for my use case. One letter is correct for what I'm looking to do.Teamster
F
1

If you are using WordPress, you can use the built-in function remove_accents( $string )

https://codex.wordpress.org/Function_Reference/remove_accents

However, I noticed a bug: it doesn't work on a string with a single character.

Firewood answered 1/6, 2018 at 14:15 Comment(1)
Despite not actually being an exact answer, I appreciate this answer as I'm using WordPress. So thanks! ;)Rajput
T
0

For Arabic and Persian users, I recommend this way to remove diacritics:

    $diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
    $search_txt = str_replace($diacritics, '', $search_txt);

To type diacritics with an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the code of the diacritic character. These are Alt codes from the Windows Arabic code page, not Unicode code points. These are the codes:

ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ ـ(0220)

Tecumseh answered 8/11, 2014 at 11:55 Comment(0)
G
0

I found that this one gives the most consistent results in French and German. With the meta tag set to utf-8, I have placed it in a function that returns a line from an array of words, and it works perfectly.

htmlentities($line, ENT_SUBSTITUTE, 'utf-8');
Germanic answered 24/8, 2016 at 0:18 Comment(1)
This will return HTML entities, e.g. München will become M&uuml;nchen. But the requested result should be Muenchen.Kaycekaycee
M
0

The canonical way to do this:

  1. Obtain the Normalization Form Canonical Decomposition of the text. See https://unicode.org/reports/tr15/ for Unicode Normalization Forms.
  2. Remove nonspacing marks.
  3. Obtain the Normalization Form Canonical Composition of the remaining text.

https://unicode-org.github.io/icu/userguide/transforms/general/

For example, to remove accents from characters, use the following transform:

NFD; [:Nonspacing Mark:] Remove; NFC.

I am a bit unsure why they have given this example as such when the page also notes

each transform rule consists of two colons followed by a transform name.

So we will add those. You need the intl extension which wraps the ICU library.

$t = \Transliterator::createFromRules(':: NFD; ::[:Nonspacing Mark:] Remove; :: NFC;');

Example

print $t->transliterate('أ');

This transforms U+0623 (Arabic Letter Alef with Hamza Above) to U+0627 (Arabic Letter Alef), i.e. it works with non-Latin letters and their accents as well.

You can replace [:Nonspacing Mark:] with [:Mn:].
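
The three steps listed at the top of this answer can also be sketched without a transform rule string, using the Normalizer class (also part of intl) and the PCRE property \p{Mn}. The function name stripMarks is hypothetical, and this is an alternative illustration rather than what the ICU guide shows.

```php
<?php
// Hypothetical helper: NFD, remove nonspacing marks, then NFC.
function stripMarks(string $input): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension is required.');
    }

    // Step 1: canonical decomposition, e.g. "ä" becomes "a" + U+0308.
    $decomposed = Normalizer::normalize($input, Normalizer::FORM_D);
    if ($decomposed === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // Step 2: remove nonspacing marks (general category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    if ($stripped === null) {
        throw new RuntimeException('Removing marks failed.');
    }

    // Step 3: canonical composition of whatever remains.
    $composed = Normalizer::normalize($stripped, Normalizer::FORM_C);
    return $composed === false ? $stripped : $composed;
}

echo stripMarks('lärm andré'), "\n"; // larm andre
```

Letters that have no canonical decomposition, such as ø or ß, pass through unchanged with this approach.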

Melosa answered 20/6, 2023 at 16:24 Comment(0)
