how to replace special characters with the ones they're based on in PHP?
Asked Answered
P

8

21

How do I replace:

  • "ã" with "a"
  • "é" with "e"

in PHP? Is this possible? I've read somewhere I could do some math with the ascii value of the base character and the ascii value of the accent, but I can't find any references now.

Pero answered 11/12, 2009 at 21:3 Comment(0)
H
15

This answer is incorrect. I didn't understand Unicode Normalization when I wrote it. Look at francadaval's comment and link

Check out the Normalizer class to do this. The documentation is good, so I'll just link it instead of repeating things here:

http://www.php.net/manual/en/class.normalizer.php

Specifically, the normalize member of that class:

http://www.php.net/manual/en/normalizer.normalize.php

Note that Unicode normalization has several forms, and you seem to want Normalization Form KD (NFKD) Compatibility Decomposition, though you should read the documentation to make sure.

You shouldn't try to roll your own function for this: There's way too many things that can go wrong, and using the provided function is a much better idea.

Hamitic answered 11/12, 2009 at 21:7 Comment(5)
This would actually be the cleanest (and most efficient way) to go as it wouldn't require managing any functions yourself (let alone carry unnecessary objects around)Naylor
Agreed, glad that PHP finally introduced this feature.Hasa
Are you sure this works? normalizer_normalize('olá', Normalizer::FORM_KD); // olaÌ I also tried all the other available forms and none seems to return just ola.Preponderate
It doesn't work that way. Look at my own answer to this question.Stepper
eliminating accents (and similar) is not the purpose of Normalizer.Boutwell
P
24

If you don't have access to the Normalizer class or just don't wish to use it you can use the following function to replace most (all?) of the common accentuations.

function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}
Preponderate answered 11/12, 2009 at 22:41 Comment(9)
Uncaught ReferenceError: ENT_QUOTES is not definedLadybird
@SimonArnold: Check your code for & symbols, or just copy+paste as is.Preponderate
when using it it removes all text when containing a single char like acute ect.Sample
My server returns nothing using this function, but when I removed the last two arguments from htmlentities, it works.Rosabella
@DaveScotese: You need to save your PHP files as UTF-8.Preponderate
It doesn't actually work for acute accent (try "ćwiek", for instance)Schweinfurt
@KonradMorawski: I guess it doesn't work with ć because it's not a valid (or recognizable) letter / acute accent combination for htmlentities (even trying to type that in Ubuntu yields: ç for me - and I have a dedicated ç button as I'm Portuguese). It does work if you try it with áéíóú however: codepad.org/BC4phlym.Preponderate
@AlixAxel that might be, but I don't see a rational reason why it would be excluded from the character set recognized by htmlentities. It's a valid character used in several languages - see en.wikipedia.org/wiki/%C4%86 The inability to type it in the Portuguese keyboard layout is irrelevant in this context. I have no way of knowing whether or not this would matter for the OP (or anyone reading this in an attempt to solve a similar problem - well, it did matter for me), but I think it is nevertheless worth to point it outSchweinfurt
@KonradMorawski: Indeed, my solution as is doesn't deal with all the characters in en.wikipedia.org/wiki/Latin-derived_alphabet. I've since made some additions but it doesn't work out of the box with htmlentities alone - mostly because htmlentities doesn't return ć when you encode it. I understand your need, just pointing it out that it works with some / common acute characters.Preponderate
H
15

This answer is incorrect. I didn't understand Unicode Normalization when I wrote it. Look at francadaval's comment and link

Check out the Normalizer class to do this. The documentation is good, so I'll just link it instead of repeating things here:

http://www.php.net/manual/en/class.normalizer.php

Specifically, the normalize member of that class:

http://www.php.net/manual/en/normalizer.normalize.php

Note that Unicode normalization has several forms, and you seem to want Normalization Form KD (NFKD) Compatibility Decomposition, though you should read the documentation to make sure.

You shouldn't try to roll your own function for this: There's way too many things that can go wrong, and using the provided function is a much better idea.

Hamitic answered 11/12, 2009 at 21:7 Comment(5)
This would actually be the cleanest (and most efficient way) to go as it wouldn't require managing any functions yourself (let alone carry unnecessary objects around)Naylor
Agreed, glad that PHP finally introduced this feature.Hasa
Are you sure this works? normalizer_normalize('olá', Normalizer::FORM_KD); // olaÌ I also tried all the other available forms and none seems to return just ola.Preponderate
It doesn't work that way. Look at my own answer to this question.Stepper
eliminating accents (and similar) is not the purpose of Normalizer.Boutwell
C
13

For those who don't have php 5.3, I found another solution that works well and seems very comprehensive. Here is a link to the author's website http://www.evaisse.net/2008/php-translit-remove-accent-unaccent-21001. Here is the function.

/**
 * Unaccent the input string string. An example string like `ÀØėÿᾜὨζὅБю`
 * will be translated to `AOeyIOzoBY`. More complete than :
 *   strtr( (string)$str,
 *          "ÀÁÂÃÄÅàáâãäåÒÓÔÕÖØòóôõöøÈÉÊËèéêëÇçÌÍÎÏìíîïÙÚÛÜùúûüÿÑñ",
 *          "aaaaaaaaaaaaooooooooooooeeeeeeeecciiiiiiiiuuuuuuuuynn" );
 *
 * @param $str input string
 * @param $utf8 if null, function will detect input string encoding
 * @author http://www.evaisse.net/2008/php-translit-remove-accent-unaccent-21001
 * @return string input string without accent
 */
function remove_accents( $str, $utf8=true )
{
    $str = (string)$str;
    if( is_null($utf8) ) {
        if( !function_exists('mb_detect_encoding') ) {
            $utf8 = (strtolower( mb_detect_encoding($str) )=='utf-8');
        } else {
            $length = strlen($str);
            $utf8 = true;
            for ($i=0; $i < $length; $i++) {
                $c = ord($str[$i]);
                if ($c < 0x80) $n = 0; # 0bbbbbbb
                elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
                elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
                elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
                elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
                elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
                else return false; # Does not match any model
                for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
                    if ((++$i == $length)
                        || ((ord($str[$i]) & 0xC0) != 0x80)) {
                        $utf8 = false;
                        break;
                    }

                }
            }
        }

    }

    if(!$utf8)
        $str = utf8_encode($str);

    $transliteration = array(
    'IJ' => 'I', 'Ö' => 'O','Œ' => 'O','Ü' => 'U','ä' => 'a','æ' => 'a',
    'ij' => 'i','ö' => 'o','œ' => 'o','ü' => 'u','ß' => 's','ſ' => 's',
    'À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A',
    'Æ' => 'A','Ā' => 'A','Ą' => 'A','Ă' => 'A','Ç' => 'C','Ć' => 'C',
    'Č' => 'C','Ĉ' => 'C','Ċ' => 'C','Ď' => 'D','Đ' => 'D','È' => 'E',
    'É' => 'E','Ê' => 'E','Ë' => 'E','Ē' => 'E','Ę' => 'E','Ě' => 'E',
    'Ĕ' => 'E','Ė' => 'E','Ĝ' => 'G','Ğ' => 'G','Ġ' => 'G','Ģ' => 'G',
    'Ĥ' => 'H','Ħ' => 'H','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I',
    'Ī' => 'I','Ĩ' => 'I','Ĭ' => 'I','Į' => 'I','İ' => 'I','Ĵ' => 'J',
    'Ķ' => 'K','Ľ' => 'K','Ĺ' => 'K','Ļ' => 'K','Ŀ' => 'K','Ł' => 'L',
    'Ñ' => 'N','Ń' => 'N','Ň' => 'N','Ņ' => 'N','Ŋ' => 'N','Ò' => 'O',
    'Ó' => 'O','Ô' => 'O','Õ' => 'O','Ø' => 'O','Ō' => 'O','Ő' => 'O',
    'Ŏ' => 'O','Ŕ' => 'R','Ř' => 'R','Ŗ' => 'R','Ś' => 'S','Ş' => 'S',
    'Ŝ' => 'S','Ș' => 'S','Š' => 'S','Ť' => 'T','Ţ' => 'T','Ŧ' => 'T',
    'Ț' => 'T','Ù' => 'U','Ú' => 'U','Û' => 'U','Ū' => 'U','Ů' => 'U',
    'Ű' => 'U','Ŭ' => 'U','Ũ' => 'U','Ų' => 'U','Ŵ' => 'W','Ŷ' => 'Y',
    'Ÿ' => 'Y','Ý' => 'Y','Ź' => 'Z','Ż' => 'Z','Ž' => 'Z','à' => 'a',
    'á' => 'a','â' => 'a','ã' => 'a','ā' => 'a','ą' => 'a','ă' => 'a',
    'å' => 'a','ç' => 'c','ć' => 'c','č' => 'c','ĉ' => 'c','ċ' => 'c',
    'ď' => 'd','đ' => 'd','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e',
    'ē' => 'e','ę' => 'e','ě' => 'e','ĕ' => 'e','ė' => 'e','ƒ' => 'f',
    'ĝ' => 'g','ğ' => 'g','ġ' => 'g','ģ' => 'g','ĥ' => 'h','ħ' => 'h',
    'ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ī' => 'i','ĩ' => 'i',
    'ĭ' => 'i','į' => 'i','ı' => 'i','ĵ' => 'j','ķ' => 'k','ĸ' => 'k',
    'ł' => 'l','ľ' => 'l','ĺ' => 'l','ļ' => 'l','ŀ' => 'l','ñ' => 'n',
    'ń' => 'n','ň' => 'n','ņ' => 'n','ʼn' => 'n','ŋ' => 'n','ò' => 'o',
    'ó' => 'o','ô' => 'o','õ' => 'o','ø' => 'o','ō' => 'o','ő' => 'o',
    'ŏ' => 'o','ŕ' => 'r','ř' => 'r','ŗ' => 'r','ś' => 's','š' => 's',
    'ť' => 't','ù' => 'u','ú' => 'u','û' => 'u','ū' => 'u','ů' => 'u',
    'ű' => 'u','ŭ' => 'u','ũ' => 'u','ų' => 'u','ŵ' => 'w','ÿ' => 'y',
    'ý' => 'y','ŷ' => 'y','ż' => 'z','ź' => 'z','ž' => 'z','Α' => 'A',
    'Ά' => 'A','Ἀ' => 'A','Ἁ' => 'A','Ἂ' => 'A','Ἃ' => 'A','Ἄ' => 'A',
    'Ἅ' => 'A','Ἆ' => 'A','Ἇ' => 'A','ᾈ' => 'A','ᾉ' => 'A','ᾊ' => 'A',
    'ᾋ' => 'A','ᾌ' => 'A','ᾍ' => 'A','ᾎ' => 'A','ᾏ' => 'A','Ᾰ' => 'A',
    'Ᾱ' => 'A','Ὰ' => 'A','ᾼ' => 'A','Β' => 'B','Γ' => 'G','Δ' => 'D',
    'Ε' => 'E','Έ' => 'E','Ἐ' => 'E','Ἑ' => 'E','Ἒ' => 'E','Ἓ' => 'E',
    'Ἔ' => 'E','Ἕ' => 'E','Ὲ' => 'E','Ζ' => 'Z','Η' => 'I','Ή' => 'I',
    'Ἠ' => 'I','Ἡ' => 'I','Ἢ' => 'I','Ἣ' => 'I','Ἤ' => 'I','Ἥ' => 'I',
    'Ἦ' => 'I','Ἧ' => 'I','ᾘ' => 'I','ᾙ' => 'I','ᾚ' => 'I','ᾛ' => 'I',
    'ᾜ' => 'I','ᾝ' => 'I','ᾞ' => 'I','ᾟ' => 'I','Ὴ' => 'I','ῌ' => 'I',
    'Θ' => 'T','Ι' => 'I','Ί' => 'I','Ϊ' => 'I','Ἰ' => 'I','Ἱ' => 'I',
    'Ἲ' => 'I','Ἳ' => 'I','Ἴ' => 'I','Ἵ' => 'I','Ἶ' => 'I','Ἷ' => 'I',
    'Ῐ' => 'I','Ῑ' => 'I','Ὶ' => 'I','Κ' => 'K','Λ' => 'L','Μ' => 'M',
    'Ν' => 'N','Ξ' => 'K','Ο' => 'O','Ό' => 'O','Ὀ' => 'O','Ὁ' => 'O',
    'Ὂ' => 'O','Ὃ' => 'O','Ὄ' => 'O','Ὅ' => 'O','Ὸ' => 'O','Π' => 'P',
    'Ρ' => 'R','Ῥ' => 'R','Σ' => 'S','Τ' => 'T','Υ' => 'Y','Ύ' => 'Y',
    'Ϋ' => 'Y','Ὑ' => 'Y','Ὓ' => 'Y','Ὕ' => 'Y','Ὗ' => 'Y','Ῠ' => 'Y',
    'Ῡ' => 'Y','Ὺ' => 'Y','Φ' => 'F','Χ' => 'X','Ψ' => 'P','Ω' => 'O',
    'Ώ' => 'O','Ὠ' => 'O','Ὡ' => 'O','Ὢ' => 'O','Ὣ' => 'O','Ὤ' => 'O',
    'Ὥ' => 'O','Ὦ' => 'O','Ὧ' => 'O','ᾨ' => 'O','ᾩ' => 'O','ᾪ' => 'O',
    'ᾫ' => 'O','ᾬ' => 'O','ᾭ' => 'O','ᾮ' => 'O','ᾯ' => 'O','Ὼ' => 'O',
    'ῼ' => 'O','α' => 'a','ά' => 'a','ἀ' => 'a','ἁ' => 'a','ἂ' => 'a',
    'ἃ' => 'a','ἄ' => 'a','ἅ' => 'a','ἆ' => 'a','ἇ' => 'a','ᾀ' => 'a',
    'ᾁ' => 'a','ᾂ' => 'a','ᾃ' => 'a','ᾄ' => 'a','ᾅ' => 'a','ᾆ' => 'a',
    'ᾇ' => 'a','ὰ' => 'a','ᾰ' => 'a','ᾱ' => 'a','ᾲ' => 'a','ᾳ' => 'a',
    'ᾴ' => 'a','ᾶ' => 'a','ᾷ' => 'a','β' => 'b','γ' => 'g','δ' => 'd',
    'ε' => 'e','έ' => 'e','ἐ' => 'e','ἑ' => 'e','ἒ' => 'e','ἓ' => 'e',
    'ἔ' => 'e','ἕ' => 'e','ὲ' => 'e','ζ' => 'z','η' => 'i','ή' => 'i',
    'ἠ' => 'i','ἡ' => 'i','ἢ' => 'i','ἣ' => 'i','ἤ' => 'i','ἥ' => 'i',
    'ἦ' => 'i','ἧ' => 'i','ᾐ' => 'i','ᾑ' => 'i','ᾒ' => 'i','ᾓ' => 'i',
    'ᾔ' => 'i','ᾕ' => 'i','ᾖ' => 'i','ᾗ' => 'i','ὴ' => 'i','ῂ' => 'i',
    'ῃ' => 'i','ῄ' => 'i','ῆ' => 'i','ῇ' => 'i','θ' => 't','ι' => 'i',
    'ί' => 'i','ϊ' => 'i','ΐ' => 'i','ἰ' => 'i','ἱ' => 'i','ἲ' => 'i',
    'ἳ' => 'i','ἴ' => 'i','ἵ' => 'i','ἶ' => 'i','ἷ' => 'i','ὶ' => 'i',
    'ῐ' => 'i','ῑ' => 'i','ῒ' => 'i','ῖ' => 'i','ῗ' => 'i','κ' => 'k',
    'λ' => 'l','μ' => 'm','ν' => 'n','ξ' => 'k','ο' => 'o','ό' => 'o',
    'ὀ' => 'o','ὁ' => 'o','ὂ' => 'o','ὃ' => 'o','ὄ' => 'o','ὅ' => 'o',
    'ὸ' => 'o','π' => 'p','ρ' => 'r','ῤ' => 'r','ῥ' => 'r','σ' => 's',
    'ς' => 's','τ' => 't','υ' => 'y','ύ' => 'y','ϋ' => 'y','ΰ' => 'y',
    'ὐ' => 'y','ὑ' => 'y','ὒ' => 'y','ὓ' => 'y','ὔ' => 'y','ὕ' => 'y',
    'ὖ' => 'y','ὗ' => 'y','ὺ' => 'y','ῠ' => 'y','ῡ' => 'y','ῢ' => 'y',
    'ῦ' => 'y','ῧ' => 'y','φ' => 'f','χ' => 'x','ψ' => 'p','ω' => 'o',
    'ώ' => 'o','ὠ' => 'o','ὡ' => 'o','ὢ' => 'o','ὣ' => 'o','ὤ' => 'o',
    'ὥ' => 'o','ὦ' => 'o','ὧ' => 'o','ᾠ' => 'o','ᾡ' => 'o','ᾢ' => 'o',
    'ᾣ' => 'o','ᾤ' => 'o','ᾥ' => 'o','ᾦ' => 'o','ᾧ' => 'o','ὼ' => 'o',
    'ῲ' => 'o','ῳ' => 'o','ῴ' => 'o','ῶ' => 'o','ῷ' => 'o','А' => 'A',
    'Б' => 'B','В' => 'V','Г' => 'G','Д' => 'D','Е' => 'E','Ё' => 'E',
    'Ж' => 'Z','З' => 'Z','И' => 'I','Й' => 'I','К' => 'K','Л' => 'L',
    'М' => 'M','Н' => 'N','О' => 'O','П' => 'P','Р' => 'R','С' => 'S',
    'Т' => 'T','У' => 'U','Ф' => 'F','Х' => 'K','Ц' => 'T','Ч' => 'C',
    'Ш' => 'S','Щ' => 'S','Ы' => 'Y','Э' => 'E','Ю' => 'Y','Я' => 'Y',
    'а' => 'A','б' => 'B','в' => 'V','г' => 'G','д' => 'D','е' => 'E',
    'ё' => 'E','ж' => 'Z','з' => 'Z','и' => 'I','й' => 'I','к' => 'K',
    'л' => 'L','м' => 'M','н' => 'N','о' => 'O','п' => 'P','р' => 'R',
    'с' => 'S','т' => 'T','у' => 'U','ф' => 'F','х' => 'K','ц' => 'T',
    'ч' => 'C','ш' => 'S','щ' => 'S','ы' => 'Y','э' => 'E','ю' => 'Y',
    'я' => 'Y','ð' => 'd','Ð' => 'D','þ' => 't','Þ' => 'T','ა' => 'a',
    'ბ' => 'b','გ' => 'g','დ' => 'd','ე' => 'e','ვ' => 'v','ზ' => 'z',
    'თ' => 't','ი' => 'i','კ' => 'k','ლ' => 'l','მ' => 'm','ნ' => 'n',
    'ო' => 'o','პ' => 'p','ჟ' => 'z','რ' => 'r','ს' => 's','ტ' => 't',
    'უ' => 'u','ფ' => 'p','ქ' => 'k','ღ' => 'g','ყ' => 'q','შ' => 's',
    'ჩ' => 'c','ც' => 't','ძ' => 'd','წ' => 't','ჭ' => 'c','ხ' => 'k',
    'ჯ' => 'j','ჰ' => 'h'
    );
    $str = str_replace( array_keys( $transliteration ),
                        array_values( $transliteration ),
                        $str);
    return $str;
}
//- remove_accents()
Cannonade answered 7/2, 2012 at 23:7 Comment(1)
Here are several more to add to this: 'ʼ' => "'",'̧' => '','ḩ' => 'h', 'ʼ'=>"'", '‘'=>"'", '’'=>"'", 'ừ'=>'u', 'ế'=>'e', 'ả'=>'a', 'ị'=>'i', 'ậ'=>'a', 'ệ'=>'e', 'ỉ'=>'i', 'ộ'=>'o', 'ồ'=>'o', 'ề'=>'e', 'ơ'=>'o', 'ạ'=>'a', 'ẵ'=>'a', 'ư'=>'u', 'ắ'=>'a', 'ằ'=>'a', 'ầ'=>'a', 'ḑ'=>'d', 'Ḩ'=>'H', 'Ḑ'=>'D', 'ḑ'=>'d', 'ş'=>'s', 'ā'=>'a', 'ţ'=>'t',Rowley
Q
8

Short str_replace use with custom chars:

<?php
  $original_string    = "¿Dónde está el niño que vive aquí? En el témpano o en el iglú. ÁFRICA, MÉXICO, ÍNDICE, CANCIÓN y NÚMERO.";

  $some_special_chars = array("á", "é", "í", "ó", "ú", "Á", "É", "Í", "Ó", "Ú", "ñ", "Ñ");
  $replacement_chars  = array("a", "e", "i", "o", "u", "A", "E", "I", "O", "U", "n", "N");

  $replaced_string    = str_replace($some_special_chars, $replacement_chars, $original_string);

  echo $replaced_string; // outputs '¿Donde esta el nino que vive aqui? En el tempano o en el iglu. AFRICA, MEXICO, INDICE, CANCION y NUMERO.'
?>
Quixote answered 9/12, 2010 at 0:41 Comment(0)
S
1

Especially when matching texts against each-other or against keywords, it is helpful to normalize the texts before. The following function removes all diacritics (marks like accents) from a given UTF8-encoded texts and returns ASCii-text.

Be sure to have the PHP-Normalizer-extension (intl and icu) installed.

Tipp: You may also want to map the text to lower case before execute matching procedures ...

<?php

function normalizeUtf8String( $s)
{
    // Normalizer-class missing!
    if (! class_exists("Normalizer", $autoload = false))
        return $original_string;


    // maps German (umlauts) and other European characters onto two characters before just removing diacritics
    $s    = preg_replace( '@\x{00c4}@u'    , "AE",    $s );    // umlaut Ä => AE
    $s    = preg_replace( '@\x{00d6}@u'    , "OE",    $s );    // umlaut Ö => OE
    $s    = preg_replace( '@\x{00dc}@u'    , "UE",    $s );    // umlaut Ü => UE
    $s    = preg_replace( '@\x{00e4}@u'    , "ae",    $s );    // umlaut ä => ae
    $s    = preg_replace( '@\x{00f6}@u'    , "oe",    $s );    // umlaut ö => oe
    $s    = preg_replace( '@\x{00fc}@u'    , "ue",    $s );    // umlaut ü => ue
    $s    = preg_replace( '@\x{00f1}@u'    , "ny",    $s );    // ñ => ny
    $s    = preg_replace( '@\x{00ff}@u'    , "yu",    $s );    // ÿ => yu


    // maps special characters (characters with diacritics) on their base-character followed by the diacritical mark
        // exmaple:  Ú => U´,  á => a`
    $s    = Normalizer::normalize( $s, Normalizer::FORM_D );


    $s    = preg_replace( '@\pM@u'        , "",    $s );    // removes diacritics


    $s    = preg_replace( '@\x{00df}@u'    , "ss",    $s );    // maps German ß onto ss
    $s    = preg_replace( '@\x{00c6}@u'    , "AE",    $s );    // Æ => AE
    $s    = preg_replace( '@\x{00e6}@u'    , "ae",    $s );    // æ => ae
    $s    = preg_replace( '@\x{0132}@u'    , "IJ",    $s );    // ? => IJ
    $s    = preg_replace( '@\x{0133}@u'    , "ij",    $s );    // ? => ij
    $s    = preg_replace( '@\x{0152}@u'    , "OE",    $s );    // Œ => OE
    $s    = preg_replace( '@\x{0153}@u'    , "oe",    $s );    // œ => oe

    $s    = preg_replace( '@\x{00d0}@u'    , "D",    $s );    // Ð => D
    $s    = preg_replace( '@\x{0110}@u'    , "D",    $s );    // Ð => D
    $s    = preg_replace( '@\x{00f0}@u'    , "d",    $s );    // ð => d
    $s    = preg_replace( '@\x{0111}@u'    , "d",    $s );    // d => d
    $s    = preg_replace( '@\x{0126}@u'    , "H",    $s );    // H => H
    $s    = preg_replace( '@\x{0127}@u'    , "h",    $s );    // h => h
    $s    = preg_replace( '@\x{0131}@u'    , "i",    $s );    // i => i
    $s    = preg_replace( '@\x{0138}@u'    , "k",    $s );    // ? => k
    $s    = preg_replace( '@\x{013f}@u'    , "L",    $s );    // ? => L
    $s    = preg_replace( '@\x{0141}@u'    , "L",    $s );    // L => L
    $s    = preg_replace( '@\x{0140}@u'    , "l",    $s );    // ? => l
    $s    = preg_replace( '@\x{0142}@u'    , "l",    $s );    // l => l
    $s    = preg_replace( '@\x{014a}@u'    , "N",    $s );    // ? => N
    $s    = preg_replace( '@\x{0149}@u'    , "n",    $s );    // ? => n
    $s    = preg_replace( '@\x{014b}@u'    , "n",    $s );    // ? => n
    $s    = preg_replace( '@\x{00d8}@u'    , "O",    $s );    // Ø => O
    $s    = preg_replace( '@\x{00f8}@u'    , "o",    $s );    // ø => o
    $s    = preg_replace( '@\x{017f}@u'    , "s",    $s );    // ? => s
    $s    = preg_replace( '@\x{00de}@u'    , "T",    $s );    // Þ => T
    $s    = preg_replace( '@\x{0166}@u'    , "T",    $s );    // T => T
    $s    = preg_replace( '@\x{00fe}@u'    , "t",    $s );    // þ => t
    $s    = preg_replace( '@\x{0167}@u'    , "t",    $s );    // t => t

    // remove all non-ASCii characters
    $s    = preg_replace( '@[^\0-\x80]@u'    , "",    $s );


    // possible errors in UTF8-regular-expressions
    if (empty($s))
        return $original_string;
    else
        return $s;
}
?>

The above function is mainly based on the following article: http://ahinea.com/en/tech/accented-translate.html

Sontich answered 12/5, 2010 at 11:17 Comment(1)
When i generated xml with this, an empty xml file generates for me. How to use this thing properlyPolyphone
S
1

use PEAR I18N_UnicodeNormalizer-1.0.0

include('…');

echo preg_replace(
 '/(\P{L})/ui', // replace all except members of Unicode class "letters", case insensitive
 '', // with nothing
 I18N_UnicodeNormalizer::toNFKD('ÅÉÏÔÙåéïôù') // ù → u + `
);

→ AEIOUaeiou

Shupe answered 18/10, 2010 at 22:34 Comment(0)
L
1

If none of the other solutions are working right for you, here's what worked for me:

<?php

$string = "áéíóúÁ—whatever";

// create an array of the hex codes of the characters you want to replace (formatted as shown) and whatever you want to replace them with.
$characters = array(
  "[\xF3]" => "&ocacute;", //ó
  "[\xFC]" => "&uuml;", //ü
  "[\xF1]" => "&ntilde;", //ñ
  "[\xEB]" => "&euml;", //ë
  "[\xE9]" => "&eacute;", //é
  "[\xBD]" => "&frac12;", //½
);
// note that you must use a two-digit hex code for whatever reason.
// So, for example, although the hex code for an em dash is 2014, you have to use 97 instead. ("[\x97]" => "&mdash;")

// separate the key->value array into two separate arrays. Or just make two arrays from the beginning, but it's easier to read this way.
foreach ($characters as $hex => $html) {
  $replaceThis[] = $hex;
  $replaceWith[] = $html;
}

$string = preg_replace($replaceThis, $replaceWith, $string);

?>

It may not be the most elegant solution, but it works and requires no knowledge of regular expressions.

Lunseth answered 4/7, 2011 at 13:1 Comment(0)
J
0

People often use str_replace or strtr and a big list of character to convert "from" and "to" -- even if that doesn't look quite pretty...

Another solution, I suppose, might be using something like iconv with the option //TRANSLIT -- but doesn't always work, from what I remember...

Also, if you are using PHP 5.3, the new Normalizer class might be interesting ;-)

Jodoin answered 11/12, 2009 at 21:10 Comment(1)
Pascal, please check my comment on WaffleMatt answer.Preponderate

© 2022 - 2024 — McMap. All rights reserved.