Regex to ignore accents? PHP

5

16

Is there any way to make a regex that ignores accents?

For example:

preg_replace("/$word/i", "<b>$word</b>", $str);

The "i" modifier makes the regex case-insensitive, but is there any way to also match, for example,
java with Jávã?

I tried making a copy of $str, stripping the accents from the copy, and finding the index of every occurrence in it. But the indexes in the two strings turn out to be different, even though the only change is the removed accents.

(I did some research, but all I could find is how to remove accents from a string.)

Decastere answered 7/5, 2012 at 5:44 Comment(1)
Nice first question, by the way :)Rabi
7

I don't think there is such a way. It would be locale-dependent, and you would probably want the "/u" modifier first to enable UTF-8 in pattern strings.

I would probably do something like this.

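function prepare($pattern)
{
   // Each base letter becomes a character class that also contains its accented forms.
   $replacements = array("a" => "[aáàäâã]",
                         "e" => "[eéèëê]" /* ... extend with the other letters */);
   // Escape regex metacharacters first; strtr() never re-scans text it has already replaced.
   return strtr(preg_quote($pattern, "/"), $replacements);
}

preg_replace("/(" . prepare($word) . ")/ui", "<b>\\1</b>", $str);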

In your case the indexes were different because, unless you used the mbstring functions, you were probably working with UTF-8, which uses more than one byte for each accented character.
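To see why the indexes drift, here is a minimal sketch comparing byte offsets in an accented UTF-8 string and in its unaccented copy. The sample strings are made up for illustration, and the accented letter is written as a \u{...} escape (PHP 7+) so the bytes do not depend on how the file is saved.

```php
<?php
// Hypothetical sample strings; "\u{00E9}" is "é" encoded as UTF-8 (two bytes).
$accented = "caf\u{00E9} java";
$plain    = "cafe java";

// strlen() and strpos() count bytes, not characters.
$accentedLength = strlen($accented);          // 10 bytes for 9 characters
$plainLength    = strlen($plain);             // 9 bytes for 9 characters

$posInAccented  = strpos($accented, "java");  // 6: "é" occupies bytes 3 and 4
$posInPlain     = strpos($plain, "java");     // 5

// Counting characters needs a UTF-8 aware tool, for example a /u regex.
$accentedChars  = preg_match_all('/./us', $accented);  // 9
```

So an offset found in the stripped copy points one byte too early in the original for every two-byte character that precedes it.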

Rabi answered 7/5, 2012 at 5:52 Comment(3)
Just another question: is there any way to replace with what it found? If java matches Jávã, I want to put Jávã between <b> tags, not java, which is the search term.Decastere
I have implemented this solution in a highlight function. You can see it here: #27932760Eyebright
Is there any way we can apply this solution for all diacritics?Fairing
2

Java and Jávã are different words, and regex has no native support for ignoring accents. You can, however, list in your regex every accented and unaccented variant of the word that you want to replace.

For example: preg_replace("/java|Jávã|jáva|javã/i", "<b>$word</b>", $str);
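As a rough sketch of that alternation idea, the replacement can use $0 (the whole match) so that the spelling actually found in the text is what ends up inside the <b> tags. The sample sentence is invented, and the u modifier is an addition here: it assumes the pattern and the subject are both UTF-8, so that i also folds case on accented letters.

```php
<?php
// Hypothetical input; the variants in the pattern are the ones listed above.
$str = "Learn Jávã, JAVA or javã today.";

// i = case-insensitive, u = treat pattern and subject as UTF-8.
// $0 in the replacement is the text that actually matched.
$highlighted = preg_replace('/java|jávã|jáva|javã/iu', '<b>$0</b>', $str);

echo $highlighted, "\n";
```

The obvious drawback is that every variant has to be spelled out by hand, so this only scales to a handful of known words.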

Good luck!

Rayraya answered 7/5, 2012 at 5:50 Comment(0)
1

Regex isn't the tool for you here.

The answer you're looking for is the strtr() function.

This function replaces specified characters in a string, and is exactly what you're looking for.

In your example, Jávã, you could use a strtr() call like this:

$replacements = array('á'=>'a', 'ã'=>'a');
$output = strtr("Jávã",$replacements);

$output will now contain Java.

Of course, you'll need a bigger $replacements array to deal with all the characters you want to work with. See the strtr() manual page for some examples of how people are using it.

Note that there isn't a simple blanket list of characters, because firstly it would be huge, and secondly, the same starting character may need to be translated differently in different contexts or languages.
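If the goal is still to highlight the original, accented text, one possible sketch is to apply the same kind of mapping character by character, so that positions in the folded copy line up with positions in the original. Everything named here (ACCENT_MAP, utf8Chars, foldChar, highlightIgnoringAccents) is hypothetical, the map is deliberately tiny, and the code assumes valid UTF-8 input and PHP 7 or later.

```php
<?php
// Illustrative, deliberately small mapping; a real one needs many more entries.
const ACCENT_MAP = [
    'á' => 'a', 'à' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a',
    'Á' => 'a', 'À' => 'a', 'Â' => 'a', 'Ã' => 'a', 'Ä' => 'a',
    'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
];

// Split a UTF-8 string into an array of characters (not bytes).
function utf8Chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Fold one character: drop the accent if it is in the map, then lowercase ASCII.
function foldChar(string $ch): string
{
    return strtolower(ACCENT_MAP[$ch] ?? $ch);
}

// Wrap every accent-insensitive occurrence of $word in <b> tags.
function highlightIgnoringAccents(string $text, string $word): string
{
    $chars  = utf8Chars($text);
    $folded = array_map('foldChar', $chars);
    $needle = array_map('foldChar', utf8Chars($word));
    $n      = count($needle);
    $total  = count($chars);
    $out    = '';

    for ($i = 0; $i < $total; ) {
        if ($n > 0 && array_slice($folded, $i, $n) === $needle) {
            // Copy the ORIGINAL characters, accents included, into the output.
            $out .= '<b>' . implode('', array_slice($chars, $i, $n)) . '</b>';
            $i += $n;
        } else {
            $out .= $chars[$i];
            $i++;
        }
    }
    return $out;
}

echo highlightIgnoringAccents("I like Jávã and java.", "java"), "\n";
```

Because the comparison works on arrays of characters rather than bytes, the multi-byte offset problem never comes up; the price is a simple scan that is slower than a native regex on long texts.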

Hope that helps.

Seldon answered 7/5, 2012 at 6:5 Comment(0)
1
<?php

if (!function_exists('htmlspecialchars_decode')) {
    function htmlspecialchars_decode($text) {
        return str_replace(array('&lt;','&gt;','&quot;','&amp;'),array('<','>','"','&'),$text);
    }
}

function removeMarkings($text) 
{
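    // The caller passes Latin-1 (see utf8_decode() below), so name that charset explicitly;
    // the default has been UTF-8 since PHP 5.4 and would not handle these bytes.
    $text=htmlentities($text, ENT_COMPAT, 'ISO-8859-1');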
    // components (key+value = entity name, replace with key)
    $table1=array(
        'a'=>'grave|acute|circ|tilde|uml|ring',
        'ae'=>'lig',
        'c'=>'cedil',
        'e'=>'grave|acute|circ|uml',
        'i'=>'grave|acute|circ|uml',
        'n'=>'tilde',
        'o'=>'grave|acute|circ|tilde|uml|slash',
        's'=>'zlig', // maybe szlig=>ss would be more accurate?
        'u'=>'grave|acute|circ|uml',
        'y'=>'acute'
    );

    // direct (key = entity, replace with value)
    $table2=array(
        '&ETH;'=>'D',   // not sure about these character replacements
        '&eth;'=>'d',   // is an ð pronounced like a 'd'?
        '&THORN;'=>'B', // is a þ pronounced like a 'b'?
        '&thorn;'=>'b'  // don't think so, but the symbols looked like a d,b so...
    );

    foreach ($table1 as $k=>$v) $text=preg_replace("/&($k)($v);/i",'\1',$text);
    $text=str_replace(array_keys($table2),$table2,$text);    
    return htmlspecialchars_decode($text);
}

$text="Here are two words, one written normally and one with accents, java and jává; I searched with java and it found both occurrences (highlighted in this sentence): java and jává<br/>";
$find="java"; // The word to highlight; this search term should highlight both java and jává
$text=utf8_decode($text);
$find=removeMarkings(utf8_decode($find)); $len=strlen($find);
preg_match_all('/\b'.preg_quote($find).'\b/i', removeMarkings($text), $matches, PREG_OFFSET_CAPTURE);
$start=0; $newtext="";
foreach ($matches[0] as $m) {
    $pos=$m[1];
    $newtext.=substr($text,$start,$pos-$start);
    $newtext.="<b>".substr($text,$pos,$len)."</b>";
    $start=$pos+$len;
}
$newtext.=substr($text,$start);
echo "<blockquote>",$newtext,"</blockquote>";

?>

I think something like this will help you. I got it from a forum, so just take a look.
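For a shorter view of the entity idea used in removeMarkings() above, here is a minimal sketch that works directly on UTF-8. It is a variation rather than the forum code: the helper name stripAccentsViaEntities is made up, the explicit 'UTF-8' charset is an assumption about the input, and only the accent suffixes from $table1 are collapsed, so ordinary entities such as &amp; are left for html_entity_decode() to restore.

```php
<?php
// Hypothetical helper name; assumes $text is valid UTF-8.
function stripAccentsViaEntities(string $text): string
{
    // "á" becomes "&aacute;", "ã" becomes "&atilde;", "&" becomes "&amp;", ...
    $encoded = htmlentities($text, ENT_NOQUOTES, 'UTF-8');

    // Keep only the base letter of accent entities such as &eacute; or &Ntilde;.
    $folded = preg_replace(
        '/&([a-z])(?:grave|acute|circ|tilde|uml|ring|cedil|slash);/i',
        '$1',
        $encoded
    );

    // Turn every remaining entity (&amp;, &lt;, ...) back into its character.
    return html_entity_decode($folded, ENT_NOQUOTES, 'UTF-8');
}

echo stripAccentsViaEntities("Jávã & café"), "\n";
```

Each accented letter is replaced by exactly one base letter, so the result has the same number of characters as the input, although not the same number of bytes.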

Toney answered 7/5, 2012 at 6:22 Comment(2)
Check this also: forums.devshed.com/php-development-5/…Toney
That's an interesting approach. Indeed, $text=preg_replace('/&(.)[^;]+;/i','$1',htmlentities($text)); already works fine in 99% of cases.Goldstone
0

Set an appropriate locale (such as fr_FR, for example) and use the strcoll function to compare a string ignoring accents.

Caskey answered 7/5, 2012 at 6:26 Comment(1)
But how does that answer OP's question? He wants to match the string in a regexp.Rabi
