Using JavaScript to perform text matches with/without accented characters
I am using an AJAX-based lookup for names that a user searches in a text box.

I am making the assumption that all names in the database will be transliterated to European alphabets (i.e. no Cyrillic, Japanese, Chinese). However, the names will still contain accented characters, such as ç, ê and even č and ć.

A simple search like "Micic" will not match "Mičić" though - and the user expectation is that it will.

The AJAX lookup uses regular expressions to determine a match. I have modified the regular expression comparison using this function in an attempt to match more accented characters. However, it's a little clumsy since it doesn't take into account all characters.

function makeComp (input)
{
    input = input.toLowerCase ();
    var output = '';
    for (var i = 0; i < input.length; i ++)
    {
        if (input.charAt (i) == 'a')
            output = output + '[aàáâãäåæ]';
        else if (input.charAt (i) == 'c')
            output = output + '[cç]';
        else if (input.charAt (i) == 'e')
            output = output + '[eèéêëæ]';
        else if (input.charAt (i) == 'i')
            output = output + '[iìíîï]';
        else if (input.charAt (i) == 'n')
            output = output + '[nñ]';
        else if (input.charAt (i) == 'o')
            output = output + '[oòóôõöø]';
        else if (input.charAt (i) == 's')
            output = output + '[sß]';
        else if (input.charAt (i) == 'u')
            output = output + '[uùúûü]';
        else if (input.charAt (i) == 'y')
            output = output + '[yÿ]';
        else
            output = output + input.charAt (i);
    }
    return output;
}

Apart from a substitution function like this, is there a better way? Perhaps to "deaccent" the string being compared?

Cocoon answered 18/4, 2011 at 9:2 Comment(1)
Thanks for the code. I used your function to replace the accented vowels in the input text and it works fine. Rimester
M
144

There is a way to "deaccent" the string being compared without a substitution function that lists every accent you want to remove.

Here is the easiest solution I can think of to remove accents (and other diacritics) from a string.

See it in action:

var string = 'Ça été Mičić. ÀÉÏÓÛ';
console.log(string);

var string_norm = string.normalize('NFD').replace(/\p{Diacritic}/gu, ''); // Old method: .replace(/[\u0300-\u036f]/g, "");
console.log(string_norm);
  • .normalize(…) decomposes the letters and diacritics.
  • .replace(…) removes all the diacritics.
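To tie this back to the lookup in the question, one option is to fold both the query and the candidate name before comparing them. The following is a minimal sketch, not part of the answer above: the helper names `foldAccents` and `matchesName` are hypothetical, and a plain substring match is assumed. It uses the older `\u0300-\u036f` range so it does not depend on Unicode property escapes.

```javascript
// Hypothetical helper: decompose, strip combining marks, then lower-case.
function foldAccents(text) {
  return text
    .normalize('NFD')                 // e.g. 'č' becomes 'c' + U+030C
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase();
}

// Hypothetical matcher: true when the folded query occurs in the folded name.
function matchesName(name, query) {
  return foldAccents(name).indexOf(foldAccents(query)) !== -1;
}

console.log(matchesName('Mičić', 'Micic'));         // true
console.log(matchesName('Françoise', 'francoise')); // true
console.log(matchesName('Mičić', 'Marche'));        // false
```

Folding the stored names once (for example when the list is loaded) rather than on every keystroke would probably keep an autocomplete responsive.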
Myatt answered 16/8, 2018 at 9:40 Comment(2)
This is certainly the good, modern way to do it. Keep in mind that there is no support for IE or Safari < 10 for this, so you'll need to polyfill it. It's not a trivial polyfill either (though not enormous), so if you have a build that's size-sensitive and needs to run on old browsers it may not be the best option. That concern becomes less important every day, of course. Dapple
This will remove the accents, nice. Is there a similar approach to convert ł -> l, ß -> ss, æ -> ae, etc. or should it be done by coding individual replacements for each? Gravitation
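On the question in the last comment: letters such as ł, ø, đ, ß, æ and œ have no canonical decomposition, so `normalize('NFD')` leaves them untouched and they do need an explicit table. A rough sketch, assuming a small illustrative map (the names `SPECIAL_LETTERS` and `toAscii` are hypothetical, and the list is certainly not complete):

```javascript
// Letters that NFD does not split into base letter + combining mark.
// Illustrative only; extend the table for the languages you need.
var SPECIAL_LETTERS = {
  'ł': 'l', 'Ł': 'L',
  'ø': 'o', 'Ø': 'O',
  'đ': 'd', 'Đ': 'D',
  'ß': 'ss',
  'æ': 'ae', 'Æ': 'AE',
  'œ': 'oe', 'Œ': 'OE'
};
var SPECIAL_RE = /[łŁøØđĐßæÆœŒ]/g;

function toAscii(text) {
  return text
    .replace(SPECIAL_RE, function (ch) { return SPECIAL_LETTERS[ch]; })
    .normalize('NFD')                  // split the remaining accented letters
    .replace(/[\u0300-\u036f]/g, '');  // and drop their combining marks
}

console.log(toAscii('Łódź'));       // "Lodz"
console.log(toAscii('Straße'));     // "Strasse"
console.log(toAscii('Ærøskøbing')); // "AEroskobing"
```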
D
19

Came upon this old thread and thought I'd try my hand at writing a fast function. I'm relying on the ordering of the pipe-separated alternatives: each one is a capture group, so the callback that replace() invokes can tell which group matched from which argument is set. My goal was to lean on the native regex implementation behind JavaScript's replace() as much as possible, so that the heavy processing takes place in low-level, browser-optimized code instead of in expensive JavaScript char-by-char comparisons.

It's not scientific at all, but my old Huawei IDEOS Android phone is sluggish when I plug the other functions in this thread into my autocomplete, while this function zips along:

function accentFold(inStr) {
  return inStr.replace(
    /([àáâãäå])|([çčć])|([èéêë])|([ìíîï])|([ñ])|([òóôõöø])|([ß])|([ùúûü])|([ÿ])|([æ])/g, 
    function (str, a, c, e, i, n, o, s, u, y, ae) {
      if (a) return 'a';
      if (c) return 'c';
      if (e) return 'e';
      if (i) return 'i';
      if (n) return 'n';
      if (o) return 'o';
      if (s) return 's';
      if (u) return 'u';
      if (y) return 'y';
      if (ae) return 'ae';
    }
  );
}

If you're a jQuery dev, here's a handy example of using this function; you could use :icontains the same way you'd use :contains in a selector:

jQuery.expr[':'].icontains = function (obj, index, meta, stack) {
  var text = (obj.textContent || obj.innerText || jQuery(obj).text() || '').toLowerCase();
  return accentFold(text).indexOf(accentFold(meta[3].toLowerCase())) >= 0;
};
Dapple answered 8/4, 2013 at 18:23 Comment(0)
I
12

I searched and upvoted herostwist's answer, but kept searching, and here is a modern solution that is core to JavaScript: the String.prototype.localeCompare function.

var a = 'réservé'; // with accents, lowercase
var b = 'RESERVE'; // no accents, uppercase

console.log(a.localeCompare(b));
// expected output: 1 (any positive number; the exact value is implementation-dependent)
console.log(a.localeCompare(b, 'en', {sensitivity: 'base'}));
// expected output: 0

Note, however, that full support is still missing in some mobile browsers.

Until then, keep an eye on support across all platforms and environments.

Is that all?

No, we can go further right now and use the String.prototype.toLocaleLowerCase function.

var dotted = 'İstanbul';

console.log('EN-US: ' + dotted.toLocaleLowerCase('en-US'));
// expected output: "i̇stanbul" (an "i" followed by U+0307 COMBINING DOT ABOVE)

console.log('TR: ' + dotted.toLocaleLowerCase('tr'));
// expected output: "istanbul"

Thank You !

Introvert answered 25/9, 2018 at 4:58 Comment(2)
"àéçî".toLocaleLowerCase('en-US') will return "àéçî", so it's very limitedLase
I cannot seem to compare correctly with localeCompare() if I compare multiple Strings. For example, "Prince" and "Marche" are returning 1 even though I wanted "Marché". Swerve
P
7

There is no easier way to "deaccent" that I can think of, but your substitution could be streamlined a little more:

var makeComp = (function(){

    var accents = {
            a: 'àáâãäåæ',
            c: 'ç',
            e: 'èéêëæ',
            i: 'ìíîï',
            n: 'ñ',
            o: 'òóôõöø',
            s: 'ß',
            u: 'ùúûü',
            y: 'ÿ'
        },
        chars = /[aceinosuy]/g;

    return function makeComp(input) {
        return input.replace(chars, function(c){
            return '[' + c + accents[c] + ']';
        });
    };

}());
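Two things are worth noting if the generated pattern is fed straight into a RegExp: the table above has no entry for č or ć from the question, and regex metacharacters typed by the user are not escaped. The following is a hedged sketch of one way to handle both, not the answerer's code. The name `buildNamePattern` and the extended table are assumptions, and it only matches precomposed (NFC) text.

```javascript
// Extended table (assumption: adds 'čć' to the answer's list for 'c').
var ACCENTS = {
  a: 'àáâãäåæ', c: 'çčć', e: 'èéêëæ', i: 'ìíîï', n: 'ñ',
  o: 'òóôõöø', s: 'ß', u: 'ùúûü', y: 'ÿ'
};

// Hypothetical helper: turn user input into a case-insensitive RegExp
// in which each plain letter also matches its accented variants.
function buildNamePattern(input) {
  var source = input
    .toLowerCase()
    .replace(/[.*+?^${}()|[\]\\]/g, '\\$&')  // escape regex metacharacters
    .replace(/[aceinosuy]/g, function (ch) {
      return '[' + ch + ACCENTS[ch] + ']';
    });
  return new RegExp(source, 'i');
}

console.log(buildNamePattern('Micic').test('Mičić'));   // true
console.log(buildNamePattern('micic').test('Mimic'));   // false
console.log(buildNamePattern('o(1)').test('Ö(1) time')); // true
```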
Pep answered 18/4, 2011 at 9:11 Comment(0)
L
5

I think this is the neatest solution

var nIC = new Intl.Collator(undefined, {sensitivity: 'base'});
var cmp = nIC.compare.bind(nIC);

cmp(a, b) will return 0 if the two strings are the same, ignoring accents and case.

Alternatively, you can try localeCompare:

'être'.localeCompare('etre',undefined,{sensitivity: 'base'})
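Both calls compare whole strings, whereas the lookup in the question needs a "contains" test. One way to bridge that gap, sketched here under assumptions rather than taken from the answer, is to slide a window of the query's length over the name and compare each slice with the collator. The helper name `containsIgnoringAccents` is hypothetical, and the locale is pinned to 'en' only to make the example predictable.

```javascript
var collator = new Intl.Collator('en', { sensitivity: 'base' });

// Hypothetical helper: true if some slice of `name` with the same length
// as `query` compares equal when accents and case are ignored.
function containsIgnoringAccents(name, query) {
  var n = name.normalize('NFC');  // precomposed form: one code unit per accented letter here
  var q = query.normalize('NFC');
  for (var i = 0; i + q.length <= n.length; i++) {
    if (collator.compare(n.slice(i, i + q.length), q) === 0) {
      return true;
    }
  }
  return false;
}

console.log(containsIgnoringAccents('Dragan Mičić', 'micic')); // true
console.log(containsIgnoringAccents('Marché', 'MARCHE'));      // true
console.log(containsIgnoringAccents('Prince', 'Marche'));      // false
```

Because the window has a fixed length, this will not treat 'ß' as equal to 'ss', and it does one collator call per position, so it is likely slower than folding the strings once up front.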
Lachrymal answered 15/10, 2018 at 22:40 Comment(1)
My answer's 7 years old; this is (mostly) the right way to do it in 2020. I don't believe (going by MDN's examples) that you need to bind the compare method -- it should be created with the required context, since myNames.sort(nIC.compare) works just fine. Dapple
A
0

I made a Prototype Version of this:

String.prototype.strip = function() {
  var translate_re = /[öäüÖÄÜß ]/g;
  var translate = {
    "ä":"a", "ö":"o", "ü":"u",
    "Ä":"A", "Ö":"O", "Ü":"U",
    " ":"_", "ß":"ss"   // probably more to come
  };
  return this.replace(translate_re, function(match) {
    return translate[match];
  });
};

Use like:

var teststring = 'ä ö ü Ä Ö Ü ß';
teststring.strip();

This returns the string a_o_u_A_O_U_ss

Avidity answered 25/5, 2011 at 11:13 Comment(0)
