Javascript string.prototype.contains() with locale

Asked 17/9, 2016 at 15:4 Answered 27/5, 2024 at 22:15

Is it possible to check if a string contains a substring with locale support?

'Ábc'.contains('A') should be true.

Javascript now has the string.prototype.localeCompare() for string comparison with locale support but I cannot see the localeContains() counterpart.

Quartziferous answered 17/9, 2016 at 15:4 Comment(3)

But they are not the same character, so why should any standard JS tool evaluate them as equal? You might consider setting up a hash table. This can be a nice start for you. – Samford 17/9, 2016 at 15:10

It's true that they are not equal, but it is a needed function for any decent string filtering. Nobody wants to type an 'e' in a table filter and not get the 'José' value listed. Once locale support is available and we know that 'a', 'A' and 'Á' can be sorted together, it's not a big step to give the option to consider them the same in a substring search. The link is fine, but it is a lost battle to manually consider all chars in any language. – Quartziferous 17/9, 2016 at 15:18

I guess you either have to do it manually or use a library like Javascript Unicode Library which can assist you to get Latin equivalents of your strings with accented characters as seen here – Samford 17/9, 2016 at 15:31

You can do this:

String.prototype.contains = function contains(charToCheck) {
  return this.split('').some(char => char.localeCompare(charToCheck, 'en', {sensitivity: 'base'}) === 0)
}

console.log('Ábc'.contains('A')) // true
console.log('Ábc'.contains('B')) // true
console.log('Ábc'.contains('b')) //true
console.log('Ábc'.contains('u')) //false
console.log('coté'.contains('e')) //true

Documentation on localeCompare. Sensitivity "base" means:

"base": Only strings that differ in base letters compare as unequal. Examples: a ≠ b, a = á, a = A.
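The same check can reuse one Intl.Collator instead of calling localeCompare once per character. The following is a minimal sketch, not part of the original answer; the helper name containsChar is hypothetical, and like the code above it only handles a single-character search term.

```javascript
// One collator shared by every comparison; 'base' sensitivity ignores case and accents.
const baseCollator = new Intl.Collator('en', { sensitivity: 'base' });

// Hypothetical helper: true if any character of `haystack` equals `char`
// at base strength.
function containsChar(haystack, char) {
  // Array.from iterates by code point, so surrogate pairs are not split.
  return Array.from(haystack).some(c => baseCollator.compare(c, char) === 0);
}

console.log(containsChar('Ábc', 'A'));  // true
console.log(containsChar('Ábc', 'u'));  // false
console.log(containsChar('coté', 'e')); // true
```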

Stickybeak answered 8/11, 2019 at 21:46 Comment(0)

There is a faster alternative to contains() with locale check on string

It seems that stripping diacritics and then comparing the strings natively is much faster: on my machine almost 10 times faster than @chickens' or @dag0310's solution (check yours here). It returns true for an empty-string check to be consistent with String.includes.

String.prototype.localeContains = function(sub) {
  if(sub==="") return true;
  if(!sub || !this.length) return false;
  sub = ""+sub;
  if(sub.length>this.length) return false;
  let ascii = s => s.normalize("NFKD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
  return ascii(this).includes(ascii(sub));
}

var str = "142 Rozmočených Kříd";
console.log(str.localeContains("kŘi"));
console.log(str.localeContains(42));
console.log(str.localeContains(""));
console.log(str.localeContains(false));

NFKD

The NFKD normalization form decomposes precomposed characters into their base characters and combining marks (and applies compatibility mappings); the combining marks are subsequently removed by the replace call.
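To see what the decomposition does, here is a small illustration. It is not from the original answer, and stripMarks is a hypothetical name for the same normalize-and-replace step.

```javascript
// "é" as one precomposed code point (U+00E9).
const precomposed = '\u00e9';

// NFD/NFKD split it into the base letter "e" plus U+0301 COMBINING ACUTE ACCENT.
const decomposed = precomposed.normalize('NFKD');
console.log(precomposed.length, decomposed.length); // 1 2

// Removing the combining marks (U+0300 to U+036F) leaves only the base letters.
const stripMarks = s => s.normalize('NFKD').replace(/[\u0300-\u036f]/g, '');
console.log(stripMarks('Kříd')); // "Krid"

// NFKD also applies compatibility mappings: the "ﬃ" ligature (U+FB03) becomes "ffi",
// while the canonical form NFD leaves it untouched.
console.log('\ufb03'.normalize('NFKD'));            // "ffi"
console.log('\ufb03'.normalize('NFD') === '\ufb03'); // true
```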

Important note

This doesn't work for some cases, like "Straße".localeContains("SS"), depending on whether you want to consider ß a substitute for SS. To find out more, read about the canonical and compatibility transformations described on MDN, where other forms besides NFKD are also mentioned.

Credits to @LukasKalbertodt in the comments for mentioning the edge cases.
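One of the comments below suggests that converting to upper case and then to lower case should make the ß case work. A hedged sketch of that idea follows; looseFold and containsLoosely are hypothetical names, and the result still ignores the user's locale entirely.

```javascript
// Hypothetical folding helper: decompose, drop combining marks, then fold case
// by going through upper case first, because "ß".toUpperCase() is "SS".
const looseFold = s =>
  String(s)
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '')
    .toUpperCase()
    .toLowerCase();

// Hypothetical substring check built on the folded forms.
const containsLoosely = (haystack, needle) =>
  looseFold(haystack).includes(looseFold(needle));

console.log(containsLoosely('Straße', 'SS'));                // true
console.log(containsLoosely('142 Rozmočených Kříd', 'kŘi')); // true
console.log(containsLoosely('Straße', 'x'));                 // false
```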

Lucilla answered 18/10, 2021 at 23:28 Comment(13)

I don't know what kind of black magic are you applying, but it works like a charm. Thanks! – Bolanger 24/2, 2022 at 16:24

what is the sub = ""+sub; part good for? – Homotaxis 8/4, 2022 at 15:42

@Homotaxis it converts sub to string, so it contains length and normalize methods called below. Without it, localeContains would fail for non-string argument like 42 – Cann 8/4, 2022 at 23:50

Quick explanation of "black magic": normalize("NFD") decomposes all characters into base character and combining marks. The replace call removes all combining marks. Finally, it is lowercased. The name ascii is not quite correct, as the output is not necessarily ASCII. – Cavie 19/9, 2023 at 9:18

This unfortunately doesn't have exactly the same behavior as localeCompare. For example: "ß".localeCompare("SS", "de", { sensitivity: "base" }) returns 0, but this function here would return false. Same for ﬃ and ffi. Switching to "NFKD" normalization fixes the latter case at least. Converting and testing uppercased and lowercased versions should make the ß case work as well. But yeah, this is very tricky to get right! – Cavie 19/9, 2023 at 9:29

@LukasKalbertodt good point. Then it would be nice to have full list of canonical and compatibility transformations. If you happen to find some resource with the complete list, please share. I assumed only stripping diacritics to compare. This can become very funky when dealing with traditional/simplified Chinese. – Cann 19/9, 2023 at 13:29

Nope, I unfortunately don't have such a list. ß and ﬃ are just some of the standard "assumption breaking" unicode characters I test with in situations like these ^_^ – Cavie 19/9, 2023 at 14:45

This seems (on 1st read) like a great answer, but please move some explanation from the comments in answer and make it more complete. Gr8 work man. – Allege 26/9, 2023 at 9:32

@DanteTheSmith I agree. Updated. – Cann 26/9, 2023 at 10:5

I learned more about this topic and unfortunately this answer has more cases in which it has a different result than localeCompare. It comes down to the fact that it doesn't use the user's locale at all. "mäh".localeContains("a") should be true in German, but false in Swedish. "o" in "ø" true in English, false in Norwegian and Danish. "u" in "ü" true in German, false in Danish. So it should probably be called containsLoosely or something like that instead of localeContains. (Not to say this code isn't useful! It's just important to know its shortcomings.) – Cavie 26/9, 2023 at 10:33

@LukasKalbertodt Then it seems I do not understand the locale well. Can you provide some links where you learned more about this topic? – Cann 26/9, 2023 at 14:56

The introduction of the Unicode TR10 has some very nice examples. More examples throughout the document. You can put my examples into localeCompare (which OP referenced) and you will see the result I posted. E.g. "o".localeCompare("ø", "de", { sensitivity: "base" }) returns 0, "o".localeCompare("ø", "da", { sensitivity: "base" }) returns -1. – Cavie 27/9, 2023 at 15:5

Poor performance is caused by multiple calls to localeCompare because each call instantiates its own Collator. A more performant solution would be to create an instance of Collator and call its compare method. My rather old locale-index-of package does this among other features. – Diphyodont 23/5, 2024 at 22:15

If you are looking for more than one character here is a not very efficient but working option:

const localeContains = (a,b) => !!a.split('').filter((v,i)=>a.slice(i,b.length).localeCompare(b, "en", { sensitivity: 'base' })===0).length
const a = "RESERVE ME";
const b = "réservé";

console.log(localeContains(a,b));

Sporulate answered 3/7, 2020 at 18:40 Comment(0)

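chickens' answer does not work if the searched string is not at the beginning of the main string: a.slice(i, b.length) always ends at index b.length, so only the window starting at index 0 has the length of the searched string.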

Use this package instead: https://www.npmjs.com/package/locale-includes

localeIncludes('RESERVE ME', 'éservé', {usage: 'search', sensitivity: 'base'});
// true

To make it even nicer to use as a string prototype function:

String.prototype.localeIncludes = function(str) {
  return localeIncludes(this, str, {usage: 'search', sensitivity: 'base'});
};

'RESERVE ME'.localeIncludes('éservé');
// true

Tortuga answered 23/9, 2021 at 9:29 Comment(1)

You save a lot of my time. Thank you so much, sir. – Louise 17/1, 2023 at 5:20

You may either normalize the strings and use string.includes

// inspired by https://mcmap.net/q/667750/-filtering-a-list-of-strings-based-on-user-locale
/**
 * Returns true if searchString appears as a substring of the result of converting first argument
 * to a String, at one or more positions that are greater than or equal to position,
 * if compared in the current or specified locale; otherwise, returns false.
 * Options is considered to have { usage: 'search', sensitivity: 'base' } defaults
 * @param {string} string string to search in
 * @param {string} searchString search string
 * @param {string|string[]=} locales A locale string or array of locale strings that contain one or more language or locale tags. If you include more than one locale string, list them in descending order of priority so that the first entry is the preferred locale. If you omit this parameter, the default locale of the JavaScript runtime is used. This parameter must conform to BCP 47 standards; see the Intl.Collator object for details.
 * @param {Intl.CollatorOptions=} options An object that contains one or more properties that specify comparison options. see the Intl.Collator object for details.
 * @param {number=} position If position is undefined, 0 is assumed, so as to search all of the String.
 * @returns {boolean}
 */
function localeIncludes(string, searchString, locales, options, position = 0) {
  const optionsN = { usage: 'search', sensitivity: 'base', ...options ?? {} };
  const collator = new Intl.Collator(locales, optionsN);
  const { sensitivity, ignorePunctuation } = collator.resolvedOptions();
  function localeNormalize(string) {
    // `localeCompare` MUST `ToString` its arguments
    // We want to normalize our strings so `u'` does not include `u`
    let stringN = String(string).normalize('NFC');
    // If comparison is case-insensitive we want to normalize case
    if (sensitivity === 'base' || sensitivity === 'accent')
      stringN = stringN.toLocaleLowerCase(locales);
    // then we try to remove accents (you may cache letters in a Map to make it faster)
    return stringN.replaceAll(/./g, (letter) => {
      // first check if you can remove the character completely
      if (ignorePunctuation) {
        if (collator.compare(letter, '') === 0) return '';
      }
      let normalizedLetter = letter.normalize('NFD').replace(/[\u0300-\u036f]/gi, '');
      /*
       * // If you want you may add some custom normalizers (per-language)
       * const mapSv = new Map([ ['w', 'v'], ['ß', 'SS'] ])
       * if (lang === 'sv' && mapSv.has(letter)) return mapSv.get(letter);
       */
      return letter !== normalizedLetter && collator.compare(letter, normalizedLetter) === 0 ? normalizedLetter : letter;
    });
  }
  return localeNormalize(string).includes(localeNormalize(searchString));
}

or try to find a matching substring

/**
 * Returns true if searchString appears as a substring of the result of converting first argument
 * to a String, at one or more positions that are greater than or equal to position,
 * if compared in the current or specified locale; otherwise, returns false.
 * Collators with `numeric` and `ignorePunctuation` options are not supported.
 * @param {string} string string to search in
 * @param {string} searchString search string
 * @param {string|string[]=} locales A locale string or array of locale strings that contain one or more language or locale tags. If you include more than one locale string, list them in descending order of priority so that the first entry is the preferred locale. If you omit this parameter, the default locale of the JavaScript runtime is used. This parameter must conform to BCP 47 standards; see the Intl.Collator object for details.
 * @param {Intl.CollatorOptions=} options An object that contains one or more properties that specify comparison options. see the Intl.Collator object for details.
 * @param {number=} position If position is undefined, 0 is assumed, so as to search all of the String.
 * @returns {boolean}
 */
function localeIncludes(string, searchString, locales, options, position = 0) {
  // `localeCompare` uses `Intl.Collator.compare` under the hood
  // `localeCompare` casts `ToString` over both arguments
  // We don't want "á" to contain "a", so we should normalize the strings first.
  // `Intl.Collator` uses Canonical Equivalence according to the Unicode Standard, so normalization won't change the order
  const stringN = String(string).normalize();
  const searchStringN = String(searchString).normalize();
  const collator = new Intl.Collator(locales, options);
  /*
   * // if you can have strings of different length (like with `ignorePunctuation`), you'll have to check every substring
   * for (let i = 0; i < string.length; i++) {
   *   for (let j = i; j < string.length; j++) {
   *     // WARNING, THIS IS $ O(n^2) $
   *     let substring = string.substring(i, i + searchString.length);
   *     if (collator.compare(substring, searchString) === 0) return i;
   *   }
   * }
   */
  for (let i = position; i <= stringN.length - searchStringN.length; i++) {
    // non-numeric non-ignorePunctuation `collator` expected
    const substring = stringN.substring(i, i + searchStringN.length);
    if (collator.compare(substring, searchStringN) === 0)
      return true;
  }
  return false;
}

Or you can use a hack based on window.find, which works exactly as if you searched the page with Ctrl-F

function iframeIncludes(string, searchString, locale) {
  const iframe = document.createElement('iframe');
  iframe.style = 'position: fixed; top: 0; left: 0;';
  document.body.append(iframe);
  const iframeDoc = iframe.contentDocument;
  iframeDoc.open();
  // you MUST use <pre> otherwise it doesn't work
  iframeDoc.write(`
    <html lang="${locale}">
      <body>
        <pre></pre>
      </body>
    </html>
  `);
  iframeDoc.close();
  const pre = iframeDoc.querySelector('pre');
  pre.innerText = string;
  const result = iframe.contentWindow.find(searchString);
  iframe.remove();
  return result;
}

Saltish answered 25/9, 2023 at 20:57 Comment(15)

@LukasKalbertodt ping. Not really different from the lib in the other answer, but I guess faster, and has explanation. – Saltish 25/9, 2023 at 20:59

Thanks for the answer. This unfortunately has the same problem as the library: it uses the naive string search algorithm which has quite a suboptimal asymptotic runtime. Sure, in many situations that doesn't matter, but this certainly isn't the best one can do :/ Regardless of that: your answer could use an introduction sentence or two, just saying "this just uses localeCompare at every possible position" or sth like that. – Cavie 26/9, 2023 at 6:45

@LukasKalbertodt Thanks for pointing that out, I did find the correct way now – Saltish 26/9, 2023 at 8:20

@LukasKalbertodt there is a { usage: "search" } collator option, but I didn't find any mention of someone using that (and failed to make a working script) /tableflip – Saltish 26/9, 2023 at 8:58

Yeah, that option is interesting. It sounds exactly like what we want, but there is not really a method to properly use it with. We only have compare which checks for full string matches. So it seems like one really has to build a substring search on top of that. Which ... huff. The other approach I see is understanding the relevant Unicode documents in more detail and implementing everything manually. I am researching right now as well. – Cavie 26/9, 2023 at 9:3

@LukasKalbertodt you may try #47330096 (lang-normalizing characters and using string.includes), but idk how'll that work with multichars like aa and ff. Probably won't. – Saltish 26/9, 2023 at 9:6

@LukasKalbertodt added localeNormalize variant. Could you make some test cases? I never used languages with accents – Saltish 26/9, 2023 at 9:30

@LukasKalbertodt added window.find method – Saltish 26/9, 2023 at 10:25

I collected a bunch of test cases. But these are by far not exhaustive. gist.github.com/LukasKalbertodt/… – Cavie 26/9, 2023 at 10:29

Ha, didn't know window.find. While probably not relevant for production code, points for thinking outside the box. I find your first solution very interesting as it only replaces accented characters if the collator says they are equal. That might solve the tricky a ä o ø u ü cases. I would recommend putting all explaining comments into your first two code boxes directly and then removing the last section "explanation". – Cavie 26/9, 2023 at 10:39

@LukasKalbertodt you should RFC Intl.Collator.find srsly. ... I did look it up, it's github.com/tc39/ecma402/issues/506 – Saltish 26/9, 2023 at 10:40

Oh thanks, that's a super helpful link! Yes I agree this should be a browser API. Nearly impossible to implement it yourself 100% correctly. – Cavie 26/9, 2023 at 10:41

@LukasKalbertodt added "replace punctuation with empty" normalizer case, "custom normalizer map", removed explanation. Now should work with punctuation – Saltish 26/9, 2023 at 10:59

@LukasKalbertodt does this answer your question btw? – Saltish 26/9, 2023 at 11:0

Thanks for the updates and the research you put into this. While your answer doesn't provide a perfect solution (I don't think there currently is; we have to wait for an RFC), it is helpful and provides the best implementation of all answers (I think). Will award the bounty to your answer. Thank you. – Cavie 26/9, 2023 at 11:4

Internally localeCompare instantiates Intl.Collator for every call, and since the search for a substring will attempt comparison multiple times, it’s best to instantiate it once by ourselves, and then use its compare method.

Another gotcha is hidden in the iteration along the string: unfortunately the naive methods will split some complex graphemes in half. To avoid this, we should use Intl.Segmenter or for-of string in its absence.
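A rough sketch of that combination, assuming a runtime that provides Intl.Segmenter: both strings are split into grapheme clusters, and one shared collator compares windows with the same number of graphemes. The name localeIncludesByGrapheme is hypothetical, and this is not the implementation of any library mentioned here. Because the windows have equal grapheme counts, matches such as ß against SS are still missed.

```javascript
// Hypothetical helper, not taken from any library.
function localeIncludesByGrapheme(haystack, needle, locales, options) {
  // Instantiate the collator once instead of once per comparison.
  const collator = new Intl.Collator(locales, { usage: 'search', sensitivity: 'base', ...options });
  const segmenter = new Intl.Segmenter(locales, { granularity: 'grapheme' });
  // Split into grapheme clusters so combining sequences are never cut in half.
  const graphemes = s => Array.from(segmenter.segment(s), part => part.segment);

  const needleStr = String(needle);
  const hay = graphemes(String(haystack));
  const needleLength = graphemes(needleStr).length;
  if (needleLength === 0) return true; // consistent with String.prototype.includes

  // Slide a window of `needleLength` graphemes along the haystack.
  for (let i = 0; i + needleLength <= hay.length; i++) {
    const candidate = hay.slice(i, i + needleLength).join('');
    if (collator.compare(candidate, needleStr) === 0) return true;
  }
  return false;
}

console.log(localeIncludesByGrapheme('RESERVE ME', 'éservé', 'en')); // true
console.log(localeIncludesByGrapheme('RESERVE ME', 'xyz', 'en'));    // false
// A decomposed "e" + combining acute stays one grapheme, so it is compared as a unit.
console.log(localeIncludesByGrapheme('cafe\u0301 noir', 'CAFE', 'en')); // true
```

This is still the naive search criticised in the comments above (one comparison per starting position), so it is a starting point rather than a complete answer.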

If you try to "normalize" strings in order to use a non-locale-aware contains, you will enter a land of a thousand traps, breaking stuff in subtle and interesting ways which you likely don’t want to learn about. Curious? Others have already linked to this issue, which deals exactly with adding support for localeContains; its comments are a trove of crunchy edge cases. Some more details are in the ECMA402 meeting notes on this topic.

My takeaway from reading experts’ opinions in the links above is that even with Intl.Segmenter and Intl.Collator some edge cases remain. Maybe they will be addressed via the potential Intl.Search object. Until then I’ve combined the best approaches in the locale-index-of library.

Diphyodont answered 27/5, 2024 at 22:15 Comment(0)
