Javascript string.prototype.contains() with locale
Q

6

19

Is it possible to check whether a string contains a substring with locale support?

'Ábc'.contains('A') should be true.

JavaScript now has String.prototype.localeCompare() for locale-aware string comparison, but I cannot find a localeContains() counterpart.

Quartziferous answered 17/9, 2016 at 15:4 Comment(3)
But they are not the same character, so why should any standard JS tool evaluate them as equal? You might consider setting up a hash table. This can be a nice start for you.Samford
It's true that they are not equal, but this is a function you need for any decent string filtering. Nobody wants to type an 'e' into a table filter and not get the 'José' value listed. Once locale support is available and we know that 'a', 'A' and 'Á' can be sorted together, it's not a big step to offer the option of treating them as the same in a substring search. The link is fine, but manually handling all characters in every language is a lost battle.Quartziferous
I guess you either have to do it manually or use a library like Javascript Unicode Library which can assist you to get Latin equivalents of your strings with accented characters as seen hereSamford
S
7

You can do this:

String.prototype.contains = function contains(charToCheck) {
  return this.split('').some(char => char.localeCompare(charToCheck, 'en', {sensitivity: 'base'}) === 0)
}

console.log('Ábc'.contains('A')) // true
console.log('Ábc'.contains('B')) // true
console.log('Ábc'.contains('b')) //true
console.log('Ábc'.contains('u')) //false
console.log('coté'.contains('e')) //true

Documentation on localeCompare. Sensitivity base means:

"base": Only strings that differ in base letters compare as unequal. Examples: a ≠ b, a = á, a = A.
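
To see how "base" differs from the other sensitivity values, here is a small sketch that compares the same pairs under each option. The helper name sameUnder is made up for this illustration, and the 'en' locale simply mirrors the snippet above.

```javascript
// Hypothetical helper: true when x and y compare as equal under the given sensitivity.
const sameUnder = (sensitivity, x, y) =>
  x.localeCompare(y, 'en', { sensitivity }) === 0;

for (const sensitivity of ['base', 'accent', 'case', 'variant']) {
  console.log(sensitivity, {
    'a vs \u00e1': sameUnder(sensitivity, 'a', '\u00e1'), // "a" vs "á"
    'a vs A': sameUnder(sensitivity, 'a', 'A'),
    'a vs b': sameUnder(sensitivity, 'a', 'b'),
  });
}
```

Only "base" ignores both accents and case, which is why it is the option used for a loose "contains" check.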

Stickybeak answered 8/11, 2019 at 21:46 Comment(0)
L
6

There is a faster alternative to contains() with locale check on string

Stripping diacritics and then comparing the strings natively seems to be much faster: on my machine it is almost 10 times faster than @chickens' or @dag0310's solution; check yours here. It returns true for an empty search string, to be consistent with String.prototype.includes.

String.prototype.localeContains = function(sub) {
  if(sub==="") return true;
  if(!sub || !this.length) return false;
  sub = ""+sub;
  if(sub.length>this.length) return false;
  let ascii = s => s.normalize("NFKD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
  return ascii(this).includes(ascii(sub));
}

var str = "142 Rozmočených Kříd";
console.log(str.localeContains("kŘi"));
console.log(str.localeContains(42));
console.log(str.localeContains(""));
console.log(str.localeContains(false));

NFKD

NFKD normalization decomposes all precomposed characters into their base characters and combining marks, which the replace call then removes.
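
As a rough illustration of the decomposition step, the sketch below normalizes part of the sample string and inspects the lengths before and after. The variable names are only for this example; escapes are used so the precomposed code points are unambiguous.

```javascript
const precomposed = 'K\u0159\u00edd';             // "Kříd", 4 UTF-16 code units
const decomposed = precomposed.normalize('NFKD'); // base letters followed by combining marks
console.log(precomposed.length, decomposed.length); // 4 6

// Removing the combining marks (U+0300 to U+036F) leaves only the base letters.
const stripped = decomposed.replace(/[\u0300-\u036f]/g, '');
console.log(stripped); // "Krid"

// NFKD also applies compatibility mappings: the U+FB03 ligature becomes "ffi",
// while the canonical-only NFD form leaves it untouched.
console.log('\uFB03'.normalize('NFKD'));            // "ffi"
console.log('\uFB03'.normalize('NFD') === '\uFB03'); // true
```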

Important note

This doesn't work for some cases, like "Straße".localeContains("SS"), depending on whether you want to consider ß a substitute for SS. To find out more, read about the canonical and compatibility transformations described on MDN, where forms other than NFKD are also covered.

Credits to @LukasKalbertodt in the comments for mentioning the edge cases.
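
A comment on this answer suggests that folding through both upper and lower case should make the ß case work. Here is a minimal sketch of that idea, assuming you do want ß and SS to match; foldLoosely is a hypothetical name, not part of the answer's code.

```javascript
// Hypothetical variant of the ascii() helper above: upper-casing first expands
// "ß" to "SS", and lower-casing afterwards turns that into "ss".
const foldLoosely = s =>
  String(s)
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '')
    .toUpperCase()
    .toLowerCase();

console.log(foldLoosely('Stra\u00dfe'));                                 // "strasse"
console.log(foldLoosely('Stra\u00dfe').includes(foldLoosely('SS')));     // true
console.log(foldLoosely('Stra\u00dfe').includes(foldLoosely('\u00df'))); // true
```

This still ignores the locale entirely, so it remains a loose match rather than an equivalent of localeCompare.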

Lucilla answered 18/10, 2021 at 23:28 Comment(13)
I don't know what kind of black magic are you applying, but it works like a charm. Thanks!Bolanger
what is the sub = ""+sub; part good for?Homotaxis
@Homotaxis it converts sub to a string, so that it has the length property and the normalize method used below. Without it, localeContains would fail for a non-string argument like 42.Cann
Quick explanation of "black magic": normalize("NFD") decomposes all characters into base character and combining marks. The replace call removes all combining marks. Finally, it is lowercased. The name ascii is not quite correct, as the output is not necessarily ASCII.Cavie
This unfortunately doesn't have exactly the same behavior as localeCompare. For example: "ß".localeCompare("SS", "de", { sensitivity: "base" }) returns 0, but this function here would return false. Same for and ffi. Switching to "NFKD" normalization fixes the latter case at least. Converting and testing uppercased and lowercased versions should make the ß case work as well. But yeah, this is very tricky to get right!Cavie
@LukasKalbertodt good point. Then it would be nice to have full list of canonical and compatibility transformations. If you happen to find some resource with the complete list, please share. I assumed only stripping diacritics to compare. This can become very funky when dealing with traditional/simplified Chinese.Cann
Nope, I unfortunately don't have such a list. ß and ffi are just some of the standard "assumption breaking" unicode characters I test with in situations like these ^_^Cavie
This seems (on 1st read) like a great answer, but please move some explanation from the comments in answer and make it more complete. Gr8 work man.Allege
@DanteTheSmith I agree. Updated.Cann
I learned more about this topic and unfortunately this answer has more cases in which it has a different result than localeCompare. It comes down to the fact that it doesn't use the user's locale at all. "mäh".localeContains("a") should be true in German, but false in Swedish. "o" in "ø" true in English, false in Norwegian and Danish. "u" in "ü" true in German, false in Danish. So it should probably be called containsLoosely or something like that instead of localeContains. (Not to say this code isn't useful! It's just important to know its shortcomings.)Cavie
@LukasKalbertodt Then it seems I do not understand the locale well. Can you provide some links where you learned more about this topic?Cann
The introduction of the Unicode TR10 has some very nice examples. More examples throughout the document. You can put my examples into localeCompare (which OP referenced) and you will see the result I posted. E.g. "o".localeCompare("ø", "de", { sensitivity: "base" }) returns 0, "o".localeCompare("ø", "da", { sensitivity: "base" }) returns -1.Cavie
Poor performance is caused by multiple calls to localeCompare because each call instantiates its own Collator. A more performant solution would be to create an instance of Collator and call its compare method. My rather old locale-index-of package does this among other features.Diphyodont
S
2

If you are looking for more than one character here is a not very efficient but working option:

const localeContains = (a,b) => !!a.split('').filter((v,i)=>a.slice(i,b.length).localeCompare(b, "en", { sensitivity: 'base' })===0).length
const a = "RESERVE ME";
const b = "réservé";

console.log(localeContains(a,b));
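
One caveat about the one-liner above: the second argument of slice() is an end index rather than a length, so a.slice(i, b.length) can only produce a full-length window at index 0. The following is a minimal sketch of a variant that checks every position, assuming both strings use the same normalization form; the name localeContainsAnywhere is invented for this example.

```javascript
// Slide a window of b.length code units over a; the window must end at i + b.length.
const localeContainsAnywhere = (a, b) => {
  for (let i = 0; i + b.length <= a.length; i++) {
    const candidate = a.slice(i, i + b.length);
    if (candidate.localeCompare(b, 'en', { sensitivity: 'base' }) === 0) return true;
  }
  return false;
};

const needle = 'r\u00e9serv\u00e9'; // "réservé" in precomposed form
console.log(localeContainsAnywhere('RESERVE ME', needle));        // true
console.log(localeContainsAnywhere('PLEASE RESERVE ME', needle)); // true
console.log(localeContainsAnywhere('RESERVE ME', 'xyz'));         // false
```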
Sporulate answered 3/7, 2020 at 18:40 Comment(0)
T
2

chickens' answer does not work if the searched string is not at the beginning of the main string.

Use this package instead: https://www.npmjs.com/package/locale-includes

localeIncludes('RESERVE ME', 'éservé', {usage: 'search', sensitivity: 'base'});
// true

To make it even nicer to use as a string prototype function:

String.prototype.localeIncludes = function(str) {
  return localeIncludes(this, str, {usage: 'search', sensitivity: 'base'});
};

'RESERVE ME'.localeIncludes('éservé');
// true
Tortuga answered 23/9, 2021 at 9:29 Comment(1)
You save a lot of my time. Thank you so much, sir.Louise
S
1

You may either normalize the strings and use string.includes

// inspired by https://mcmap.net/q/667750/-filtering-a-list-of-strings-based-on-user-locale
/**
 * Returns true if searchString appears as a substring of the result of converting first argument
 * to a String, at one or more positions that are greater than or equal to position,
 * if compared in the current or specified locale; otherwise, returns false.
 * Options is considered to have { usage: 'search', sensitivity: 'base' } defaults
 * @param {string} string string to search in
 * @param {string} searchString string to search for
 * @param {string|string[]=} locales A locale string or array of locale strings that contain one or more language or locale tags. If you include more than one locale string, list them in descending order of priority so that the first entry is the preferred locale. If you omit this parameter, the default locale of the JavaScript runtime is used. This parameter must conform to BCP 47 standards; see the Intl.Collator object for details.
 * @param {Intl.CollatorOptions=} options An object that contains one or more properties that specify comparison options. see the Intl.Collator object for details.
 * @param {number=} position If position is undefined, 0 is assumed, so as to search all of the String.
 * @returns {boolean}
 */
function localeIncludes(string, searchString, locales, options, position = 0) {
  const optionsN = { usage: 'search', sensitivity: 'base', ...options ?? {} };
  const collator = new Intl.Collator(locales, optionsN);
  const { sensitivity, ignorePunctuation } = collator.resolvedOptions();
  function localeNormalize(string) {
    // `localeCompare` MUST `ToString` its arguments
    // We want to normalize our strings so `u'` does not include `u`
    let stringN = String(string).normalize('NFC');
    // If comparison is case-insensitive we want to normalize case
    if (sensitivity === 'base' || sensitivity === 'accent')
      stringN = stringN.toLocaleLowerCase(locales);
    // then we try to remove accents (you may cache letters in a Map to make it faster)
    return stringN.replaceAll(/./g, (letter) => {
      // first check if you can remove the character completely
      if (ignorePunctuation) {
        if (collator.compare(letter, '') === 0) return '';
      }
      let normalizedLetter = letter.normalize('NFD').replace(/[\u0300-\u036f]/gi, '');
      /*
       * // If you want you may add some custom normalizers (per-language)
       * const mapSv = new Map([ ['w', 'v'], ['ß', 'SS'] ])
       * if (lang === 'sv' && mapSv.has(letter)) return mapSv.get(letter);
       */
      return letter !== normalizedLetter && collator.compare(letter, normalizedLetter) === 0 ? normalizedLetter : letter;
    });
  }
  return localeNormalize(string).includes(localeNormalize(searchString));
}
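
The comment in the code above mentions caching letters in a Map. Here is a minimal sketch of what that could look like, under a few assumptions: makeLocaleFolder is a hypothetical name, the collator uses only sensitivity 'base' (the function above also passes usage: 'search'), punctuation handling is left out, and the engine ships full locale data.

```javascript
// Hypothetical factory: returns a fold(str) function bound to one locale.
function makeLocaleFolder(locales) {
  const collator = new Intl.Collator(locales, { sensitivity: 'base' });
  const cache = new Map(); // letter -> folded letter, filled lazily

  const foldLetter = letter => {
    let folded = cache.get(letter);
    if (folded === undefined) {
      const stripped = letter.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
      // Drop the accent only when the collator treats both forms as the same base letter.
      folded = collator.compare(letter, stripped) === 0 ? stripped : letter;
      cache.set(letter, folded);
    }
    return folded;
  };

  // Array.from iterates by code point, so surrogate pairs stay intact.
  return str =>
    Array.from(String(str).normalize('NFC').toLocaleLowerCase(locales), foldLetter).join('');
}

const foldDe = makeLocaleFolder('de');
const foldSv = makeLocaleFolder('sv');
console.log(foldDe('M\u00e4hne').includes(foldDe('a'))); // German: "ä" folds to "a"
console.log(foldSv('M\u00e4hne').includes(foldSv('a'))); // Swedish: "ä" stays a separate letter
```

Each distinct letter is compared by the collator once per folder, so repeated searches over a long list mostly hit the cache.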

or try to find a matching substring

/**
 * Returns true if searchString appears as a substring of the result of converting first argument
 * to a String, at one or more positions that are greater than or equal to position,
 * if compared in the current or specified locale; otherwise, returns false.
 * Collators with `numeric` and `ignorePunctuation` options are not supported.
 * @param {string} string string to search in
 * @param {string} searchString string to search for
 * @param {string|string[]=} locales A locale string or array of locale strings that contain one or more language or locale tags. If you include more than one locale string, list them in descending order of priority so that the first entry is the preferred locale. If you omit this parameter, the default locale of the JavaScript runtime is used. This parameter must conform to BCP 47 standards; see the Intl.Collator object for details.
 * @param {Intl.CollatorOptions=} options An object that contains one or more properties that specify comparison options. see the Intl.Collator object for details.
 * @param {number=} position If position is undefined, 0 is assumed, so as to search all of the String.
 * @returns {boolean}
 */
function localeIncludes(string, searchString, locales, options, position = 0) {
  // `localeCompare` uses `Intl.Collator.compare` under the hood
  // `localeCompare` casts `ToString` over both arguments
  // We don't want "á" to contain "a", so we should normalize the strings first.
  // `Intl.Collator` uses Canonical Equivalence according to the Unicode Standard, so normalization won't change the order
  const stringN = String(string).normalize();
  const searchStringN = String(searchString).normalize();
  const collator = new Intl.Collator(locales, options);
  /*
   * // if you can have strings of different length (like with `ignorePunctuation`), you'll have to check every substring
   * for (let i = 0; i < string.length; i++) {
   *   for (let j = i; j < string.length; j++) {
   *     // WARNING, THIS IS $ O(n^2) $
   *     let substring = string.substring(i, i + searchString.length);
   *     if (collator.compare(substring, searchString) === 0) return i;
   *   }
   * }
   */
  for (let i = position; i <= stringN.length - searchStringN.length; i++) {
    // non-numeric non-ignorePunctuation `collator` expected
    const substring = stringN.substring(i, i + searchStringN.length);
    if (collator.compare(substring, searchStringN) === 0)
      return true;
  }
  return false;
}

Or you can use a hack based on window.find, which works exactly as if you searched the page with Ctrl-F

function iframeIncludes(string, searchString, locale) {
  const iframe = document.createElement('iframe');
  iframe.style = 'position: fixed; top: 0; left: 0;';
  document.body.append(iframe);
  const iframeDoc = iframe.contentDocument;
  iframeDoc.open();
  // you MUST use <pre> otherwise it doesn't work
  iframeDoc.write(`
    <html lang="${locale}">
      <body>
        <pre></pre>
      </body>
    </html>
  `);
  iframeDoc.close();
  const pre = iframeDoc.querySelector('pre');
  pre.innerText = string;
  const result = iframe.contentWindow.find(searchString);
  iframe.remove();
  return result;
}
Saltish answered 25/9, 2023 at 20:57 Comment(15)
@LukasKalbertodt ping. Not really different from the lib in the other answer, but I guess faster, and has explanation.Saltish
Thanks for the answer. This unfortunately has the same problem as the library: it uses the naive string search algorithm which has quite a suboptimal asymptotic runtime. Sure, in many situations that doesn't matter, but this certainly isn't the best one can do :/ Regardless of that: your answer could use an introduction sentence or two, just saying "this just uses localeCompare at every possible position" or sth like that.Cavie
@LukasKalbertodt Thanks for pointing that out, I did find the correct way nowSaltish
@LukasKalbertodt there is a { usage: "search" } collator option, but I didn't find any mention of someone using that (and failed to make a working script) /tableflipSaltish
Yeah, that option is interesting. It sounds exactly like what we want, but there is not really a method to properly use it with. We only have compare which checks for full string matches. So it seems like one really has to build a substring search on top of that. Which ... huff. The other approach I see is understanding the relevant Unicode documents in more detail and implementing everything manually. I am researching right now as well.Cavie
@LukasKalbertodt you may try #47330096 (lang-normalizing characters and using string.includes), but idk how'll that work with multichars like aa and ff. Probably won't.Saltish
@LukasKalbertodt added localeNormalize variant. Could you make some test cases? I never used languages with accentsSaltish
@LukasKalbertodt added window.find methodSaltish
I collected a bunch of test cases. But these are by far not exhaustive. gist.github.com/LukasKalbertodt/…Cavie
Ha, didn't know window.find. While probably not relevant for production code, points for thinking outside the box. I find your first solution very interesting as it only replaces accented characters if the collator says they are equal. That might solve the tricky a ä o ø u ü cases. I would recommend putting all explaining comments into your first two code boxes directly and then removing the last section "explanation".Cavie
@LukasKalbertodt you should RFC Intl.Collator.find srsly. ... I did look it up, it's github.com/tc39/ecma402/issues/506Saltish
Oh thanks, that's a super helpful link! Yes I agree this should be a browser API. Nearly impossible to implement it yourself 100% correctly.Cavie
@LukasKalbertodt added "replace punctuation with empty" normalizer case, "custom normalizer map", removed explanation. Now should work with punctuationSaltish
@LukasKalbertodt does this answer your question btw?Saltish
Thanks for the updates and the research you put into this. While your answer doesn't provide a perfect solution (I don't think there currently is; we have to wait for an RFC), it is helpful and provides the best implementation of all answers (I think). Will award the bounty to your answer. Thank you.Cavie
D
0

Internally localeCompare instantiates Intl.Collator for every call, and since the search for a substring will attempt comparison multiple times, it’s best to instantiate it once by ourselves, and then use its compare method.
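
A rough sketch of that idea: one Intl.Collator is created up front and its compare method is reused for every window. The helper name collatorIndexOf, the 'en' locale and the sample names are all assumptions made for this example, and the loop is still the naive window scan.

```javascript
// One collator for all comparisons; locale and options here are illustrative.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// Hypothetical helper: index of the first window of haystack equal to needle, or -1.
function collatorIndexOf(haystack, needle) {
  for (let i = 0; i + needle.length <= haystack.length; i++) {
    if (collator.compare(haystack.slice(i, i + needle.length), needle) === 0) return i;
  }
  return -1;
}

const names = ['Jos\u00e9', 'Ren\u00e9e', 'Zo\u00eb', 'Bob']; // José, Renée, Zoë, Bob
const matches = names.filter(name => collatorIndexOf(name, 'e') !== -1);
console.log(matches); // every name except "Bob"
```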

Another gotcha is hidden in the iteration along the string: unfortunately the naive methods will split some complex graphemes in half. To avoid this, we should use Intl.Segmenter, or a for-of loop over the string in its absence.
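
To make the grapheme problem concrete, the sketch below counts the same text in three ways. The graphemes helper is a hypothetical name, and the fallback branch only iterates code points, so it still splits combining sequences.

```javascript
// "é" written as e + U+0301, followed by a family emoji built from several code points.
const text = 'e\u0301\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

// Hypothetical helper: split into user-perceived characters where possible.
function graphemes(str, locale = 'en') {
  if (typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
    return Array.from(segmenter.segment(str), part => part.segment);
  }
  return Array.from(str); // fallback: code points only
}

console.log(text.split('').length);   // 10 UTF-16 code units
console.log(Array.from(text).length); // 7 code points
console.log(graphemes(text).length);  // 2 where Intl.Segmenter is available
```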

If you try to "normalize" strings to use non-locale-aware contains, you will enter a land of a thousand traps, breaking stuff in subtle and interesting ways, which you likely don’t want to learn about. Curious? Others have already linked to this issue, which deals exactly with adding support for localeContains; its comments are a trove of crunchy edge cases. Some more details are in the ECMA402 meeting notes on this topic.
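
One of those traps, taken from the comments earlier in this thread, can be sketched as follows: the same pair of letters is equal in one locale and different in another, while a locale-blind fold cannot tell the two apart. This assumes an engine with full locale data; the fold helper is only illustrative.

```javascript
const german = new Intl.Collator('de', { sensitivity: 'base' });
const swedish = new Intl.Collator('sv', { sensitivity: 'base' });

console.log(german.compare('a', '\u00e4') === 0);  // true: "ä" is an accented "a" in German
console.log(swedish.compare('a', '\u00e4') === 0); // false: "ä" is a separate letter in Swedish

// A locale-blind fold gives the same answer for both languages.
const fold = s => s.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase();
console.log(fold('m\u00e4h').includes('a')); // true regardless of locale
```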

My takeaway from reading experts’ opinions in the links above is that even with Intl.Segmenter and Intl.Collator some edge cases remain. Maybe they will be addressed via the potential Intl.Search object. Until then I’ve combined the best approaches in the locale-index-of library.

Diphyodont answered 27/5 at 22:15 Comment(0)
