I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those with diacritical marks)?"
I'm forcing a field in a UI to match the format: last_name, first_name
(last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than other languages/platforms.
This was my original version, until I wanted to add diacritic support:
/^[a-zA-Z]+,\s[a-zA-Z]+$/
Currently I'm debating one of three methods to add support, all of which I have tested and work (at least to some extent, I don't really know what the "extent" is of the second approach). Here they are:
Explicitly listing all accented characters that I would want to accept as valid (lame and overly-complicated):
var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/
- This correctly matches a last/first name with any of the supported accented characters in
accentedCharacters
.
My other approach was to use the .
character class, to have a simpler expression:
var regex = /^.+,\s.+$/;
- This would match for just about anything, at least in the form of:
something, something
. That's alright I suppose...
The last approach, which I just found might be simpler...
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
- It matches a range of Unicode characters - tested and working, though I didn't try anything crazy, just the normal stuff I see in our language department for faculty member names.
Here are my concerns:
The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
The second solution is better, concise, but it probably matches far more than it actually should. I couldn't find any real documentation on exactly what
.
matches, just the generalization of "any character except the newline character" (from a table on the MDN).The third solution seems the be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table/continuation of that table,
\u00C0-\u017F
seems to be pretty solid, at least for my expected input.
- Faculty won't be submitting forms with their names in their native language (e.g., Arabic, Chinese, Japanese, etc.), so I don't have to worry about out-of-Latin-character-set characters
Which of these three approaches is most suited for the task? Or are there better solutions?
regex = /^[^,]+,\s[^,]+$/;
to prevent that. – Sturtevant.
atom matches anything except newlines" actually is quite exact :-) – Twine