So after some research, using '\w' in .NET is equivalent to:
using System.Collections.Generic;
using System.Globalization;

public static class Extensions {
    /// <summary>
    /// The Unicode categories that '\w' matches.
    /// </summary>
    // [NotNull] is a nullability annotation (e.g. from JetBrains.Annotations); it can be dropped without changing behaviour.
    [NotNull]
    private static readonly HashSet<UnicodeCategory> _wordCategories = new HashSet<UnicodeCategory>(
        new[]
        {
            UnicodeCategory.DecimalDigitNumber,
            UnicodeCategory.UppercaseLetter,
            UnicodeCategory.ConnectorPunctuation,
            UnicodeCategory.LowercaseLetter,
            UnicodeCategory.OtherLetter,
            UnicodeCategory.TitlecaseLetter,
            UnicodeCategory.ModifierLetter,
            UnicodeCategory.NonSpacingMark,
        });

    /// <summary>
    /// Determines whether the specified character is a word character (equivalent to '\w').
    /// </summary>
    /// <param name="c">The character to test.</param>
    /// <returns>true if the character falls in one of the word categories; otherwise false.</returns>
    public static bool IsWord(this char c) => _wordCategories.Contains(char.GetUnicodeCategory(c));
}
I've written this as an extension method so it is easy to use on any character c: just invoke c.IsWord(), which returns true if the character is a word character. This should be significantly quicker than using a Regex.
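As a quick illustration of the call pattern, here is a minimal, self-contained sketch. The class name WordCharDemo and the field name WordCategories are made up for this example; the category set simply restates the one in the answer so the snippet compiles on its own. The timing loop is only a rough comparison, not a proper benchmark, and the numbers will vary by machine and runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // Restates the answer's category set so this sketch compiles on its own.
    private static readonly HashSet<UnicodeCategory> WordCategories = new HashSet<UnicodeCategory>
    {
        UnicodeCategory.UppercaseLetter,
        UnicodeCategory.LowercaseLetter,
        UnicodeCategory.TitlecaseLetter,
        UnicodeCategory.ModifierLetter,
        UnicodeCategory.OtherLetter,
        UnicodeCategory.NonSpacingMark,
        UnicodeCategory.DecimalDigitNumber,
        UnicodeCategory.ConnectorPunctuation,
    };

    public static bool IsWord(this char c) => WordCategories.Contains(char.GetUnicodeCategory(c));

    public static void Main()
    {
        Console.WriteLine('a'.IsWord());      // True  (LowercaseLetter)
        Console.WriteLine('_'.IsWord());      // True  (ConnectorPunctuation)
        Console.WriteLine('\u00E9'.IsWord()); // True  (e with acute accent, LowercaseLetter)
        Console.WriteLine('-'.IsWord());      // False (DashPunctuation)
        Console.WriteLine(' '.IsWord());      // False (SpaceSeparator)

        // Rough timing over every char value; results vary by machine and runtime.
        var regex = new Regex(@"\w");
        int hits = 0;
        var sw = Stopwatch.StartNew();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
            if (((char)i).IsWord()) hits++;
        Console.WriteLine($"IsWord: {hits} matches in {sw.Elapsed.TotalMilliseconds} ms");

        hits = 0;
        sw.Restart();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
            if (regex.IsMatch(((char)i).ToString())) hits++;
        Console.WriteLine($"Regex:  {hits} matches in {sw.Elapsed.TotalMilliseconds} ms");
    }
}
```

For anything beyond a sanity check, a dedicated benchmarking harness would give more trustworthy numbers than a single Stopwatch pass.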
Interestingly, this doesn't appear to match the .NET specification: '\w' in fact matches 938 'NonSpacingMark' characters, which are not mentioned.
In total this matches 49,760 of the 65,535 characters, so the simple regexes often shown on the web are incomplete.
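One way to reproduce counts like these is to run '\w' against every char value and tally the Unicode category of each match. The sketch below does that; the class and method names (WordCategoryCount, CountByCategory) are made up for this example. The exact figures depend on the Unicode data shipped with the runtime, so they may not equal the numbers quoted above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCategoryCount
{
    // Tallies the Unicode category of every UTF-16 code unit that \w matches.
    public static SortedDictionary<UnicodeCategory, int> CountByCategory()
    {
        var word = new Regex(@"\w");
        var counts = new SortedDictionary<UnicodeCategory, int>();
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            if (!word.IsMatch(c.ToString())) continue;

            UnicodeCategory category = char.GetUnicodeCategory(c);
            counts.TryGetValue(category, out int seen); // seen stays 0 for a new category
            counts[category] = seen + 1;
        }
        return counts;
    }

    public static void Main()
    {
        SortedDictionary<UnicodeCategory, int> counts = CountByCategory();
        foreach (KeyValuePair<UnicodeCategory, int> pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        Console.WriteLine($"Total: {counts.Values.Sum()}");
    }
}
```

If the category list in the answer is complete, the keys printed here should be exactly the eight categories in _wordCategories.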
\w+
would potentially match any word, no matter how crazy, as long as its contents are lowercase or uppercase letters, digits 0-9 and a few (10) special characters (like the _ underscore). It would be shorthand for writing something like [a-zA-Z0-9_]+
– Leesa
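To see how the ASCII shorthand and .NET's default '\w' differ in practice, here is a small sketch (the class name AsciiVersusUnicodeWord and the sample input are made up for this example). It compares match lengths on a word containing an accented letter and also shows RegexOptions.ECMAScript, which restricts '\w' to the ASCII set.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AsciiVersusUnicodeWord
{
    // Returns the length of the first match of pattern in input.
    public static int MatchLength(string input, string pattern, RegexOptions options = RegexOptions.None)
        => Regex.Match(input, pattern, options).Length;

    public static void Main()
    {
        const string input = "caf\u00E9"; // "cafe" with an acute accent on the final e

        // Default .NET \w is Unicode-aware, so the accented letter is included.
        Console.WriteLine(MatchLength(input, @"\w+"));                          // 4

        // The ASCII character class stops at the accented letter.
        Console.WriteLine(MatchLength(input, @"[a-zA-Z0-9_]+"));                // 3

        // ECMAScript mode makes \w behave like the ASCII class.
        Console.WriteLine(MatchLength(input, @"\w+", RegexOptions.ECMAScript)); // 3
    }
}
```

So the shorthand only holds in .NET when ECMAScript-compatible behaviour is requested; otherwise '\w' covers the much larger Unicode set.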