Match C# Unicode Identifier using Regex

What is the right way to match a C# identifier, specifically a property or field name, using .NET Regex patterns?

Background: I used to use the ASCII-centric pattern @"[_a-zA-Z][_a-zA-Z0-9]*". But Unicode uppercase and lowercase characters are also legal in identifiers, e.g. "AboöДЖem". How should I include these in the pattern?

Thanks, Max

Beekeeping answered 9/12, 2010 at 16:8 Comment(0)
B
-3

Isn't that problem solved by the predefined character classes in regex? \w will match öД.

Bergerac answered 9/12, 2010 at 16:23 Comment(2)
Thanks. Now I can do mixed programming in Glagolitic and Hieroglyphics. ;) - Beekeeping
You can't just use @"\w+" to match an identifier: it would include words that start with numbers, e.g. it would match "12abc", which is an invalid identifier. I propose @"[\w-[0-9]]\w*" as a solution to that. - Cess
I
8

Here's a version that takes into account the disallowed leading digits:

^(?:((?!\d)\w+(?:\.(?!\d)\w+)*)\.)?((?!\d)\w+)$

And here are some tests in PowerShell:

[regex]$regex = '(?x:
    ^                        # Start of string
    (?:
        (                    # Namespace
            (?!\d)\w+        #   Top-level namespace
            (?:\.(?!\d)\w+)* #   Subsequent namespaces
        )
        \.                   # End of namespaces period
    )?                       # Namespace is optional
    ((?!\d)\w+)              # Class name
    $                        # End of string
)'
@(
    'System.Data.Doohickey'
    '_1System.Data.Doohickey'
    'System.String'
    'System.Data.SqlClient.SqlConnection'
    'DoohickeyClass'
    'Stackoverflow.Q4400348.AboöДЖem'
    '1System.Data.Doohickey' # numbers not allowed at start of namespace
    'System.Data.1Doohickey' # numbers not allowed at start of class
    'global::DoohickeyClass' # "global::" not part of actual namespace
) | %{
    ($isMatch, $namespace, $class) = ($false, $null, $null)
    if ($_ -match $regex) {
        ($isMatch, $namespace, $class) = ($true, $Matches[1], $Matches[2])
    }
    new-object PSObject -prop @{
        'IsMatch'   = $isMatch
        'Name'      = $_
        'Namespace' = $namespace
        'Class'     = $class
    }
} | ft IsMatch, Name, Namespace, Class -auto
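
For readers who want to call the same pattern from C# rather than PowerShell, here is a minimal sketch. The class, method, and group names (`QualifiedNameParser`, `TryParse`, `ns`, `cls`) are hypothetical and chosen only for illustration; the pattern is the one above, with the two capture groups given names.

```csharp
using System;
using System.Text.RegularExpressions;

static class QualifiedNameParser
{
    // Same pattern as above; the groups are named instead of numbered.
    // Note: in .NET, $ also matches just before a final newline.
    private static readonly Regex QualifiedName = new Regex(
        @"^(?:(?<ns>(?!\d)\w+(?:\.(?!\d)\w+)*)\.)?(?<cls>(?!\d)\w+)$");

    // Returns true when the input matches; ns is empty when there is no namespace part.
    public static bool TryParse(string input, out string ns, out string cls)
    {
        Match m = QualifiedName.Match(input);
        ns = m.Groups["ns"].Value;   // Group.Value is "" when the group did not participate
        cls = m.Groups["cls"].Value;
        return m.Success;
    }
}

static class QualifiedNameDemo
{
    static void Main()
    {
        string[] samples =
        {
            "System.Data.Doohickey",
            "DoohickeyClass",
            "Stackoverflow.Q4400348.AboöДЖem",
            "System.Data.1Doohickey",
            "global::DoohickeyClass",
        };

        foreach (string sample in samples)
        {
            bool ok = QualifiedNameParser.TryParse(sample, out string ns, out string cls);
            Console.WriteLine($"{sample}: match={ok}, namespace='{ns}', class='{cls}'");
        }
    }
}
```

The last two samples are expected not to match, for the same reasons given in the PowerShell comments.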
Ivett answered 19/10, 2012 at 18:40 Comment(3)
Does this allow for a leading underscore? - Clingfish
@Clingfish It does allow for a leading underscore. - Mobile
Unfortunately, this code doesn't match C#'s @ prefix, which allows C# keywords to be used as identifiers. - Catchy
F
6

According to http://msdn.microsoft.com/en-us/library/aa664670.aspx, and ignoring the keyword and unicode-escape-sequence rules, the pattern is:

@?[_\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}][\p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]*
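
A minimal sketch of how this pattern might be used from C#, assuming the goal is to validate a whole string as a single identifier. The names `IdentifierMatcher` and `IsIdentifier` are hypothetical, the `\A`/`\z` anchors are an addition for whole-string matching, and `\p{L}` is used as shorthand for the five letter categories, as suggested in the comments. Like the pattern itself, it does not reject keywords or handle unicode escape sequences.

```csharp
using System;
using System.Text.RegularExpressions;

static class IdentifierMatcher
{
    // First character: underscore, any letter (\p{L} = Lu, Ll, Lt, Lm, Lo), or a letter number.
    private const string Start = @"[_\p{L}\p{Nl}]";

    // Following characters: letters, letter numbers, combining marks,
    // decimal digits, connector punctuation (includes '_'), and format characters.
    private const string Part = @"[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\p{Cf}]";

    // \A and \z anchor to the whole input, so "12abc" cannot match on its "abc" tail.
    private static readonly Regex Identifier =
        new Regex(@"\A@?" + Start + Part + @"*\z", RegexOptions.CultureInvariant);

    public static bool IsIdentifier(string candidate) => Identifier.IsMatch(candidate);
}

static class IdentifierDemo
{
    static void Main()
    {
        foreach (string name in new[] { "AboöДЖem", "_field1", "@class", "12abc", "a-b" })
        {
            Console.WriteLine($"{name}: {IdentifierMatcher.IsIdentifier(name)}");
        }
    }
}
```

In this sketch the first three samples should be accepted and the last two rejected. Note that "@class" passes only because the pattern allows the optional @ prefix; the regex itself knows nothing about keywords.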
Firenze answered 9/12, 2010 at 16:21 Comment(2)
I think you can simplify \p{Lu}\p{Ll}\p{Lt}\p{Lm}\p{Lo} to just \p{L}. The same goes for some of the other character classes used here. I suppose if that's what Microsoft has put in its standards, it might be simplest to do the same. - Axle
Here's the way the compiler determines it, right from the source, in case anyone is interested: github.com/dotnet/roslyn/blob/… - Socialite
