TL;DR I'll start with a precise and relatively concise answer. The rest of this answer is for those wanting to know more about built in rules in general and/or to drill down into ident in particular.
<.ident> function/capture
Because of the leading dot (.), <.ident> only matches; it doesn't capture[1]. For the rest of this answer I'll generally omit the dot because it makes no difference to a rule's meaning besides the capture aspect.
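To make the capture distinction concrete, here is a minimal sketch that matches the same string with and without the dot. The variable names are mine, chosen purely for illustration:

```raku
# Match the same input with a capturing and a non-capturing call.
my $capturing     = 'abc' ~~ / <ident>  /;
my $non-capturing = 'abc' ~~ / <.ident> /;

say ~$capturing;                      # abc
say ~$non-capturing;                  # abc
say $capturing<ident>.defined;        # True  (a named capture was stored)
say $non-capturing<ident>.defined;    # False (matched, but nothing stored)
```

Both forms match the same text; only the first leaves an ident entry in the resulting match object.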
Just as you can invoke (aka "call") one function within the declaration of another in programming languages, so too you can invoke a rule/token/regex/method (hereafter I'll generally just use the term "rule") within the declaration of another rule. <foo> is the syntax used to invoke a rule named foo; so <ident> invokes a rule (method) named ident.
At the time I write this, the XML::Grammar grammar does not itself define/declare a rule named ident. That means the call ends up dispatched to a built in declaration with that name.
The built in ident rule does precisely the same as if it were declared as:
token ident {
[ <alpha> ]
[ <alnum> ]*
}
The official Predefined character classes doc should provide precise definitions of <alpha> and <alnum>. Alternatively, the relevant details are also included later on in this answer.
The bottom line is that ident matches a string of one or more "alphanumeric" characters except that the first character cannot be a "number".
Thus both abc and def123 match, whereas 123abc does not.
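A quick way to check those three strings is sketched below. The helper name is-ident is hypothetical, introduced only for this illustration:

```raku
# Anchor with ^ and $ so the whole string must be one <ident> match.
sub is-ident(Str $s --> Bool) { so $s ~~ / ^ <ident> $ / }

say $_, ' => ', is-ident($_) for <abc def123 123abc>;
# abc => True
# def123 => True
# 123abc => False
```

Without the anchors, 123abc would still produce a match, because <ident> would simply skip the digits and match abc.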
The rest of this answer
For those interested in detail worth knowing I've written the following sections:
Raku (standard language and class details)
Rakudo (high level implementation)
NQP (mid level implementation)
MoarVM (low level implementation)
The specification and "specification" of ident
(Corrections of) documentation of <ident>, "character class" and "identifier"
ident vs Raku identifiers
Raku (standard language and class details)
XML::Grammar is a user defined Raku grammar. A Raku grammar is a class. ("Grammars are really just slightly specialized classes".)
A Raku rule is a regex is a method:
grammar foo { rule ident { ... } }
say foo.^lookup('ident').WHAT; # (Regex)
say Regex ~~ Method; # True
A rule call, like <ident>, in a grammar, is typically invoked as a result of calling .parse or similar on the grammar. The .parse call matches the input string according to the rules in the grammar.
When an occurrence of <ident> within XML::Grammar is evaluated during a match, the result is an ident method (rule) call on an instance of XML::Grammar (the .parse call creates an instance of its invocant if it's just a type object).
Because XML::Grammar does not itself define a rule/method of that name, the ident call is instead dispatched according to standard method resolution, er, rules. (I'm using the word "rules" here in the generic non-Raku specific sense. Ah, language.)
In Raku, any class created using a declaration of the form grammar foo { ... } automatically inherits from the Grammar class which in turn inherits from the Match class:
say .^mro given grammar foo {} # ((foo) (Grammar) (Match) (Capture) (Cool) (Any) (Mu))
ident is found in the built in Match class.
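The dispatch behaviour described in this section can be sketched with two small grammars. The names UsesBuiltin and OverridesIdent are hypothetical, made up for this example; the second deliberately declares a silly ident of its own:

```raku
# No ident rule of its own, so <ident> resolves to the built in one.
grammar UsesBuiltin {
    token TOP { <ident> }
}

# Declares its own ident, which method resolution finds first.
grammar OverridesIdent {
    token TOP   { <ident> }
    token ident { \d+ }
}

say so UsesBuiltin.parse('abc');      # True
say so UsesBuiltin.parse('123');      # False
say so OverridesIdent.parse('abc');   # False
say so OverridesIdent.parse('123');   # True
```

The TOP tokens are identical; only the presence of a local ident changes which method the <ident> call ends up in.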
Rakudo (high level implementation)
In the Rakudo compiler, the Match class does the role NQPMatchRole.
This NQPMatchRole is where the highest level implementation of ident is found.
NQP (mid level implementation)
NQPMatchRole is written in the nqp language, a subset of Raku used to bootstrap the full Raku, and the heart of NQP, a compiler toolkit.
Excerpting and reformatting just the most salient code from the ident declaration, the match for the first character boils down to:
nqp::ord($target, $!pos) == 95
|| nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos)
This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.
The other bit of salient code is:
nqp::findnotcclass( nqp::const::CCLASS_WORD
This matches zero or more subsequent characters in the character class CCLASS_WORD.
A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.
NQP targets multiple "backends" or concrete virtual machines. Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of its backends to verify what's what.
MoarVM (low level implementation)
MoarVM is the only officially supported backend.
A search of MoarVM for CCLASS shows several matches.
The important one seems to be ops.c which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.
According to the code's comments:
CCLASS_ALPHABETIC currently matches exactly the same characters as the full Raku or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters". I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).
CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any script, not just the ASCII digits 0 to 9). I think that means the Raku / NQP <alnum> rule is equivalent to CCLASS_WORD.
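If those readings are right, the built in rule can be approximated using Unicode properties directly. The following is only a sketch under that assumption: ident-sketch is a hypothetical token I wrote for comparison, not code from Rakudo, NQP or MoarVM, and it prints both results side by side so any disagreement is visible rather than assumed away:

```raku
# Hand-written approximation of ident, assuming the equivalences above:
# first character is _ or a Letter; the rest are _, Letters or decimal digits.
my token ident-sketch { [ _ | <:L> ] [ _ | <:L> | <:Nd> ]* }

# ASCII and non-ASCII samples; the Devanagari digit three is in <:Nd>.
for 'abc', '_x1', 'été', 'x३', '1abc', '३x' -> $s {
    my $builtin = ?( $s ~~ / ^ <ident> $ / );
    my $sketch  = ?( $s ~~ / ^ <ident-sketch> $ / );
    say "$s builtin=$builtin sketch=$sketch";
}
```

Agreement on a handful of samples is evidence, not proof; the backend's character class code remains the authority on edge cases.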
The specification and "specification" of ident
The official specification of Raku is embodied in roast[2].
A search of roast for ident shows several matches.
Most use <ident> only incidentally, as part of testing something else. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.
Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:
From S05-mass/rx.t:
ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';
ok tests that its first argument evaluates to True. This call tests that <ident> skips 2+3 and matches ab2.
From S05-mass/charsets.t:
is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';
is tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).
Here's a variation of this test that more clearly shows the matching that happens:
say ~$_ for $latin-chars ~~ m:g/<ident>/;
prints:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ
But <ident> will match a whole lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.
So let's look at the official speculation that may, with care, be considered related to "specification".
First, we note the warning at the top:
Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.
The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast, not any human language verbiage.
(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but the official view is that "specs", as applied to the design docs, should be considered to be short for "speculations" to emphasize that they are not something to be fully relied upon.)
A search for ident in design.raku.org shows several matches.
The most useful match is in the Predefined Subrules section of S05:
These are some of the predefined subrules for any grammar or regex:
- ident ... Match an identifier.
Uhoh...
(Corrections of) documentation of <ident>, "character class" and "identifier"
From Predefined character classes in the official doc:
Class Description
<ident> Identifier. Also a default rule.
This is misleading in three ways:
ident is not a character class. Character classes match a single character in that character class; if used with a quantifier they just match a string of such characters, each of which can be any character from that class. In contrast <ident> matches a particular pattern of characters. It may be one character but you can't control that; the rule is greedy, matching as many characters as fit the pattern. If you apply a quantifier it controls repetition of the overall rule, not how many characters are included in a single match of the rule.
All built in rules are default rules. I think the default comment is there to emphasize that you can write your own ident rule if you don't like the built-in pattern. This is true for all rules, though it will typically make much less sense to override built ins such as canonical character classes like <lower> (lowercase).
ident does not match identifiers! Or, more accurately, it doesn't do so on its own for most Raku identifiers. See the next section for the details.
ident vs Raku identifiers
my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >;
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)
In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:
token identifier { <.ident> [ <[\-']> <.ident> ]* }
In Raku's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:
token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }
So <identifier> matches a pattern that includes one or more <ident>s with <apostrophe>s in between.
The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.
But the identifier methods are not exported by either Raku or nqp. So they are not part of the rule namespace of users' grammars.
If we write our own identifier token we can see it in action:
my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)
displays:
($bar %hash Foo your_ident my-ident)
To summarize the above and some other considerations:
<ident> matches just parts of what <identifier> matches (though they're the same for simple names). Consider is-prime. This is a Raku identifier but contains two <ident> matches (is and prime).
<identifier> matches just parts of "Raku identifiers" (though they're the same for simple names). Consider infix:<+>. This is sometimes referred to as a Raku identifier but requires both an <identifier> match and a match of :<+>.
Raku identifiers are themselves just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
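These three levels can be checked directly with .comb, reusing the hand-written identifier token from earlier in this section (it is a lexical copy for illustration, not something Raku exports):

```raku
# Same token as shown earlier, declared lexically for illustration.
my token identifier { <.ident> [ <[\-']> <.ident> ]* }

say 'is-prime'.comb(/ <ident> /);                 # (is prime)
say 'is-prime'.comb(/ <identifier> /);            # (is-prime)
say 'infix:<+>'.comb(/ <identifier> /);           # (infix)
say 'Foo-Bar::Baz-Qux'.comb(/ <identifier> /);    # (Foo-Bar Baz-Qux)
say 'Foo-Bar::Baz-Qux'.comb(/ <ident> /);         # (Foo Bar Baz Qux)
```

Each line shows the pieces a given rule carves out of the same text, which is why neither rule alone recognizes a full Raku name.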
Footnotes
[1] If you're not sure what a capture is, see Capturing, Named captures and Subrules.
[2] The official specification of Raku is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Raku. When I first wrote this answer there had only been two official branches/versions of roast, and therefore of Raku. The first was 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second was 6.c.errata, which conservatively added corrections to 6.c deemed sufficiently important and backwards compatible to be included in the (then) current official recommended version of Raku. An "officially compliant" Raku compiler passes some official branch of roast. The Rakudo compiler (then) passed 6.c.errata. If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Raku language.