When is white space really important in Perl6 grammars?

Can someone clarify when white space is significant in rules in Perl 6 grammars? I am learning some of it by trial and error, but can't seem to find the actual rules in the documentation.

Example 1:

rule number {
    <pm> \d '.'? \d*[ <pm> \d* ]?
}

rule pm {
    [ '+' || '-' ]?
}

This will match the number 2.68156e+154 and does not care about the spaces that are present in rule number. However, if I add a space after \d*, it fails (i.e. <pm> \d '.'? \d* [ <pm> \d* ]? fails).

Example 2: If I am trying to find literals in the middle of a word, then the spacing around them is important. For instance, in finding the entry Double_t Delta_phi_R_1_9_pTproj_13_dat_cent_fx3001[52] = {

grammar TOP {
    ^ .*? <word-to-find> .* ?
}
rule word-to-find {
    \w*?fx\w*
}

This will find the word. However, if the definition of the rule word-to-find is changed to fx, or \w* fx\w*, or \w*fx \w*, then it won't match.

Also, the definition '[52]' will match, while the definition 'fx[52]' will not.

Thanks for any insight. A pointer to the proper point in the documentation would help greatly! Thanks,

Objection answered 20/2, 2018 at 18:52 Comment(2)
I recommend using token instead of rule, and adding <.ws> manually. - Reproduce
@BradGilbert is your recommendation for this specific case or intended to be more general? - Senegal

can someone clarify when white space is significant in rules in Perl 6 grammars?

When :sigspace is in effect.

I'll provide a little more detail below. If you or anyone else reading this needs further details, let me know via comments and I'll expand further.

First, let's eliminate one possible source of confusion, namely the meaning of the words rule and regex in the context of Perl 6, before I provide the doc link.

The word rule may be used in either a generic sense ("the regular expression, string matching and general-purpose parsing facility of Perl 6") or as a keyword (rule). Similarly, regex may be used to mean much the same thing as the generic rule or as a keyword (regex).

With that preamble out of the way, here's a link to the :sigspace doc section.

Note that the rule keyword implicitly inserts a :sigspace such that it takes effect immediately following the first atom in the declared rule, and that the effect is lexical. See @smls's answer to another SO question, especially the first two bullet points, for detailed discussion of these two important details.

You may also find my answer to another SO question dealing with whitespace/tokenization helpful.

Hth.

Posse answered 21/2, 2018 at 0:3 Comment(0)

In a rule, whitespace is turned into a <.ws> (that is, a non-capturing call to the ws token) except:

  • At the start of the rule, before the first atom
  • At the start of a [ (group) or ( (positional capture)
  • After ||, |, and &
  • After a variable declaration (:my $x = 'foo';)
  • After a code block
  • After the % operator for introducing a separator
  • After the ~ goal-matching operator
  • After an internal modifier (such as :i)
  • Inside of a construct like $<var> = x

Or, probably easier to remember, it will be inserted after any construct that could match some characters and after any zero-width assertion.
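As a minimal sketch of how this plays out for the exponent group in the question (the grammar name G and the rule names tight and loose are made up for illustration, and the default ws is assumed), compare whitespace before the [ with whitespace just inside it:

```raku
grammar G {
    # No whitespace between \d+ and [ : no <.ws> is inserted there.
    # Roughly: \d+ [ 'e' \d+ <.ws> ]? <.ws>
    rule tight { \d+[ 'e'\d+ ]? }

    # Whitespace after \d+ : a <.ws> is inserted before the group.
    # Roughly: \d+ <.ws> [ 'e' \d+ <.ws> ]? <.ws>
    rule loose { \d+ [ 'e'\d+ ]? }
}

say so G.parse('2e5',  :rule<tight>);   # True
say so G.parse('2e5',  :rule<loose>);   # False: <!ww> fails between '2' and 'e'
say so G.parse('2 e5', :rule<loose>);   # True: <.ws> consumes the space
```

The space right after the [ is harmless in both rules, because whitespace at the start of a group is one of the exceptions listed above.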

An important design goal in these rules is to never insert <.ws> somewhere that impedes Longest Token Matching. For example, consider rule foo:sym<ba> { [ bar | baz ] }, which is equivalent to token foo:sym<ba> { [ bar <.ws> | baz <.ws> ] <.ws> }. The default ws implementation is non-declarative (thanks to its use of <!ww>), meaning that it would break longest token matching either at the protoregex level, were it inserted at the start of the rule, or at the alternation level, were it inserted at the start of the group or after |.

Note that these rules only apply to rule, not to token and regex. They can be switched on at any point using :s and switched off using :!s in any of those, however (rule really just means "pretend there's a :s at the start").
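A small sketch of that equivalence, assuming the default ws (the grammar name Demo and its rule names are hypothetical, chosen only for this example):

```raku
grammar Demo {
    rule  as-rule   { 'a' 'b' }        # sigspace on:  'a' <.ws> 'b' <.ws>
    token as-token  { 'a' 'b' }        # sigspace off: pattern whitespace is ignored
    token with-s    { :s 'a' 'b' }     # :s switches sigspace on inside a token
    rule  without-s { :!s 'a' 'b' }    # :!s switches it off inside a rule
}

say so Demo.parse('a b', :rule<as-rule>);    # True
say so Demo.parse('ab',  :rule<as-rule>);    # False: <!ww> fails between 'a' and 'b'
say so Demo.parse('ab',  :rule<as-token>);   # True
say so Demo.parse('a b', :rule<as-token>);   # False: nothing consumes the space
say so Demo.parse('a b', :rule<with-s>);     # True
say so Demo.parse('ab',  :rule<without-s>);  # True
```

The atoms are quoted here only to keep the pattern whitespace unambiguous; the same behaviour applies to other atoms.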

Finally, the ws rule (which defaults to token ws { <!ww> \s* }) can be overridden in a grammar to define what whitespace means in the language being parsed.
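For instance, here is a hedged sketch of a grammar that overrides ws so that only horizontal whitespace counts. The grammar name Pairs, the token word, and the choice of \h* are all assumptions made for illustration, not anything from the question:

```raku
grammar Pairs {
    # As a rule, TOP is roughly: <word> <.ws> '=' <.ws> <word> <.ws>
    rule  TOP  { <word> '=' <word> }
    token word { \w+ }

    # Override: newlines no longer count as whitespace between tokens.
    token ws   { <!ww> \h* }
}

say so Pairs.parse("a = b");    # True
say so Pairs.parse("a =\nb");   # False: the overridden ws does not cross the newline
```

Every <.ws> that the rule keyword inserts calls this overridden token, so one definition changes whitespace handling throughout the grammar.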

Strive answered 20/2, 2018 at 23:57 Comment(3)
Just for the record, I think there is currently a bug in rakudo that always interprets <.ws> as non-declarative, even if you override it with a purely declarative token. - Bufford
@JonathanWorthington thanks! For clarification (perhaps I'm just bad with the Perl6 docs), what is the token foo:sym<ba> syntax? Specifically, is : an adverb here, or some kind of C++-like namespace? Similarly, what are the brackets <> doing in the token's name? Mille grazie. - Objection
@Objection It's protoregex syntax: a way of spreading members of an alternation over many tokens/rules instead (which further allows roles or subclasses to add to the alternation). See docs.perl6.org/syntax/Proto%20regexes.html for more. - Strive