Can * be used in sym tokens for more than one character?

The example for sym shows * (WhateverCode) standing in for a single symbol

grammar Foo {
    token TOP { <letter>+ }
    proto token letter {*}
    token letter:sym<P> { <sym> }
    token letter:sym<e> { <sym> }
    token letter:sym<r> { <sym> }
    token letter:sym<l> { <sym> }
    token letter:sym<*> {   .   }
}.parse("I ♥ Perl", actions => class {
    method TOP($/) { make $<letter>.grep(*.<sym>).join }
}).made.say; # OUTPUT: «Perl␤» 

It will, however, fail if we use it to stand in for a symbol composed of several letters:

grammar Foo {
    token TOP { <action>+ % " " }
    proto token action {*}
    token action:sym<come> { <sym> }
    token action:sym<bebe> { <sym> }
    token action:sym<*> { . }
}.parse("come bebe ama").say; # Nil

Since sym, by itself, does work with symbols with more than one character, how can we define a default sym token that matches a set of characters?

Staphylococcus answered 1/7, 2019 at 7:33

Can * be used in sym tokens for more than one character? ... The example for sym shows * (WhateverCode) standing in for a single symbol

It's not a WhateverCode or Whatever.[1]

The <...> in foo:sym<...> is a quote words constructor, so the ... is just a literal string.

That's why this works:

grammar g { proto token foo {*}; token foo:sym<*> { <sym> } }
say g.parse: '*', rule => 'foo'; # matches

As far as P6 is concerned, the * in foo:sym<*> is just a random string. It could be abracadabra. I presume the writer chose * to represent the mental concept of "whatever" because it happens to match the P6 concept Whatever. Perhaps they were being too cute.

For the rest of this answer I will write JJ instead of * wherever the latter is just an arbitrary string as far as P6 is concerned.


The * in the proto is a Whatever. But that's completely unrelated to your question:

grammar g { proto token foo {*}; token foo:sym<JJ> { '*' } }
say g.parse: '*', rule => 'foo'; # matches

In the body of a rule (tokens and regexes are rules) whose name includes a :sym<...> part, you can write <sym> and it will match the string between the angles of the :sym<...>:

grammar g { proto token foo {*}; token foo:sym<JJ> { <sym> } }
say g.parse: 'JJ', rule => 'foo'; # matches

But you can write anything you like in the rule/token/regex body. A . matches a single character:

grammar g { proto token foo {*}; token foo:sym<JJ> { . } }
say g.parse: '*', rule => 'foo'; # matches

It will, however, fail if we use it to stand in for a symbol composed of several letters

No. That's because you changed the grammar.

If you change the grammar back to the original coding (apart from the longer letter:sym<...>s) it works fine:

grammar Foo {
  token TOP { <letter>+ }
  proto token letter {*}
  token letter:sym<come> { <sym> }
  token letter:sym<bebe> { <sym> }
  token letter:sym<JJ> { . }
}.parse(
   "come bebe ama",
   actions => class { method TOP($/) { make $<letter>.grep(*.<sym>).join } })
 .made.say; # OUTPUT: «comebebe␤»

Note that in the original, the letter:sym<JJ> token is waiting in the wings to match any single character -- and that includes a single space, so it matches those and they're dealt with.

But in your modification you added a required space between tokens in the TOP token. That had two effects:

  • It matched the space after "come" and after "bebe";

  • After the "a" was matched by letter:sym<JJ>, the lack of a space between the "a" and "m" meant the overall match failed at that point.
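
Those two effects can be checked with a minimal sketch. The grammar name `Sep` and the spaced-out input string are assumptions made for illustration; the tokens mirror the grammar in the question:

```raku
# Hypothetical grammar name `Sep`; the tokens mirror the question's grammar.
grammar Sep {
    token TOP { <action>+ % " " }       # actions separated by exactly one space
    proto token action {*}
    token action:sym<come> { <sym> }
    token action:sym<bebe> { <sym> }
    token action:sym<JJ>   { . }        # the label is arbitrary; the body matches any one character
}

my $spaced   = Sep.parse("come bebe a m a");  # every item is followed by a space or the end
my $unspaced = Sep.parse("come bebe ama");    # "a" is followed by "m", not by a space

say $spaced.defined;    # True
say $unspaced.defined;  # False
```

With a space after every single-character item the parse succeeds; without one, the match stops after the first "a" and the overall parse fails.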

sym, by itself, does work with symbols with more than one character

Yes. All token foo:sym<bar> { ... } does is add:

  • A multiple dispatch alternative to foo;

  • A token sym, lexically scoped to the body of the foo token, that matches 'bar'.
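
A small sketch of both points, assuming a throwaway grammar named `G` (the name is illustrative only):

```raku
# Hypothetical grammar `G`, used only to show what :sym<bar> provides.
grammar G {
    proto token foo {*}
    token foo:sym<bar> { <sym> }   # <sym> matches the literal string 'bar'
}

my $match = G.parse('bar', rule => 'foo');
say ~$match<sym>;                           # bar
say G.parse('baz', rule => 'foo').defined;  # False
```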

how can we define a default sym token that matches a set of characters?

You can write such a sym token but, to be clear, because you don't want it to match a fixed string, it can't use <sym> in the body (because <sym> always matches a fixed string). If you still want to capture under the key sym, you could write $<sym>= in the token body, as Håkon showed in a comment under their answer. But it could equally be letter:whatever with $<sym>= in the body.
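
A minimal sketch of that suggestion, reusing the grammar from the question with the token from Håkon's comment. The variable name `$match` and the way the result is printed are assumptions for illustration:

```raku
grammar Foo {
    token TOP { <action>+ % " " }
    proto token action {*}
    token action:sym<come> { <sym> }
    token action:sym<bebe> { <sym> }
    # The name is only a label; the body captures any word under the key `sym`.
    token action:sym<default> { $<sym>=[\w+] }
}

my $match = Foo.parse("come bebe ama");
say $match<action>.map(*.<sym>.Str).join(",");   # come,bebe,ama
```

Every action now has a sym capture, so code that reads .<sym> works for the fixed words and for the default alike.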

I'm going to write it as a letter:default token to emphasize that it being :sym<something> doesn't make any difference. (As explained above, the :sym<something> is just about being an alternative, along with other :baz<...>s and :bar<...>s, with the only addition being that if it's :sym<something>, then it also makes a <sym> subrule available in the body of the associated rule, which, if used, matches the fixed string 'something'.)

The winning dispatch among all the rule foo:bar:baz:qux<...> alternatives is chosen according to LTM (longest token matching) logic among the rules starting with foo. So you need to write the default token so that it does not win as a longest token prefix but only matches if nothing else matches.

To immediately go to the back of the pack in an LTM race, insert a {} at the start of the rule body[2]:

token letter:default { {} \w+ }

Now, from the back of the pack, if this rule gets a chance it'll match with the \w+ pattern, which will stop the token when it hits a non-word character.

The bit about making it match if nothing else matches may mean listing it last. So:

grammar Foo {
  token TOP { <letter>+ % ' ' }
  proto token letter {*}
  token letter:sym<come> { <sym> }    # matches come
  token letter:sym<bebe> { <sym> }    # matches bebe
  token letter:boo       { {} \w**6 } # match 6 char string except eg comedy
  token letter:default   { {} \w+ }   # matches any other word
}.parse(
   "come bebe amap",
   actions => class { method TOP($/) { make $<letter>.grep(*.<sym>).join } })
 .made.say; # OUTPUT: «comebebe␤»

that just can't be the thing causing it ... "come bebe ama" shouldn't work in your grammar

The code had errors which I've now fixed and apologize for. If you run it you'll find it works as advertised.

But your comment prodded me to expand my answer. Hopefully it now properly answers your question.

Footnotes

[1] Not that any of this has anything to do with what's actually going on but... In P6 a * in "term position" (in English, where a noun belongs; in general programming lingo, where a value belongs) is a Whatever, not a WhateverCode. Even when * is written with an operator, e.g. +* or * + *, rather than on its own, the *s are still just Whatevers, but the compiler automatically turns most such combinations of one or more *s with one or more operators into a sub-class of Code called a WhateverCode. (Exceptions are listed in a table here.)

[2] See footnote 2 in my answer to SO "perl6 grammar , not sure about some syntax in an example".
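
The distinction drawn in the first footnote can be seen in a short sketch (the variable names are illustrative):

```raku
my $w  = *;        # a bare * in term position
my $wc = * + 1;    # a * combined with an operator

say $w.^name;      # Whatever
say $wc.^name;     # WhateverCode
say $wc(41);       # 42
```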

Endamage answered 1/7, 2019 at 12:25 Comment(4)
Now I am thoroughly confused. The only thing you've changed is the space separator, so that just can't be the thing causing it. Also, "come bebe ama" shouldn't work in your grammar, since you are not actually specifying any separator there. sym, by itself does not include a separator. - Staphylococcus
Hi JJ. In the original grammar, the "abracadabra" alternative :sym<...> had a body that matched ., which is any character, which includes a single space. So that :sym<...>, by itself, was a separator matcher (as part of being a default matcher). But while that explains what wasn't working, it doesn't give you a solution. I've updated my answer to better explain the problem and to provide a solution. If you're still confused, maybe leave it for a couple days. But please do let me know anything you're still confused about when you reread my answer when you come back to this grammar. TIA. - Endamage
OK, I'm getting it now, and starting to understand where I was confused. Since <.whatever> means skipping that capture, I thought . in this context meant skipping the character. Also, the character was actually skipped. So I'll have to check this again. Thanks for the clarification! - Staphylococcus
Ah, that makes sense. To be clear, it was still in the parse tree under .<letter>; it just wasn't under .<letter>.grep(*.<sym>). That's because the token body was just . rather than $<sym>=.. (And the others were able to just use <sym> rather than, say, $<sym>=come.) Anyhoo, thank you for your patience while I iterated toward a good answer and letting me know I'd made progress. :) - Endamage

The :sym<...> contents are for the reader of your program, not for the compiler, and are used to distinguish multi tokens of otherwise identical names.

It just so happened that programmers started to write grammars like this:

token operator:sym<+> { '+' }
token operator:sym<-> { '-' }
token operator:sym</> { '/' }

To avoid duplicating the symbols (here +, -, /), a special rule <sym> was introduced that matches whatever is inside :sym<...> as a literal, so you can write the above tokens as

token operator:sym<+> { <sym> }
token operator:sym<-> { <sym> }
token operator:sym</> { <sym> }

If you don't use <sym> inside the regex, you are free to write anything you want inside :sym<...>, so you can write something like

token operator:sym<fallback> { . }
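
Put together as a runnable sketch, assuming a hypothetical grammar name `Ops`, a made-up TOP token, and the input "+x-":

```raku
# Hypothetical wrapper grammar around the operator tokens shown above.
grammar Ops {
    token TOP { <operator>+ }
    proto token operator {*}
    token operator:sym<+> { <sym> }
    token operator:sym<-> { <sym> }
    token operator:sym</> { <sym> }
    token operator:sym<fallback> { . }   # label only; the body matches any one character
}

# Only the tokens that used <sym> in their body carry a `sym` capture.
my @kinds = Ops.parse("+x-")<operator>.map({ .<sym> ?? "literal" !! "fallback" });
say @kinds.join(" ");   # literal fallback literal
```
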
Phlegmy answered 6/7, 2019 at 7:49

Maybe like this:

grammar Foo {
    token TOP { <action>+ % " " }
    proto token action {*}
    token action:sym<come> { <sym> }
    token action:sym<bebe> { <sym> }
    token action:sym<default> { \w+ }
}.parse("come bebe ama").say;

Output:

「come bebe ama」
 action => 「come」
  sym => 「come」
 action => 「bebe」
  sym => 「bebe」
 action => 「ama」
Ergocalciferol answered 1/7, 2019 at 8:56 Comment(3)
Thanks for the answer, but it's not really an equivalent. You can't use <sym> within the token (because it will fail) or . to skip it as above. It will have to be processed specially, since we can neither access <sym> nor simply skip it. - Staphylococcus
@Staphylococcus Would token action:sym<default> { $<sym>=[\w+] } be of help? - Stilbite
It would be kind of closer, but still wouldn't skip via . if required. - Staphylococcus
