Raku Grammar: Use named regex without consuming matching string

I have a Raku grammar question that is probably easy to answer. I want to parse a log file and get its entries back one by one. A log entry can be a single line or a multi-line string.

My draft code looks like this:

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO ' | 'ERROR' }
    token logentry { .*? <.finish> }
    token finish { <.loglevel> || $ }
}

That works only for the first line: the loglevel at the start of the second line has already been consumed by the first entry's match, even though I used '.' inside the <...> call, which as far as I know means non-capturing.

Here is a log example:

INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Current process-name: C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.

What would be the right approach to get back the log entries even for multi line ones? Thanks!

Induce answered 22/5, 2020 at 12:9 Comment(0)

The .*? works but is inefficient.
It has to do a lot of backtracking.

To improve it you could use \N* which matches everything except a newline.

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { \N* \n }
}

That only matches single-line entries, so you would then have to add matching across newlines back in.

    token logentry {
      <logline>* %% \n
    }
    token logline { <!before \w> \N* }

This would work, but it still isn't great.


I would structure the grammar more like the thing you are trying to parse.

grammar Grammar::Entries {
    token TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logentry { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

I noticed that the continuation lines always start with four spaces, so we can require that prefix for anything that counts as a logline. This also handles the rest of the line that carries the log level, since in the sample the level is followed by four spaces as well.

I really don't like that you have a token with a plural name that only matches one thing.
I would rename logentries to logentry. Of course that means the existing logentry needs a new name as well.

grammar Grammar::Entries {
    token TOP { <logentry>+ }

    token logentry { <loglevel> <logdata> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token logdata { <logline>* }
    token logline { '    ' <(\N+)> \n? }
}

I also don't like the redundant log prefix on every token name.

grammar Grammar::Entries {
    token TOP { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data { <line>* }
    token line { '    ' <(\N+)> \n? }
}

So what this says is that a Grammar::Entries consists of at least one entry.
An entry starts with a level and ends with some data.
data consists of any number of lines.
A line starts with four spaces, followed by at least one non-newline character, and may end with a newline.
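
As a quick check of that reading, here is a minimal sketch that runs the grammar above against part of the sample log. It assumes the gap after the level and the indentation of continuation lines are four literal spaces (if the real file uses tabs, the line token would need adjusting). The variable names $log and $match are only for illustration.

```raku
grammar Grammar::Entries {
    token TOP   { <entry>+ }

    token entry { <level> <data> }
    token level { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token data  { <line>* }
    token line  { '    ' <(\N+)> \n? }
}

# Sample input adapted from the question. Assumption: the separator after
# the level and the continuation indent are four literal spaces.
my $log = q:to/END/;
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.
END

my $match = Grammar::Entries.parse($log);
die 'log did not parse' unless $match;

# Each entry holds its level plus every line that belongs to it;
# <( ... )> in the line token drops the indent and the trailing newline.
for $match<entry>.list -> $entry {
    say $entry<level>.Str, ' entry with ', $entry<data><line>.elems, ' line(s):';
    say "  $_" for $entry<data><line>.map(*.Str);
}
```

The first entry should come back with four lines (the rest of the header line plus three continuation lines) and the second with one.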


The point I'm trying to make is to structure the grammar the same way that the data is structured.

You could even go and add the structure for pulling out the information so that you don't have to do that as a second step.
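
A rough sketch of what that could look like follows. The field layout (timestamp, PID[...], TID[...]:, message) is inferred only from the sample lines in the question, so treat it as an assumption. The grammar name Grammar::DetailedEntries and the token names header, timestamp, pid, tid, message and detail are made up for this example.

```raku
# Hypothetical extension of the grammar; the field layout is inferred from
# the sample log lines only and is an assumption, not a documented format.
grammar Grammar::DetailedEntries {
    token TOP       { <entry>+ }

    token entry     { <header> <detail>* }
    token header    {
        <level> \h+ <timestamp> \h+
        'PID[' <pid> ']' \h+
        'TID[' <tid> ']:' \h+
        <message> \n?
    }
    token level     { 'DEBUG' | 'WARN' | 'INFO' | 'ERROR' }
    token timestamp { \S+ }
    token pid       { \d+ }
    token tid       { \d+ }
    token message   { \N* }
    # Continuation lines are indented; only the text after the indent is kept.
    token detail    { \h+ <( \N+ )> \n? }
}

my $log = q:to/END/;
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Session data:
    PID: 1234
    TID: 1234
    Session: 1
INFO    2020-01-22T11:07:38Z    PID[8528]   TID[6736]:  Clean up.
END

my $match = Grammar::DetailedEntries.parse($log);
die 'log did not parse' unless $match;

# Every field is already a named capture, so no second parsing step is needed.
for $match<entry>.list -> $entry {
    my $header = $entry<header>;
    say "level=$header<level> time=$header<timestamp> pid=$header<pid> tid=$header<tid>";
    say "  message: $header<message>";
    say "  detail:  $_" for $entry<detail>.map(*.Str);
}
```

Using \h+ between fields instead of a fixed number of spaces is a design choice here: the sample shows gaps of different widths, which may well be tabs in the original file.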

Circuitry answered 22/5, 2020 at 19:37 Comment(1)
Thanks @BradGilbert, very helpful. I started with a top-down approach but failed early, so I thought I'd try again, this time in smaller steps. The first one is to just get all log entries and then proceed further. - Induce

"as far as I know <.loglevel> means non-capturing."

It does mean non-capturing (the match is not held onto, so code cannot access it later), but it still matches and advances the match position.

What you want to do is match without advancing the match position, a so-called "zero-width assertion". I haven't tested this but expect it to work (famous last words):

grammar Grammar::Entries {
    rule TOP { <logentries>+ }

    token logentries { <loglevel> <logentry> }
    token loglevel { 'DEBUG' | 'WARN' | 'INFO ' | 'ERROR' }
    token logentry { .*? <.finish> }
    token finish { <?loglevel> || $ }     # <-- the change
}
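
To make the difference concrete, here is a small self-contained sketch. The grammar Demo and its tokens consuming and peeking are hypothetical and exist only for this comparison. It shows that <.level> moves the match position past the level while <?level> leaves it where it was.

```raku
# Hypothetical demo grammar, not part of the question's code.
grammar Demo {
    token level     { 'INFO' | 'WARN' }
    token consuming { \w+ \s <.level> }   # matches 'WARN' and moves past it
    token peeking   { \w+ \s <?level> }   # only checks that 'WARN' comes next
}

my $consumed = Demo.subparse('INFO WARN', :rule<consuming>);
my $peeked   = Demo.subparse('INFO WARN', :rule<peeking>);

say $consumed.to;   # 9: the whole string was consumed
say $peeked.to;     # 5: stopped right before 'WARN'
```

Neither form stores a level capture in the match; only the end position differs.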
Represent answered 22/5, 2020 at 18:17 Comment(4)
While that does prevent loglevel from advancing the index, it leaves behind a before in $/. So what you want to do is use <.before…> or <?before…>. But at that point why not just use /<?loglevel>/, which does the same thing? - Circuitry
Thanks Brad. Fixed. Though I much prefer your answer. - Represent
Thanks raiph. Of course you are right: non-capturing is not the same as a zero-width assertion. That does ring a bell. However, I will go with Brad's answer because it is a bit more elaborate. - Induce
Yeah, I tend to answer the question I think should have been asked, rather than the one that was actually asked. - Circuitry
