How can I define a Raku grammar to parse TSV text?

I have some TSV data:

ID     Name    Email
   1   test    test@email.com
 321   stan    stan@nowhere.net

I would like to parse this into a list of hashes:

@entities[0]<Name> eq "test";
@entities[1]<Email> eq "stan@nowhere.net";

I'm having trouble using the newline metacharacter to delimit the header row from the value rows. My grammar definition:

use v6;

grammar Parser {
    token TOP       { <headerRow><valueRow>+ }
    token headerRow { [\s*<header>]+\n }
    token header    { \S+ }
    token valueRow  { [\s*<value>]+\n? }
    token value     { \S+ }
}

my $dat = q:to/EOF/;
ID     Name    Email
   1   test    test@email.com
 321   stan    stan@nowhere.net
EOF
say Parser.parse($dat);

But this is returning Nil. I think I'm misunderstanding something fundamental about regexes in Raku.

Shaeshaef answered 3/3, 2020 at 15:35 Comment(2)
Nil. It's pretty barren as far as feedback goes, right? For debugging, download the Comma IDE if you haven't already, and/or see How can error reporting in grammars be improved?. You got Nil because your pattern assumed backtracking semantics. See my answer about that. I recommend you eschew backtracking. See @user0721090601's answer about that. For sheer practicality and speed, see JJ's answer. Also, Introductory general answer to "I want to parse X with Raku. Can anyone help?". – Arctic
use Grammar::Tracer; # works for me – Lytton

Probably the main thing throwing it off is that \s matches both horizontal and vertical whitespace, so it happily eats your newlines. To match just horizontal whitespace, use \h, and to match just vertical whitespace, \v.
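
For example, a quick throwaway sketch of the difference:

say so "a b"  ~~ / a \h b /;   # True
say so "a\nb" ~~ / a \h b /;   # False: \h does not match the newline
say so "a\nb" ~~ / a \v b /;   # True:  \v matches only vertical whitespace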

One small recommendation I'd make is to avoid including the newlines in the token. You might also want to use the modified quantifiers % or %%, as they're designed for exactly this kind of separated-list work. With %, the separator must sit between matched elements; %% additionally allows a trailing separator. A quick sketch (a throwaway example) of the difference:
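
say 'a,b,c'  ~~ / <[abc]>+ %  ',' /;   # 「a,b,c」
say 'a,b,c,' ~~ / <[abc]>+ %  ',' /;   # 「a,b,c」  (the trailing comma is left unmatched)
say 'a,b,c,' ~~ / <[abc]>+ %% ',' /;   # 「a,b,c,」 (%% also consumes a trailing separator)

With \h handling the spacing and %% handling the separators, the grammar becomes: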

grammar Parser {
    token TOP       { 
                      <headerRow>     \n
                      <valueRow>+ %%  \n
                    }
    token headerRow { <.ws>* %% <header> }
    token valueRow  { <.ws>* %% <value>  }
    token header    { \S+ }
    token value     { \S+ }
    token ws        { \h* }
} 

The result of Parser.parse($dat) for this is the following:

「ID     Name    Email
   1   test    test@email.com
 321   stan    stan@nowhere.net
」
 headerRow => 「ID     Name    Email」
  header => 「ID」
  header => 「Name」
  header => 「Email」
 valueRow => 「   1   test    test@email.com」
  value => 「1」
  value => 「test」
  value => 「test@email.com」
 valueRow => 「 321   stan    stan@nowhere.net」
  value => 「321」
  value => 「stan」
  value => 「stan@nowhere.net」
 valueRow => 「」

which shows us that the grammar has successfully parsed everything. However, let's focus on the second part of your question: you want the result available in a variable. To do that, you'll need to supply an actions class, which for this project is very simple. You just make a class whose methods match the names of your grammar's tokens (tokens like value and header, which don't require special processing beyond stringification, can be ignored). There are more creative/compact ways to handle the processing, but I'll go with a fairly rudimentary approach for illustration. Here's our class:

class ParserActions {
  method headerRow ($/) { ... }
  method valueRow  ($/) { ... }
  method TOP       ($/) { ... }
}

Each method has the signature ($/), which binds the match variable. So now, let's ask what information we want from each token. In the header row, we want each of the header values. So:

  method headerRow ($/) {
    my @headers = $<header>.map: *.Str;
    make @headers;
  }

Any token with a quantifier on it will be treated as a Positional, so we could also access each individual header match with $<header>[0], $<header>[1], etc. But those are Match objects, so we just quickly stringify them. The make command allows other tokens to access this special data that we've created.
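
You can see this Positional behavior from outside the actions class too. A quick sketch against the grammar above:

my $m = Parser.parse($dat);
say $m<headerRow><header>[1];       # 「Name」 (still a Match object)
say $m<headerRow><header>[1].Str;   # Name    (a plain string)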

Our valueRow method looks nearly identical, because there the $<value> tokens are what we care about.

  method valueRow ($/) {
    my @values = $<value>.map: *.Str;
    make @values;
  }

When we get to the last method, we want to create the array of hashes.

  method TOP ($/) {
    my @entries;
    my @headers = $<headerRow>.made;
    my @rows    = $<valueRow>.map: *.made;

    for @rows -> @values {
      my %entry = flat @headers Z @values;
      @entries.push: %entry;
    }

    make @entries;
  }

Here you can see how we access the stuff we processed in headerRow() and valueRow(): you use the .made method. Because there are multiple valueRows, to get each of their made values we need to do a map. (This is a situation where I tend to write my grammar to have simply <header><data>, and define data as being multiple rows, but this is simple enough that it's not too bad.)

Now that we have the headers and rows in two arrays, it's simply a matter of making them an array of hashes, which we do in the for loop. The flat @x Z @y just interleaves the elements, and the hash assignment Does What We Mean, but there are other ways to build the hashes you want.
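
Here's a quick sketch of that zip in isolation (hypothetical values, not tied to the grammar):

my @headers = <ID Name Email>;
my @values  = '1', 'test', 'test@email.com';
my %entry = flat @headers Z @values;   # ID => 1, Name => test, Email => test@email.com
say %entry<Name>;                      # test
# The same pairing, spelled with the Z=> zip-to-pairs metaoperator:
my %entry2 = @headers Z=> @values;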

Once you're done, you just make it, and then it will be available in the .made of the parse:

say Parser.parse($dat, :actions(ParserActions)).made
-> [{Email => test@email.com, ID => 1, Name => test} {Email => stan@nowhere.net, ID => 321, Name => stan} {}]

It's fairly common to wrap these steps up in a sub, like

sub parse-tsv($tsv) {
  return Parser.parse($tsv, :actions(ParserActions)).made
}

That way you can just say

my @entries = parse-tsv($dat);
say @entries[0]<Name>;    # test
say @entries[1]<Email>;   # stan@nowhere.net
Doriandoric answered 3/3, 2020 at 16:55 Comment(4)
I think I would write the actions class differently: class Actions { has @!header; method headerRow ($/) { @!header = @<header>.map(~*); make @!header.List; }; method valueRow ($/) { make (@!header Z=> @<value>.map: ~*).Map }; method TOP ($/) { make @<valueRow>.map(*.made).List } }. You would of course have to instantiate it first: :actions(Actions.new). – Enrage
@BradGilbert yeah, I tend to write my actions classes to avoid instantiation, but if instantiating, I'd probably do class Actions { has @!header; has %!entries … } and just have the valueRow add the entries directly, so that you end up with just method TOP ($/) { make %!entries }. But this is Raku after all and TIMTOWTDI :-) – Doriandoric
From reading this info (docs.raku.org/language/regexes#Modified_quantifier:_%,_%%), I think that I understand <valueRow>+ %% \n (capture rows that are delimited by newlines), but following that logic, <.ws>* %% <header> would be "capture optional whitespace that is delimited by non-whitespace". Am I missing something? – Datary
@ChristopherBottoms almost. The <.ws> doesn't capture (<ws> would). The OP noted that the TSV format may begin with an optional whitespace. In reality, this would probably be even better defined with a line-spacing token defined as \h*\n\h*, which would allow for the valueRow to be defined more logically as <value>+ % <.ws> – Doriandoric

TL;DR: you don't. Just use Text::CSV, which is able to deal with every format.

I'll show how the good old Text::CSV module will probably be useful:

use Text::CSV;

my $text = q:to/EOF/;
ID  Name    Email
   1    test    test@email.com
 321    stan    stan@nowhere.net
EOF
my @data = $text.lines.map: *.split(/\t/).list;

say @data.perl;

my $csv = csv( in => @data, key => "ID");

print $csv.perl;

The key part here is the data munging that converts the initial text into an array of arrays (stored in @data). It's only needed, however, because the csv routine is not able to deal with strings; if the data is in a file, you're good to go.

The last line will print:

${"   1" => ${:Email("test\@email.com"), :ID("   1"), :Name("test")}, " 321" => ${:Email("stan\@nowhere.net"), :ID(" 321"), :Name("stan")}}%

The ID field becomes the key to each entry, and the whole thing is a hash of hashes.
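
And if your data already lives in a file, you can skip the munging entirely. A sketch, assuming a hypothetical file data.tsv and that the csv routine's sep option behaves as in its Perl counterpart:

use Text::CSV;

# Read the TSV straight from a file, keyed by the ID column.
my $csv = csv( in => "data.tsv", sep => "\t", key => "ID" );
say $csv{' 321'}<Name>;   # stan (hypothetical lookup; note the key keeps its leading space, as in the output above)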

Dirkdirks answered 3/3, 2020 at 18:21 Comment(2)
Upvoting because of practicality. I'm not sure, though, if the OP is aiming more to learn grammars (my answer's approach) or just needing to parse (your answer's approach). In either case, they should be good to go :-) – Doriandoric
Upvoted for the same reason. :) I had thought the OP might be aiming to learn what they'd done wrong in terms of regex semantics (hence my answer), aiming to learn how to do it right (your answer), or just needing to parse (JJ's answer). Team work. :) – Arctic

TL;DR: regexes backtrack, tokens don't. That's why your pattern isn't matching. This answer focuses on explaining that, and on how to trivially fix your grammar. However, you should probably rewrite it, or use an existing parser, which is what you should definitely do if you just want to parse TSV rather than learn about Raku regexes.

A fundamental misunderstanding?

I think I'm misunderstanding something fundamental about regexes in Raku.

(If you already know the term "regexes" is a highly ambiguous one, consider skipping this section.)

One fundamental thing you may be misunderstanding is the meaning of the word "regexes". Here are some popular meanings folk assume:

  • Formal regular expressions.

  • Perl regexes.

  • Perl Compatible Regular Expressions (PCRE).

  • Text pattern matching expressions called "regexes" that look like any of the above and do something similar.

None of these meanings are compatible with each other.

While Perl regexes are semantically a superset of formal regular expressions, they are far more useful in many ways but also more vulnerable to pathological backtracking.

While Perl Compatible Regular Expressions are compatible with Perl in the sense that they were originally the same as standard Perl regexes in the late 1990s, and in the sense that Perl supports pluggable regex engines including the PCRE engine, PCRE regex syntax is not identical to the standard Perl regex syntax that Perl uses by default in 2020.

And while text pattern matching expressions called "regexes" generally do look somewhat like each other, and do all match text, there are dozens, perhaps hundreds, of variations in syntax, and even in semantics for the same syntax.

Raku text pattern matching expressions are typically called either "rules" or "regexes". The use of the term "regexes" conveys the fact that they look somewhat like other regexes (although the syntax has been cleaned up). The term "rules" conveys the fact they are part of a much broader set of features and tools that scale up to parsing (and beyond).

The quick fix

With the above fundamental aspect of the word "regexes" out of the way, I can now turn to the fundamental aspect of your "regex"'s behavior.

If we switch three of the patterns in your grammar from the token declarator to the regex declarator, your grammar works as you intended:

grammar Parser {
    regex TOP       { <headerRow><valueRow>+ }
    regex headerRow { [\s*<header>]+\n }
    token header    { \S+ }
    regex valueRow  { [\s*<value>]+\n? }
    token value     { \S+ }
}

The sole difference between a token and a regex is that a regex backtracks whereas a token doesn't. Thus:

say 'ab' ~~ regex { [ \s* a  ]+ b } # 「ab」
say 'ab' ~~ token { [ \s* a  ]+ b } # 「ab」
say 'ab' ~~ regex { [ \s* \S ]+ b } # 「ab」
say 'ab' ~~ token { [ \s* \S ]+ b } # Nil

During processing of the last pattern (which could be, and often is, called a "regex", but whose actual declarator is token, not regex), the \S will swallow the 'b', just as it temporarily did during processing of the regex on the prior line. But because the pattern is declared as a token, the rules engine (aka "regex engine") does not backtrack, so the overall match fails.

That's what's going on in your OP.

The right fix

A better solution in general is to wean yourself off assuming backtracking behavior, because matching can be slow, even catastrophically slow (indistinguishable from the program hanging), when matching against a maliciously constructed string or one with an accidentally unfortunate combination of characters.
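
Here's a sketch of the pathological case (throwaway code; the regex line is commented out because, with enough a's, it can appear to hang):

# With backtracking semantics, every way of splitting the a's between the
# two nested quantifiers is retried before the match finally fails, so the
# time grows exponentially with the number of a's. Ratcheting fails fast.
my $s = ('a' x 30) ~ '!';
say so $s ~~ token { [ a+ ]+ b };    # False, and fast
#say so $s ~~ regex { [ a+ ]+ b };   # also False, eventually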

Sometimes regexes are appropriate. For example, if you're writing a one-off and a regex does the job, then you're done. That's fine. That's part of the reason that the / ... / syntax in Raku declares a backtracking pattern, just like regex. (Then again, you can write / :r ... / if you want to switch on ratcheting -- "ratchet" means the opposite of "backtrack", so :r switches a regex to token semantics.)
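
A quick sketch of that last point, reusing the earlier example:

say 'ab' ~~ /    [ \s* \S ]+ b /;   # 「ab」 (plain / ... / backtracks, like regex)
say 'ab' ~~ / :r [ \s* \S ]+ b /;   # Nil   (:r ratchets, giving token semantics)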

Occasionally backtracking still has a role in a parsing context. For example, while the grammar for Raku generally eschews backtracking, and instead has hundreds of rules and tokens, it nevertheless still has 3 regexes.


I've upvoted @user0721090601++'s answer because it's useful. It also addresses several things that immediately seemed to me to be idiomatically off in your code, and, importantly, sticks to tokens. It may well be the answer you prefer, which will be cool.

Arctic answered 3/3, 2020 at 17:30 Comment(0)
