A Raku grammar token doesn't match the first occurrences in a document but matches the similar occurrences that follow
I want to process the whole Tanach file, in Hebrew. For that, I chose Raku because of some of its features (grammars and Unicode support).

So, I defined some tokens to select the relevant data.

grammar HEB {
        token TOP {'<hebrewname>'<t_word>'</hebrewname>'}
        token t_word {<graph>+}
};

grammar CHA {
        token TOP {'<c n="'<t_number>'">'}
        token t_number {\d+}
};

grammar VER {
        token TOP {'<v n="'<t_number>'">'}
        token t_number {\d+}
};

grammar WOR {
        token TOP {'<w>'<t_word>'</w>'}
        token t_word {<graph>+}
};

Here is a very small part of the document (the Tanach in XML format), which is sufficient to show the problem:

<names>
<name>Genesis</name>
<abbrev>Gen</abbrev>
<number>1</number>
<filename>Genesis</filename>
<hebrewname>בראשית</hebrewname>
</names>
<c n="1">
<v n="1">
<w>בְּ/רֵאשִׁ֖ית</w>
<w>בָּרָ֣א</w>
<w>אֱלֹהִ֑ים</w>
<w>אֵ֥ת</w>
<w>הַ/שָּׁמַ֖יִם</w>
<w>וְ/אֵ֥ת</w>
<w>הָ/אָֽרֶץ׃</w>
</v>
<v n="2">
<w>וְ/הָ/אָ֗רֶץ</w>
<w>הָיְתָ֥ה</w>
<w>תֹ֙הוּ֙</w>
<w>וָ/בֹ֔הוּ</w>
<w>וְ/חֹ֖שֶׁךְ</w>
<w>עַל־</w>
<w>פְּנֵ֣י</w>
<w>תְה֑וֹם</w>
<w>וְ/ר֣וּחַ</w>
<w>אֱלֹהִ֔ים</w>
<w>מְרַחֶ֖פֶת</w>
<w>עַל־</w>
<w>פְּנֵ֥י</w>
<w>הַ/מָּֽיִם׃</w>
</v>

The problem is that the code doesn't recognize the first two words (<w>בְּ/רֵאשִׁ֖ית</w> <w>בָּרָ֣א</w>) but seems to work fine with the following words. Could somebody explain to me what's wrong?

The main loop is:

for $file_in.lines -> $line {
    $memline = $line.trim;

    if HEB.parse($memline) {
          say "hebrew name of book is "~ $/<t_word>;
          next;
    }
    if CHA.parse($memline) {
        say "chapitre number is "~ $/<t_number>;
        next;
    }
    if VER.parse($memline) {
        say "verse number is "~ $/<t_number>;
        next;
    }
    if WOR.parse($memline) {
        $computed_word_value = 0;
        say "word is "~ $/<t_word>;
        $file_out.print("$/<t_word>");
        say "numbers of graphemes of word is "~ $/<t_word>.chars;
        @exploded_word = $/<t_word>.comb;
        for @exploded_word {
                say $_.uniname;
        };
        next;
    }
    say "not processed";
}

Output:

Please note that after "verse number is 1", the first two words are not processed. Ignore the distorted Hebrew (it comes from the Windows console).

not processed
not processed
not processed
not processed
not processed
hebrew name of book is ׳‘׳¨׳׳©׳™׳×
not processed
chapitre number is 1
verse number is 1
not processed
not processed
word is ׳ײ±׳œײ¹׳”ײ´ײ‘׳™׳
numbers of graphemes of word is 5
HEBREW LETTER ALEF
HEBREW LETTER LAMED
HEBREW LETTER HE
HEBREW LETTER YOD
HEBREW LETTER FINAL MEM
word is ׳ײµײ¥׳×
numbers of graphemes of word is 2
HEBREW LETTER ALEF
HEBREW LETTER TAV
not processed
word is ׳•ײ°/׳ײµײ¥׳×
numbers of graphemes of word is 4
HEBREW LETTER VAV
SOLIDUS

I hope that my question is clearly exposed.

Frei answered 18/2, 2021 at 19:22 Comment(4)
Hi @J, I pasted your code into a REPL and got the error "Variable '$memline' is not declared". Please review minimal reproducible example. Thanks. - Boudoir
Thanks for your remarks. In fact, I showed only the problematic part of the code. The variables are declared. - Frei
Thanks for replying. You may think your question is adequate but the standard SO response to a question like yours would be to close it to avoid wasting everyone's time. But you are new to SO, and, in addition, we Rakoons are friendly and generous with our time. So we let it be and Brad tried. (Yet he still failed to reproduce your problem, which is very rare when we try.) You said you hoped "my question is clearly exposed". It clearly is not. Please read the page I linked, which is StackOverflow's standard guidance to all who would like to do a good job asking a question, and reconsider. - Boudoir
Instead of $memline = $line.trim; you can use .subparse(), as in: if VER.subparse($line) { though it will match anywhere in a line. - Isocyanide

I can't reproduce your problem.
About the only thing I can guess is that you didn't open the file with the correct encoding.

Or worse, you are getting the file from STDIN and don't have the proper codepage selected. (Which makes sense since your output is also mojibake.)
Rakudo doesn't really do codepages, so if you don't set your environment to UTF-8 you have to change the encoding of $*IN (and $*OUT) to match whatever it is.
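As a rough sketch of what making the encoding explicit could look like, assuming the data really is UTF-8 (the temporary file name below is hypothetical and only there so the snippet runs on its own):

```raku
# Assumption: the input is UTF-8. The file name is hypothetical.
my $path = $*TMPDIR.add('tanach-sample.xml');
$path.spurt("<hebrewname>בראשית</hebrewname>\n", :enc<utf8>);

# Name the encoding explicitly when opening the file for reading.
my $file_in = open $path, :r, :enc<utf8>;
my @lines   = $file_in.lines;
$file_in.close;

# For piped input and console output, set the encoding on the standard handles.
$*IN.encoding('utf8');
$*OUT.encoding('utf8');

say @lines[0];
$path.unlink;
```

On a Windows console the terminal itself also has to be in a UTF-8 code page (chcp 65001) for the output to display correctly; that part is outside of Raku.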


I'm now going to pretend that you posted to CodeReview.StackExchange.com instead.

First, I don't know why you are creating a whole grammar for something so small, which could easily be done with simple regexes.

my token HEB {
  '<hebrewname>'
  $<t_word> = [<.graph>+]
  '</hebrewname>'
}
my token CHA {
  '<c n="' $<t_number> = [\d+] '">'
}
my token VER {
  '<v n="' $<t_number> = [\d+] '">'
}
my token WOR {
  '<w>' $<t_word> = [<.graph>+] '</w>'
}

Honestly that is still more than you seem to need, as you only deal with one element per regex.

That's also ignoring that I really dislike the names you are giving the elements, like t_word and t_number. The prefix is pointless: the captures live inside of $/, and Grammar doesn't have any similarly named methods, so there is no chance of them interfering with any other namespace. Give them descriptive names if you must give them names.

You can just restrict $/ to only stringifying to the part you care about with <(…)>. (It works here because you are only capturing one thing.)

<( means ignore everything before, and )> means ignore everything after.

my token HEB {
  '<hebrewname>'
  <( <.graph>+ )> # $/ will contain only what <.graph>+ matches
  '</hebrewname>'
}
my token CHA {
  '<c n="' <( \d+ )> '">'
}
my token VER {
  '<v n="' <( \d+ )> '">'
}
my token WOR {
  '<w>' <( <.graph>+ )> '</w>'
}
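
To see the effect of the capture markers in isolation, here is a small sketch (the sample line and variable names are made up for illustration):

```raku
my $line = '<v n="12">';

# Without <( )>, the match object covers the entire tag.
my $whole = $line ~~ / '<v n="' \d+ '">' /;

# With <( )>, the surrounding text must still match,
# but only the digits end up in the match object.
my $inner = $line ~~ / '<v n="' <( \d+ )> '">' /;

say ~$whole;
say ~$inner;
```

The first say prints the whole tag and the second prints only 12, so the caller can use ~$/ or +$/ directly without a named capture.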

You are parsing it as if it was just a line oriented file.
Which does make a certain amount of sense as it is formatted as one, and that results in less memory usage.

Using named regexes for that, let alone whole grammars, is a bit overkill. It also separates the logic when that isn't really necessary for such simple matches.

Here is how I would parse that file in a line oriented fashion:

my $in-names = False;
my %names;
my @chapters;
my @verses;
my @current-verse;

for $file_in.lines {
  when /'<names>' / { $in-names = True  }
  when /'</names>'/ { $in-names = False }

  # chapter
  when /'<c n="' <( \d+ )> '">'/ {
    @verses := @chapters[ +$/ - 1 ] //= [];
  }
  when /'</c>'/ {
    # finalize this chapter
    # for example print out statistics
    # (only needed if you don't want `default` to catch it)
  }

  # verse
  when /'<v n="' <( \d+ )> '">'/ {
    @current-verse := @verses[ +$/ - 1 ] //= [];
  }
  when /'</v>'/ {
    # finalize this verse
  }

  # word
  when /'<w>' <( <.graph>+ )> '</w>'/ {
    push @current-verse, ~$/;
  }

  # name tags
  # must be after more specific regexes
  when /'<' <tag=.ident> '>' $<value> = [<.ident>|\d+] {} "</$<tag>>"/ {
    if $in-names {
      %names{~$<tag>} = ~$<value>
    } else {
      note "not handling $<tag> => $<value> outside of <names>"
    }
  }

  default { note "unexpected text '$_'" }
}

Note that when makes it so that you don't have to do next.
And since we just use $_ instead of $line, it makes it so that we can just use regexes directly as the condition of those when statements.
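
A stripped-down sketch of that pattern, using three made-up input lines and collecting the messages in a hypothetical @log array instead of building the data structure:

```raku
my @log;
for '<c n="1">', '<v n="2">', 'something else' {
    # Each `when` smartmatches its regex against $_. On success its block
    # runs and the loop moves straight on to the next item.
    when / '<c n="' <( \d+ )> '">' / { @log.push: "chapter $/" }
    when / '<v n="' <( \d+ )> '">' / { @log.push: "verse $/" }
    default                          { @log.push: "unexpected '$_'" }
}
.say for @log;
```

Only one branch runs per line, which is why no explicit next is needed.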

I'm not bothering to use ^ or $ so there is no need to either trim or use ^\s* and \s*$.
It does make it a bit more fragile, so you may want to change it if it becomes a problem.
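
If that fragility does become a problem, an anchored variant might look like the following sketch. The two sample lines are invented to show the difference; the regex names are mine.

```raku
my $indented = '  <w>את</w>  ';
my $trailing = '<w>את</w> trailing text';

# Unanchored: succeeds as long as the element appears somewhere in the line.
my $loose  = / '<w>' <( <.graph>+ )> '</w>' /;
# Anchored: only optional whitespace may surround the element.
my $strict = / ^ \s* '<w>' <( <.graph>+ )> '</w>' \s* $ /;

say ?($indented ~~ $loose),  ' ', ?($indented ~~ $strict);
say ?($trailing ~~ $loose),  ' ', ?($trailing ~~ $strict);
```

The loose regex accepts both lines, while the strict one rejects the line with extra text after the element.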

If you really want to just do simple line processing like you're doing, I'm sure you can alter the above to suit your needs.

I wanted to make this more useful to people who come across this in the future. So I created a data structure from the file instead of following what you were doing.


Really I probably only would have reached for a grammar if I were going to .parse() the entire file in one go.

This is what such a grammar would look like.

grammar Book {
  rule TOP {
    <names>
    <chapter> +
    # note that there needs to be a space between <chapter> and +
    # so that whitespace can be between <c…>…</c> elements
  }

  rule names {
    '<names>'  ~  '</names>'
    <name> +
  }

  token name {
    '<' <tag=.ident> '>'
    $<name> = [<.ident>|\d+]
    {}
    "</$<tag>>"
  }

  rule chapter {
    # note space before ]
    ['<c n="' <number> '">' ]  ~  '</c>'
    <verse> +
  }
  rule verse {
    ['<v n="' <number> '">' ]  ~  '</v>'
    <word> +
  }

  token number { \d+ }
  token word { '<w>' <( <.graph>+ )> '</w>' }
}
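
The ~ used in the names, chapter and verse rules is the regex nesting operator: it takes the opening delimiter on its left and the closing delimiter on its right, followed by the content that sits between them. A tiny sketch on a made-up element:

```raku
my $text = '<c>12</c>';

# Written with ~ : opener, closer, then the content between them.
my $nested = $text ~~ / '<c>' ~ '</c>' (\d+) /;

# The same match written in plain left-to-right order.
my $linear = $text ~~ / '<c>' (\d+) '</c>' /;

say ~$nested;      # the whole element
say ~$nested[0];   # the captured digits
say ~$linear[0];
```

Both forms match the same text; the ~ form just keeps the two delimiters visually together.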

To do processing similar to what you have been doing, you can add an actions class:

class Line-Actions {
  has IO::Handle:D $.file-out is required;
  has $!number-type is default<chapter>;

  method name ($/) {
    if $<tag> eq 'hebrewname' {
      say "hebrew name of book is $<name>";
    }
  }

  # note that .chapter and .verse will run at the end
  # of parsing them, which is too late for when .word is processed
  # so we do it in .number instead
  method number ($/) {
    say "$!number-type number is $/";
    $!number-type = 'verse';
  }
  method chapter ($/) {
    # reset to default of "chapter"
    # as the next .number will be for the next chapter
    $!number-type = Nil;
  }

  method word ($/) {
    say "word is $/";
    $!file-out.print(~$/);
    say "number of graphemes in word is $/.chars()";
    .say for "$/".comb.map: *.uninames.join(', ');
  }
}


# the default constructor only takes named arguments
Book.parsefile(
  $filename,
  actions => Line-Actions.new( file-out => 'outfile.txt'.IO.open(:w) )
);
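
For anyone new to actions classes, here is a much smaller sketch of the same mechanism. The Verse grammar and Verse-Actions class are hypothetical and exist only for this illustration.

```raku
grammar Verse {
    token TOP    { '<v n="' <number> '">' }
    token number { \d+ }
}

class Verse-Actions {
    has @.numbers;   # filled in while parsing

    # Called each time the `number` token matches successfully.
    method number ($/) { @!numbers.push: +$/ }
}

my $actions = Verse-Actions.new;   # any attributes would be passed by name
Verse.parse('<v n="7">', :$actions);
say $actions.numbers;
```

Each method in the actions object is named after the token or rule it reacts to, and it receives that rule's match as $/.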
Isocyanide answered 19/2, 2021 at 4:38 Comment(2)
Thanks for your answer and for the time you gave me. I need some time to study it. In fact, I began with regexes, but I thought that a grammar and its features could help me write more compact code. - Frei
@Frei A grammar is just a bunch of regexes put together. The only thing it does is spread out the complexity into a bunch of regexes instead of a single one. It is similar to making a class to deal with complexity. If your regex code isn't complex, turning it into a grammar actually makes it more complex. It really doesn't do much to make it more compact. Notice how short VER becomes when it is a single regex /'<v n="'<(\d+)>'">'/. - Isocyanide

Your parsing problem seems to be somewhat confined to the example text you posted, as there appear to be forward-slashes ("solidus" characters) embedded within the snippet of Hebrew text you provided.

The script you provided was easy to fix up, and I reworked the WOR token in your Raku script to match only characters in the Hebrew script, using the Unicode property <:Script<Hebrew>>. While this may help with stray or embedded "solidus" characters (and other non-Hebrew characters), presumably you could rewrite the script to parse faster. Here's the script:

grammar HEB {
        token TOP {'<hebrewname>'<t_word>'</hebrewname>'}
        token t_word {<graph>+}
};

grammar CHA {
        token TOP {'<c n="'<t_number>'">'}
        token t_number {\d+}
};

grammar VER {
        token TOP {'<v n="'<t_number>'">'}
        token t_number {\d+}
};

grammar WOR {
        token TOP {'<w>'<t_word>'</w>'}
        token t_word {<:Script<Hebrew>>+}
};

for $*ARGFILES.lines -> $line {
    my $memline = $line.trim;

    if HEB.parse($memline) {
          say "hebrew name of book is "~ $/<t_word>;
          next;
    }
    if CHA.parse($memline) {
        say "chapitre number is "~ $/<t_number>;
        next;
    }
    if VER.parse($memline) {
        say "verse number is "~ $/<t_number>;
        next;
    }
    if WOR.parse($memline) {
        say "word is "~ $/<t_word>;
        say "numbers of graphemes of word is "~ $/<t_word>.chars;
        my @exploded_word = $/<t_word>.comb;
        for @exploded_word {
                say $_.uniname, ": ", $_;
        };
        next;
    }
    say "not processed";
}
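
As a quick check of what the <:Script<Hebrew>> property accepts, here is a small sketch; the test strings and the variable name are made up for illustration.

```raku
# Matches only if every character belongs to the Hebrew script.
my $hebrew-only = / ^ <:Script<Hebrew>>+ $ /;

my $plain   = ?('בראשית'  ~~ $hebrew-only);   # Hebrew letters only
my $latin   = ?('Genesis' ~~ $hebrew-only);   # Latin letters
my $slashed = ?('ו/את'    ~~ $hebrew-only);   # contains a solidus

say "plain: $plain, latin: $latin, slashed: $slashed";
```

Only the first string matches. The solidus belongs to the Common script rather than Hebrew, so a word containing one is rejected as a whole by this token.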

Starting with a new test file, I was able to get 124655/126663 lines of the following XML text to parse:

http://www.tanach.us/Books/Genesis.xml

Below is the parsed text from lines 103-119 (words which previously had given you problems):

hebrew name of book is בראשית
not processed
chapitre number is 1
verse number is 1
word is בְּרֵאשִׁ֖ית
numbers of graphemes of word is 6
HEBREW LETTER BET: בְּ
HEBREW LETTER RESH: רֵ
HEBREW LETTER ALEF: א
HEBREW LETTER SHIN: שִׁ֖
HEBREW LETTER YOD: י
HEBREW LETTER TAV: ת
word is בָּרָ֣א
numbers of graphemes of word is 3
HEBREW LETTER BET: בָּ
HEBREW LETTER RESH: רָ֣
HEBREW LETTER ALEF: א

HTH.

Adore answered 9/3, 2021 at 3:11 Comment(4)
@BradGilbert Looking at this with fresh eyes, it seems rule would be a better declarator than token for both the CHA (chapter) and VER (verse) grammars, since each target line contains significant whitespace, i.e. :sigspace. I'd love to know your opinion, thx. - Adore
In that code, there is no difference as there aren't any unquoted spaces. If you put something in quotes, it matches exactly as a group. It might make sense to split the quoted parts and make it a rule. rule CHA::TOP {'<c' 'n="'<t_number>'">'} which would be the same as token CHA::TOP {'<c' <.ws> 'n="'<t_number>'">'} Although it might be better to be explicit: token CHA::TOP {'<c' \s+ 'n="'<t_number>'">'}. I think it is better to default to making a token rather than a rule, especially if you are just learning. - Isocyanide
@jubilatious1, Thanks for your response. But Comma shows an error "Missing closing >". What's wrong? I searched the Raku docs for some explanation of <:Script<Hebrew>> and didn't find anything. - Frei
@Frei The reference is here: docs.raku.org/language/regexes#Unicode_properties and you can search for a cognate example there using <:Script<Latin>>. Also I would try a plaintext editor to confirm your issue, and/or Vim/Emacs. There could be an issue with Right-to-Left encoding and the editor you're using. HTH. - Adore
