I want to process the whole Tanach file, in Hebrew. For that, I chose Raku because of some of its features (grammars and Unicode support).
So I defined some grammars, each with a few tokens, to select the relevant data.
grammar HEB {
    token TOP    { '<hebrewname>' <t_word> '</hebrewname>' }
    token t_word { <graph>+ }
}
grammar CHA {
    token TOP      { '<c n="' <t_number> '">' }
    token t_number { \d+ }
}
grammar VER {
    token TOP      { '<v n="' <t_number> '">' }
    token t_number { \d+ }
}
grammar WOR {
    token TOP    { '<w>' <t_word> '</w>' }
    token t_word { <graph>+ }
}
Here is a very small part of the document (the Tanach in XML format), which is sufficient to show the problem:
<names>
<name>Genesis</name>
<abbrev>Gen</abbrev>
<number>1</number>
<filename>Genesis</filename>
<hebrewname>בראשית</hebrewname>
</names>
<c n="1">
<v n="1">
<w>בְּ/רֵאשִׁ֖ית</w>
<w>בָּרָ֣א</w>
<w>אֱלֹהִ֑ים</w>
<w>אֵ֥ת</w>
<w>הַ/שָּׁמַ֖יִם</w>
<w>וְ/אֵ֥ת</w>
<w>הָ/אָֽרֶץ׃</w>
</v>
<v n="2">
<w>וְ/הָ/אָ֗רֶץ</w>
<w>הָיְתָ֥ה</w>
<w>תֹ֙הוּ֙</w>
<w>וָ/בֹ֔הוּ</w>
<w>וְ/חֹ֖שֶׁךְ</w>
<w>עַל־</w>
<w>פְּנֵ֣י</w>
<w>תְה֑וֹם</w>
<w>וְ/ר֣וּחַ</w>
<w>אֱלֹהִ֔ים</w>
<w>מְרַחֶ֖פֶת</w>
<w>עַל־</w>
<w>פְּנֵ֥י</w>
<w>הַ/מָּֽיִם׃</w>
</v>
The problem is that the code doesn't recognize the first two words (<w>בְּ/רֵאשִׁ֖ית</w> and <w>בָּרָ֣א</w>) but seems to work fine with the following words.
Could somebody explain what's wrong?
The main loop is:
for $file_in.lines -> $line {
    $memline = $line.trim;
    if HEB.parse($memline) {
        say "hebrew name of book is " ~ $/<t_word>;
        next;
    }
    if CHA.parse($memline) {
        say "chapitre number is " ~ $/<t_number>;
        next;
    }
    if VER.parse($memline) {
        say "verse number is " ~ $/<t_number>;
        next;
    }
    if WOR.parse($memline) {
        $computed_word_value = 0;
        say "word is " ~ $/<t_word>;
        $file_out.print("$/<t_word>");
        say "numbers of graphemes of word is " ~ $/<t_word>.chars;
        @exploded_word = $/<t_word>.comb;
        for @exploded_word {
            say $_.uniname;
        }
        next;
    }
    say "not processed";
}
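A side note on the inner loop: .comb splits the word into graphemes, and .uniname reports only the first code point of each, so vowel points and cantillation marks never show up in the listing. Below is a minimal sketch for inspecting every code point of a word. The sample $word is hypothetical: it is built from escapes (BET, SHEVA, DAGESH, as at the start of the first unmatched word) rather than read from the file.

```raku
# Hypothetical sample, not read from the Tanach file:
# U+05D1 HEBREW LETTER BET, U+05B0 HEBREW POINT SHEVA,
# U+05BC HEBREW POINT DAGESH OR MAPIQ
my $word = "\x[05D1]\x[05B0]\x[05BC]";

# A base letter together with its combining marks counts as one grapheme
say "graphemes:   ", $word.chars;

# .codes counts every letter and every mark separately
say "code points: ", $word.codes;

# .uninames lists the name of every code point, not only the first one
.say for $word.uninames;
```

Comparing this listing for a word that matches with one that does not may help narrow down which code points are involved.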
Output:
Please note that after "verse number is 1", the first two words are not processed. Ignore the distorted Hebrew (Windows console).
not processed
not processed
not processed
not processed
not processed
hebrew name of book is ׳‘׳¨׳׳©׳™׳×
not processed
chapitre number is 1
verse number is 1
not processed
not processed
word is ׳ײ±׳œײ¹׳”ײ´ײ‘׳™׳
numbers of graphemes of word is 5
HEBREW LETTER ALEF
HEBREW LETTER LAMED
HEBREW LETTER HE
HEBREW LETTER YOD
HEBREW LETTER FINAL MEM
word is ׳ײµײ¥׳×
numbers of graphemes of word is 2
HEBREW LETTER ALEF
HEBREW LETTER TAV
not processed
word is ׳•ײ°/׳ײµײ¥׳×
numbers of graphemes of word is 4
HEBREW LETTER VAV
SOLIDUS
I hope my question is clear.
$memline = $line.trim;
you can use .subparse()
if VER.subparse($line) {
though it will match anywhere in a line. – Isocyanide
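To illustrate the .subparse suggestion, here is a minimal sketch using the VER grammar from the question and a made-up input string. As far as I know, .subparse starts at the beginning of the string just like .parse, but does not require the match to reach the end of the string.

```raku
grammar VER {
    token TOP      { '<v n="' <t_number> '">' }
    token t_number { \d+ }
}

# Made-up input: a verse tag followed by extra text on the same line
my $line = '<v n="2"> trailing text';

# .parse must consume the whole string, so the trailing text makes it fail
say so VER.parse($line);

# .subparse may stop before the end of the string
my $m = VER.subparse($line);
say so $m;
say "verse number is " ~ $m<t_number> if $m;
```

If that is right, .subparse tolerates trailing text, while leading whitespace would presumably still need .trim or similar handling.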