Lexing/Parsing "here" documents

Asked 9/9, 2013 at 17:46 Answered 18/9, 2013 at 15:8

For those who are experts in lexing and parsing... I am attempting to write a series of programs in Perl that would parse IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.

JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.

The layout for inline data is as follows:

//DDNAME   DD *
this is the inline data
this is some more inline data
/*
...

Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).

More advanced, the inline data could appear as such:

//DDNAME   DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...

Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).

But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.

So, here's an example of inline comments in the case of inline data:

//DDNAME   DD *,DLM=ZZ THIS IS A COMMENT                                COL73DAT
data
...
ZZ
...more JCL

I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.
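The split at column 72 described above could look roughly like the following. This is only a sketch: the helper name split_record is made up for illustration, the token name COL73COMMENT is taken from the paragraph above, and short records are assumed to be padded with blanks to 80 columns.

```perl
use strict;
use warnings;

# Split one JCL record into the statement area (cols 1-72) and the
# cols 73-80 area. split_record is a hypothetical helper, not a library call.
sub split_record {
    my ($record) = @_;
    $record =~ s/\r?\n\z//;                    # drop the line ending, if any
    my $padded = sprintf '%-80s', $record;     # pad short records to 80 columns
    my ($main, $col73) = unpack 'a72 a8', $padded;
    $main =~ s/\s+\z//;                        # padding can be re-created on output
    my @out = ($main);                         # plain string: not a token yet
    push @out, [ 'COL73COMMENT', $col73 ] if $col73 =~ /\S/;
    return @out;
}

# Build the DD record from the example, with COL73DAT starting in column 73.
my $line = sprintf '%-72s%s', '//DDNAME   DD *,DLM=ZZ THIS IS A COMMENT', 'COL73DAT';
my ($main, $token) = split_record($line);
print "main:  [$main]\n";
print "col73: [$token->[1]]\n";
```

The string for cols 1-72 is left untokenized on purpose, so a downstream tokenizer can still decide whether it is JCL at all or inline data that must be passed through as-is.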

But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").

But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.

I imagine that Perl parsers have the same challenge, since seeing

<<DELIM

doesn't necessarily mean the line ends there and the here document data follows immediately. After all, you could see Perl like:

my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM

How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...

So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.

Thanks in advance for any help...

Diplomacy asked 9/9, 2013 at 17:46 Comment(4)

The traditional lexer/parser approach only works when your language is context-free. You just need to write parsing code at the right level of abstraction. – Anemology 9/9, 2013 at 18:47

Something to be aware of with JCL is that a /* is not necessary to delimit the end of instream data for DD *, but it is required for DD DATA without the DLM keyword. Also, JES2 JECL uses /* for its commands and JES3 uses //* (yes, the same as a comment). – Verisimilar 9/9, 2013 at 23:16

Excellent points... at this point, I just want to get some basics working with the intent of adding more "tokens" and elements later. JCL is really a hodge-podge language. – Diplomacy 10/9, 2013 at 19:43

You can also skip the DD line, and a SYSIN DD line will be generated for you. – Humic 13/9, 2013 at 13:55

In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.

Any language that supports heredocs with arbitrary terminators is not context-free, and thus cannot be handled by a context-free grammar fed by a lexer that knows nothing about context. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language for the parser. All we need is a little extra state in the lexer: a list of pending heredoc terminators.

For the parser, we treat introductions to heredocs <<END as string literals. But the lexer has to be extended to do the following:

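When a heredoc introduction is encountered, the lexer adds the terminator to a list of pending heredocs. The bodies follow in the same order as their introductions, so this list is consumed first-in, first-out.
When a newline is encountered, the pending heredoc bodies are lexed one after another until that list is empty. After that, normal lexing is resumed.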

Take care to update the line number appropriately.

In a hand-written combined parser/lexer, this could be implemented like so:

use strict; use warnings; use 5.010;

my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END

my @strs;
push @strs, parse_line() while pos($s) < length($s);
for my $i (0 .. $#strs) {
  say "STRING $i:";
  say $strs[$i];
}

sub parse_line {
  my @strings;
  my @heredocs;

  $s =~ /\G\s+/gc;

  # get the markers
  while ($s =~ /\G<<(\w+)/gc) {
    push @strings, '';
    push @heredocs, [ \$strings[-1], $1 ];
    $s =~ /\G[^\S\n]+/gc;  # spaces that are no newlines
  }

  # lex the EOL
  $s =~ /\G\n/gc or die "Newline expected";

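  # process the deferred heredocs in the order their markers appeared:
  while (my $heredoc = shift @heredocs) {
    my ($placeholder, $marker) = @$heredoc;
    # Match whole lines non-greedily so the body stops at the FIRST line
    # consisting of the marker (and may be empty); \Q..\E quotes the marker.
    $s =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc or die "Heredoc <<$marker expected";
    $$placeholder = $1;
  }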

  return @strings;
}

Output:

STRING 0:
body 1

STRING 1:
body 2

STRING 2:
body 3
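The same deferral also covers the case from the question, where ordinary code follows the marker on the same line (the ")->reformat();" example). The following is a minimal sketch under stated assumptions: the function name lex_heredoc_source and the token names HEREDOC, CODE and NEWLINE are invented for illustration, and everything that is not a marker or a newline is lumped into one CODE token, which a real lexer would split further.

```perl
use strict;
use warnings;

# Hypothetical lexer: separates heredoc markers from the rest of the line
# and fills in each heredoc body once the end of the line is reached.
sub lex_heredoc_source {
    my ($src) = @_;
    my (@tokens, @pending);
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if ($src =~ /\G<<(\w+)/gc) {
            my $token = [ 'HEREDOC', undef ];   # body is filled in later
            push @tokens,  $token;
            push @pending, [ $token, $1 ];
        }
        elsif ($src =~ /\G\n/gc) {
            push @tokens, [ 'NEWLINE', "\n" ];
            # The line is complete: read the deferred bodies in order.
            while (my $item = shift @pending) {
                my ($token, $marker) = @$item;
                $src =~ /\G((?:.*\n)*?)\Q$marker\E\n/gc
                    or die "Heredoc <<$marker not terminated";
                $token->[1] = $1;
            }
        }
        elsif ($src =~ /\G((?:(?!<<\w)[^\n])+)/gc) {
            push @tokens, [ 'CODE', $1 ];       # anything up to a marker or EOL
        }
    }
    return @tokens;
}

my $source = <<'END_OF_INPUT';
my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM
print $this;
END_OF_INPUT

my @tokens = lex_heredoc_source($source);
for my $t (@tokens) {
    my ($type, $text) = @$t;
    $text =~ s/\n/\\n/g;                        # make newlines visible
    printf "%-8s %s\n", $type, $text;
}
```

The parser sees CODE, HEREDOC, CODE, NEWLINE for the first line, so the heredoc behaves like a string literal in the middle of the expression, while its body lines never reach the ordinary tokenizing branches.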

The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses a moment for you to take over. Here is a high-level overview and a short blog post describing this technique, with the demo code on GitHub.

Doiron answered 9/9, 2013 at 19:16 Comment(3)

Very interesting... I will have to go over your example with a fine-toothed comb, but this looks very do-able. Thanks. – Diplomacy 9/9, 2013 at 19:37

@Diplomacy Oh, I can add explanations if needed. Are the regexes clear? The actual control flow is of course fairly boring in such a contrived example. – Doiron 9/9, 2013 at 19:39

I think it's pretty self-explanatory... I just need to get a real good understanding of it so that I can conceptualize it into my design. All in all, this sort of thing would be so much easier if things like string literals were not part of the issue. Those tend to get in the way of regexes a lot. – Diplomacy 9/9, 2013 at 19:52

In case anyone was wondering how I decided to resolve this, here is what I did.

My main lexing routine accepts an iterator that pumps full lines of text (it can take them from a file, a string, whatever I want). The routine uses that to create a second iterator, which splits each line at column 72 and returns a "mainline" token (cols 1-72) followed by a "col72" token holding whatever "comment" sits in cols 73-80. That iterator is in turn used to create a third iterator, which passes the col72 tokens through unchanged, but takes the mainline tokens and lexes them into atomic tokens (things like STRING, NUMBER, COMMA, NEWLINE, etc).

But here's the crux... the lexing routine has the ORIGINAL ITERATOR still... so when it receives a token that indicates there is a "here" document, it continues processing tokens until it hits a NEWLINE token (meaning end of the actual line of text) and then uses the original iterator to pull off the here document data. Since that iterator feeds the atomic tokens iterator, pulling from it then prevents those lines from being atomized.

To illustrate, think of iterators like hoses. The first hose is the main iterator. To that I attach the col72 iterator hose, and to that I attach the atomic tokenizer hose. As streams of characters go in the first hose, atomized tokens come out the end of the third hose. But I can attach a 2-way nozzle to the first hose that will allow its output to come out the alternate nozzle, preventing that data from going into the second hose (and hence the third hose). When I'm done diverting the data through the alternate nozzle, I can turn that off and then data begins flowing through the second and third hoses again.
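A rough sketch of this three-hose arrangement, using closures as iterators, might look like the following. It is not my actual code: all function and token names are illustrative, the atomizer just splits on whitespace into WORD tokens, the NEWLINE token is emitted by the second hose to keep things short, and only data ended by the DLM delimiter or "/*" is handled (ending at the next "//" record would need a way to push a line back onto the first hose).

```perl
use strict;
use warnings;

# First hose: hands out raw lines one at a time, undef when exhausted.
sub line_iterator {
    my @lines = split /\n/, shift;
    return sub { return shift @lines };
}

# Second hose: per line, a MAINLINE token (cols 1-72), a COL72 token if
# cols 73-80 are non-blank, then a NEWLINE token.
sub col72_iterator {
    my ($lines) = @_;
    my @queue;
    return sub {
        unless (@queue) {
            my $line = $lines->();
            return unless defined $line;
            my ($main, $rest) = unpack 'a72 a8', sprintf('%-80s', $line);
            $main =~ s/\s+\z//;
            push @queue, [ MAINLINE => $main ];
            push @queue, [ COL72 => $rest ] if $rest =~ /\S/;
            push @queue, [ NEWLINE => "\n" ];
        }
        return shift @queue;
    };
}

# Third hose: atomizes MAINLINE tokens, passes everything else through.
sub atom_iterator {
    my ($tokens) = @_;
    my @queue;
    return sub {
        until (@queue) {
            my $token = $tokens->();
            return unless defined $token;
            if ($token->[0] eq 'MAINLINE') {
                push @queue, map { [ WORD => $_ ] } split ' ', $token->[1];
            }
            else {
                push @queue, $token;
            }
        }
        return shift @queue;
    };
}

# The lexing routine keeps the ORIGINAL line iterator, so it can divert
# raw lines past the second and third hoses when inline data follows.
sub lex_jcl {
    my ($lines) = @_;
    my $atoms = atom_iterator(col72_iterator($lines));
    my (@out, @words, $delimiter);
    while (defined(my $token = $atoms->())) {
        push @out, $token;
        my ($type, $text) = @$token;
        if ($type eq 'WORD') {
            push @words, $text;
            # Operand field of a DD statement: "*", optionally with DLM=xx.
            if (@words == 3 && $words[1] eq 'DD'
                && $text =~ /\A\*(?:,DLM=(\S+))?\z/) {
                $delimiter = defined $1 ? $1 : '/*';
            }
        }
        elsif ($type eq 'NEWLINE') {
            if (defined $delimiter) {
                my @data;
                while (defined(my $raw = $lines->())) {
                    last if $raw =~ /\A\Q$delimiter\E/;
                    push @data, $raw;       # kept as-is, cols 73-80 included
                }
                push @out, [ INLINE_DATA => \@data ];
                undef $delimiter;
            }
            @words = ();
        }
    }
    return @out;
}

my $jcl = join "\n",
    sprintf('%-72s%s', '//SYSIN    DD *,DLM=ZZ COMMENT', 'COL73DAT'),
    '//THIS LOOKS LIKE JCL BUT IT IS DATA',
    sprintf('%-72s%s', 'DATA LINE', '00000010'),
    'ZZ',
    '//NEXT     DD DUMMY';

my @tokens = lex_jcl(line_iterator($jcl));
for my $t (@tokens) {
    my ($type, $value) = @$t;
    $value = join ' | ', @$value if ref $value;
    $value = '\n' if $type eq 'NEWLINE';
    print "$type: $value\n";
}
```

This only works because neither downstream hose reads ahead: each pulls a new line from the first hose only when its own queue is empty, so by the time the NEWLINE token arrives, the first hose is positioned exactly at the first line of inline data.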

Easy-peasy.

Diplomacy answered 18/9, 2013 at 15:8 Comment(0)
