Can JavaCC distinguish token by its context?

Asked 22/12, 2015 at 3:4 Answered 22/12, 2015 at 20:49

Solved java compiler-construction token javacc

Basic requirement is use keyword as identifier, so I want to distinguish the token from it's context.(e.g.class is a keyword, but we allowed a variable named class).

In java, this is possible, but it's so hard, here is how I do it

TOKEN :
{
    <I_CAL:     "CAL">  : DO_CAL
    | <I_CALL:  "CALL">
    | <I_CMP:   "CMP">
    | <I_EXIT:  "EXIT">
    | <I_IN:    "IN">
    | <I_JMP:   "JMP">
    | <I_JPC:   "JPC">  : NEED_CMP_OP
    | <I_LD:    "LD">   : NEED_DATA_TYPE
    | <I_NOP:   "NOP">
    | <I_OUT:   "OUT">
    | <I_POP:   "POP">
    | <I_PUSH:  "PUSH">
    | <I_RET:   "RET">
    | <I_DATA:  "DATA"> : DO_DATA
    | <I_BLOCK:  ".BLOCK">
}

// T prefix for Token
TOKEN :
{
    <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
// We need below TOKEN in special context, other wise they are just IDENTIFIER
//    | <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT">
//    | <PSEUDO_DATA_TYPE: "CHAR" >
//    | <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD">
//    | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
    | <T_LABEL: <IDENTIFIER> ([" "])* <COLON>>
}

// Now we need a CMP OP
<NEED_CMP_OP> TOKEN:
{
    <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ"> : DEFAULT
}
// Now we need a DATA TYPE
<NEED_DATA_TYPE,DO_CAL> TOKEN:
{
    // EXTENSION Add char to data type
    <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT" | "CHAR"> {
        if(curLexState == DO_CAL){
            SwitchTo(NEED_CAL_OP);
        }else{
            SwitchTo(DEFAULT);
        }
    }
}
// We need a CAL OP
<NEED_CAL_OP> TOKEN:
{
    <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD"> : DEFAULT
}
// Aslo need to skip the empty
<NEED_DATA_TYPE,NEED_CAL_OP,NEED_CMP_OP,DO_CAL,DO_DATA> SKIP:
{
    " "
|   "\t"
|   "\r"
|   "\f"
}

Source is here, I can distinguish the token from context by curLexState.

It is works, but fussy to do, need to add a lot extra state, and maintain a lot states.Is there any easy way to achieve this ?

Arbutus answered 22/12, 2015 at 3:4 Comment(5)

Separating the lexer and the parser (which makes grammars much easier) requires that you have keywords. Your solution is basically undoing that distinction by putting knowledge of the grammar of the language back into the lexer. I don't think that there is an "easier" solution. – Decca 22/12, 2015 at 3:27

This is always going to be very difficult with any form of generated parser. You can accomplish it by hand but even then it's really not easy. Try to remove this requirement, it's decades out of date. – Bonsai 22/12, 2015 at 3:39

@EJP I try to implement a parse for exists simple ASM. – Arbutus 22/12, 2015 at 4:0

@ErwinBolwidt Maybe I should change the grammar-ly token to grammar, that make a lot sense ? – Arbutus 22/12, 2015 at 4:1

@Arbutus Keywords in assembly languages are distinguished by the column they are in. – Bonsai 22/12, 2015 at 6:4

There are three ways to do this outlined in the JavaCC FAQ.

One is to use lexical states, as you have done. This method can be tricky, but it's the only way to deal with situations where the length of the longest match depends on context or where rules for skipping depend on context. For your problem, it is probably more complex than you need.
A second is to use one token kind, and to use semantic lookahead based on the token image to get the parser to treat some token's specially in some circumstances. See the FAQ for more.
The third (and usually easiest) approach is to make distinctions at the lexical level and then ignore the distinctions at the syntactic level. This is usually the best way to deal with keywords that can double as identifiers.

Below I'll give three examples of the third approach.

Using keywords as identifiers

If all you want to do is to allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer put in the usual rules.

TOKEN: { <CLASS: "class"> }
TOKEN: { < VARNAME: ["a-"z","A"-Z"](["a-"z","A"-Z"])* > } // Or what you will

In the parser write a production

Token varName() { Token t ; } : {
{
    (t = <CLASS> | t = <VARNAME>)
    {return t ;}
}

Then use varName() elsewhere in the parser.

The original poster's assembler

Turning to the assembler example in the original question, let's look at the JPC instruction as an example. The JPC (Jump conditional) instruction is followed by a comparison operator such as Z, B, etc and then an operand that can be a number of things including identifiers. E.g. we could have

JPC Z fred

But we could also have an identifier named JPC or Z, so

JPC Z JPC

and

JPC Z Z

are also a valid JPC instructions.

In the lexical part we have

TOKEN : // Opcodes
{
    <I_CAL: "CAL"> 
|   <I_JPC: "JPC"> 
|   ... // other op codes
    <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
|   <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
}
... // Other lexical rules.

TOKEN : // Be sure this rule comes after all keywords.
{
    < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
}

In the parser we have

Instruction Instruction():{
    Instruction inst = new Instruction();
    Token o = null,dataType = null,calType = null,cmpType = null;
    Operand a = null,b = null; }
{
    ...
    o = <I_JPC> cmpType = <CMP_OP> a = Operand()
    ...
}

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Token Identifier : {
    Token t ; }
{
    t = <IDENTIFIER> {return t ;}
|   t = <I_CAL>      {return t ;}
|   t = <I_JPC>      {return t ;}
|   t = <CMP_OP>     {return t ;}
| ... // All other keywords
}

I would suggest excluding register names from the list of other keywords that could be used as identifiers.

If you do include <T_REGISTER> in that list, then there will be an ambiguity in operand because Operand looks like this

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Now there is an ambiguity because

JPC Z R0

has two parses. In the context of being an operand, we want tokens like "R0" to be parsed as registers not identifiers. Luckly JavaCC will prefer earlier choices, so this is exactly what will happen. You will get a warning from JavaCC. You can ignore the warning. (I add a comment to my source code so that other programmers don't worry.) Or you can suppress the warning with a lookahead specification.

Operand Operand():{
    Token t ; ... }
{
     LOOKAHEAD(1) t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Using right context

So far all the examples have used left context. I.e. we can tell how to treat a token based solely on the sequence of tokens to its left. Let's look at a case where the interpretation of a keyword is based on the tokens to the right.

Consider this simple imperative language in which all the keywords can be used as variable names.

P -> Block <EOF>
Block -> [S Block]
S -> Assignment | IfElse
Assignment -> LHS ":=" Exp
LHS -> VarName
IfElse -> "if" Exp Block ["else" Block] "end"
Exp -> VarName
VarName -> <ID> | if | else | end

This grammar is unambiguous. You can make the grammar more complicated by adding new kinds of statements, expressions and left-hand sides; as long as the grammar stays unambiguous, such complications probably won't make much difference to what I'm going to say next. Feel free to experiment.

The grammar is not LL(1). There are two places where a choice must be made based on more than one future token. One is the choice between Assignment and IfElse when the next token is "if". Consider the block

a := b
if := a

a := b
if q
    b := c
end

We can look ahead for a ":=" like this

void S() : {} {
    LOOKAHEAD( LHS() ":=" ) Assignment()
|
    IfElse() 
}

The other place we need to look ahead is when an "else" or an "end" is encountered at the start of a Block. Consider

if x
    end := y
    else := z
end

We can solve this with

void Block() : {} {
    LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
|
    {}
}

Franciscka answered 22/12, 2015 at 20:49 Comment(10)

How does ANTLR handle the implied ambiguity? What parse tree does it build? – Yann 23/12, 2015 at 2:9

@IraBaxter I'm not sure about ANTLR 3 and 4, although they are probably similar to JavaCC. In JavaCC, there is no ambiguity at the lexical level because of the the "first match rule", so "class" is always a <CLASS> and never a <VARNAME>. Ambiguity at the syntactic level could arise if you had a choice between <CLASS> and varName(). JavaCC would warn of the ambiguity. The programmer can resolve the ambiguity with a syntactic lookahead specification (syntactic predicate for ANTLR) or live with the warning. If you use JJTree, the parse tree (by default) follows the grammar. – Franciscka 24/12, 2015 at 13:12

I don't understand how the answer you provided chooses the correct interpretation of the token in the general case. Arbitrary lookahead might be required; but to distinguish in that case ANTLR must carry the possibilities of two different parses up to the point where the predicate can run. How can it do that? – Yann 24/12, 2015 at 17:32

I might not understand the question. As I said above, there in no ambiguity at the lexical level. There is only one interpretation of each token. I.e. for any given input string there is at most one way to break it into tokens and each token will only be assigned one kind. This is true of any JavaCC lexical specification unless you get really crazy with lexical states. I believe ANTLR is similar. So I'm not sure what you mean by the "correct interpretation". – Franciscka 26/12, 2015 at 1:28

If the "token" is interpreted always by the lexer as the keyword "class", then it cannot be used as an identifier, and vice-versa. One might (I think your code does this) ask the parser for keyword or identifier for class in its present state, and if only one is acceptable with the present left context, it might be able to choose correctly. ... – Yann 26/12, 2015 at 4:26

... But if the correct interpretation of the token depends upon some right context, then ANTLR must choose one or the other at the point where it sees the token, and then one can easily construct an adversary input stream where that choice is always wrong. Maybe I misunderstand how ANTLR works, but it seems to me based on this argument that it can't choose correctly. What am I missing? [GLR parsers succeed here because they can carry both interpretations for arbitrarily long right context.] – Yann 26/12, 2015 at 4:26

The OP points out that their present problem can be solved with lexical states, i.e. left context, and is asks for an easy way to solve the same problem. My answer provides another way to solve the same problem; I consider it an easy way. @IraBaxter is looking for something else: a solution when right context also needs to be considered. In JavaCC the way to do this is with syntactic lookahead. In ANTLR it is with syntactic predicates. Does this solution work in every conceivable situation? Probably not. But in most situations that come up in practice it seems to work well. – Franciscka 26/12, 2015 at 18:15

So to follow on my last comment (I ran out of room): Dealing with right context is really a different question. If anyone really wants a particular example illustrated, they should ask a new question. Trying to give a full example and its solution in a comment is very awkward. – Franciscka 26/12, 2015 at 23:53

I note that OP's title question does not restrict the context in which his keywords can be used. His actual question detail content happens to restrict the context quite a bit. – Yann 27/12, 2015 at 2:1

Since the title might lead people to expect right as well as left context, I've added a section to the answer that addresses right context with a simple example. There are still two other approaches that I know of: lexical states and semantic lookahead that this answer doesn't address. So it's still not a complete answer. One could write a lot on the topic. – Franciscka 31/12, 2015 at 17:41

If you merge the lexer and parser into a character-oriented parser, then it is relatively easy to distinguish keywords in context, because the parser is all about retaining context. You could operate JavaCC on character tokens to achieve this effect, but its LL nature would probably make it impossible to write practical grammars for other reasons.

If you separate lexer and parser, this isn't easy.

You are asking the lexer to know when something is an identifier or a keyword, which it can only do by knowing the context which the Id/keyword is found.

Ideally the lexer would simply ask the parser for its state, and that would identify the contexts in which the choice is made. That's hard to organize; most parsers aren't designed to reveal their state easily or in a form easy to interpret for extracting the context signal needed. JavaCC isn't obviously organized this way.

Your other obvious choice is to model the different contexts as states in the lexer, with transitions between lexing states corresponding to transitions between interesting contexts. This may or may not be easy depending the context. If you can do it, you have to code the states and the transitions in your lexer and keep them up to date. When you can do this "easily", it is not a bad solution. This can be hard or impossible depending the specific contexts.

For OPs purpose (apparantly a parser for an assembler), the context is usually determined by the position within the source line. One can qualitatively divide assembler input into Label, Opcode, Operand, Comment contexts by watching whitespace: A newline sets the context to Label, whitespace in Label mode sets context to Opcode, whitespace in Opcode sets Operand context, and whitespace in Operand context sets Comment context. With these state transitions, one can write different sublexers for each context, thus having different keywords in each subcontext.

This trick doesn't work for languages like PL/I, which have vast numbers of keywords in context (for PL/I, in fact, every keyword is only in context!).

A non-obvious choices is to not try to differentiate at all. When an Id/Keyword is found, feed both tokens to the parser, and let it sort out which one leads to a viable parse. (Note: it may have handle the cross product of multiple ambiguous tokens, thus many possible parses while sorting this out.) This requires a parser that can handle ambiguity, both while parsing, and in the tokens it accepts (or it can't accept both an ID and a Keyword token at the same time). This is a beautifully simple solution to use when you have the right parsing machinery. JavaCC isn't that machinery.

[See my bio for a GLR parsing engine in which all 3 solutions are easily accessible. It handles Pl/I easily.

Yann answered 22/12, 2015 at 5:42 Comment(0)

There are three ways to do this outlined in the JavaCC FAQ.

One is to use lexical states, as you have done. This method can be tricky, but it's the only way to deal with situations where the length of the longest match depends on context or where rules for skipping depend on context. For your problem, it is probably more complex than you need.
A second is to use one token kind, and to use semantic lookahead based on the token image to get the parser to treat some token's specially in some circumstances. See the FAQ for more.
The third (and usually easiest) approach is to make distinctions at the lexical level and then ignore the distinctions at the syntactic level. This is usually the best way to deal with keywords that can double as identifiers.

Below I'll give three examples of the third approach.

Using keywords as identifiers

If all you want to do is to allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer put in the usual rules.

TOKEN: { <CLASS: "class"> }
TOKEN: { < VARNAME: ["a-"z","A"-Z"](["a-"z","A"-Z"])* > } // Or what you will

In the parser write a production

Token varName() { Token t ; } : {
{
    (t = <CLASS> | t = <VARNAME>)
    {return t ;}
}

Then use varName() elsewhere in the parser.

The original poster's assembler

JPC Z fred

But we could also have an identifier named JPC or Z, so

JPC Z JPC

and

JPC Z Z

are also a valid JPC instructions.

In the lexical part we have

TOKEN : // Opcodes
{
    <I_CAL: "CAL"> 
|   <I_JPC: "JPC"> 
|   ... // other op codes
    <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
|   <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
}
... // Other lexical rules.

TOKEN : // Be sure this rule comes after all keywords.
{
    < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
}

In the parser we have

Instruction Instruction():{
    Instruction inst = new Instruction();
    Token o = null,dataType = null,calType = null,cmpType = null;
    Operand a = null,b = null; }
{
    ...
    o = <I_JPC> cmpType = <CMP_OP> a = Operand()
    ...
}

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Token Identifier : {
    Token t ; }
{
    t = <IDENTIFIER> {return t ;}
|   t = <I_CAL>      {return t ;}
|   t = <I_JPC>      {return t ;}
|   t = <CMP_OP>     {return t ;}
| ... // All other keywords
}

I would suggest excluding register names from the list of other keywords that could be used as identifiers.

If you do include <T_REGISTER> in that list, then there will be an ambiguity in operand because Operand looks like this

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Now there is an ambiguity because

JPC Z R0

Operand Operand():{
    Token t ; ... }
{
     LOOKAHEAD(1) t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Using right context

Consider this simple imperative language in which all the keywords can be used as variable names.

P -> Block <EOF>
Block -> [S Block]
S -> Assignment | IfElse
Assignment -> LHS ":=" Exp
LHS -> VarName
IfElse -> "if" Exp Block ["else" Block] "end"
Exp -> VarName
VarName -> <ID> | if | else | end

a := b
if := a

a := b
if q
    b := c
end

We can look ahead for a ":=" like this

void S() : {} {
    LOOKAHEAD( LHS() ":=" ) Assignment()
|
    IfElse() 
}

The other place we need to look ahead is when an "else" or an "end" is encountered at the start of a Block. Consider

if x
    end := y
    else := z
end

We can solve this with

void Block() : {} {
    LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
|
    {}
}

Franciscka answered 22/12, 2015 at 20:49 Comment(10)

How does ANTLR handle the implied ambiguity? What parse tree does it build? – Yann 23/12, 2015 at 2:9

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Using keywords as identifiers

The original poster's assembler

Using right context

Using keywords as identifiers

The original poster's assembler

Using right context

Recommended topics

Hot tags