Parsing Strings with JavaCC

I'm trying to think of a good way to parse strings using JavaCC without mistakenly matching them to another token. These strings should be able to contain spaces, letters, and numbers.

My identifier and number token are as follows:

<IDENTIFIER: (["a"-"z", "A"-"Z"])+>
<NUMBER: (["0"-"9"])+>

My current string token is:

<STRING: "\"" (<IDENTIFIER> | <NUMBER> | " ")+ "\"">

Ideally, I want to only save the stuff that's inside of the quotes. I have a separate file in which I do the actual saving of variables and values. Should I remove the quotes in there?

I originally had a method in the parser file like this:

variable=<IDENTIFIER> <ASSIGN> <QUOTE> message=<IDENTIFIER> <QUOTE>
{File.saveVariable(variable.image, message.image);}

But, as you might guess, this didn't allow for spaces—or numbers for that matter. For identifiers such as variable names, I only want to allow letters.

So, I'd just like to get some advice on how I could go about capturing string literals. In particular, I'd like to make strings such as:

" hello", "hello ", " hello " and "\nhello", "hello\n", "\nhello\n"

valid in my syntax.

Sperrylite answered 9/8, 2012 at 7:9 Comment(1)
You should accept DerMike's answer - it seems to be pretty flawless. – Below

When the lexer reads the first ", it should switch into a STRING_STATE and leave that state at the next " (bonus: only at an unescaped one, so escaped quotes can appear inside the string).

For example:

TOKEN:
{
  <QUOTE:"\""> : STRING_STATE
}

<STRING_STATE> MORE:
{
  // MORE keeps the backslash and prefixes it to the image of the next token
  "\\" : ESC_STATE
}

<STRING_STATE> TOKEN:
{
  <ENDQUOTE:<QUOTE>> : DEFAULT
| <CHAR:~["\"","\\"]>
}

<ESC_STATE> TOKEN:
{
  <CNTRL_ESC:["\"","\\","/","b","f","n","r","t"]> : STRING_STATE
}
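
To make the state switching easier to follow, here is a minimal hand-written sketch in plain Java that mimics the three lexical states above. It is not what JavaCC generates; the class name `StringStateSketch`, the `tokenize` method, and the "KIND:image" string format are assumptions made up for illustration. It only handles input that consists of quoted strings.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-rolled imitation of the DEFAULT / STRING_STATE /
// ESC_STATE lexical states. All names here are hypothetical.
class StringStateSketch {
    enum State { DEFAULT, STRING_STATE, ESC_STATE }

    /** Returns tokens as "KIND:image" strings. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.DEFAULT;
        StringBuilder more = new StringBuilder(); // text carried over, like MORE
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (state) {
                case DEFAULT:
                    if (c != '"') {
                        throw new IllegalArgumentException("sketch only handles quoted strings");
                    }
                    tokens.add("QUOTE:" + c);
                    state = State.STRING_STATE;
                    break;
                case STRING_STATE:
                    if (c == '"') {            // closing quote: back to DEFAULT
                        tokens.add("ENDQUOTE:" + c);
                        state = State.DEFAULT;
                    } else if (c == '\\') {    // keep the backslash for the next token
                        more.append(c);
                        state = State.ESC_STATE;
                    } else {
                        tokens.add("CHAR:" + c);
                    }
                    break;
                case ESC_STATE:
                    if ("\"\\/bfnrt".indexOf(c) < 0) {
                        throw new IllegalArgumentException("unknown escape: \\" + c);
                    }
                    tokens.add("CNTRL_ESC:" + more + c); // image is backslash + letter
                    more.setLength(0);
                    state = State.STRING_STATE;
                    break;
            }
        }
        if (state != State.DEFAULT) {
            throw new IllegalArgumentException("unterminated string");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal below is the six-character-plus-quotes text "a\n b"
        System.out.println(tokenize("\"a\\n b\""));
    }
}
```

Note that a quote preceded by a backslash is consumed in ESC_STATE and therefore never ends the string, and that spaces and digits are ordinary CHAR tokens inside the quotes.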

You can use these tokens in parser productions like this:

/**
 * Match a quoted string.
 */
String string() :
{
  StringBuilder builder = new StringBuilder();
}
{
  <QUOTE> ( getChar(builder) )* <ENDQUOTE>
  {
    return builder.toString();
  }
}

/**
 * Match char inside quoted string.
 */
void getChar(StringBuilder builder):
{
  Token t;
}
{
  ( t = <CHAR> | t = <CNTRL_ESC> )
  {
    if (t.image.length() < 2)
    {
      // CHAR: a single ordinary character
      builder.append(t.image.charAt(0));
    }
    else if (t.image.length() < 6)
    {
      // CNTRL_ESC: the backslash (carried over by MORE) plus the escape character
      char c = t.image.charAt(1);
      switch (c)
      {
        case 'b': builder.append((char) 8); break;
        case 'f': builder.append((char) 12); break;
        case 'n': builder.append((char) 10); break;
        case 'r': builder.append((char) 13); break;
        case 't': builder.append((char) 9); break;
        default: builder.append(c);
      }
    }
  }
}
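
The decoding step in getChar can also be tried outside the parser. The sketch below is a standalone Java version of the same logic, assuming token images shaped as the grammar above produces them: one character for CHAR, and a backslash plus one escape character for CNTRL_ESC (because MORE prefixes the backslash). The names `EscapeDecodeSketch`, `appendDecoded`, and `decode` are hypothetical. With this grammar a CNTRL_ESC image is always two characters long, so the `< 6` bound in the original is always satisfied; the source does not say why 6 was chosen.

```java
// Illustrative only: mirrors the branch logic of getChar on plain strings.
class EscapeDecodeSketch {
    /** Appends the character that one token image stands for. */
    static void appendDecoded(String image, StringBuilder out) {
        if (image.length() == 1) {
            out.append(image.charAt(0)); // CHAR: copy as-is
            return;
        }
        // CNTRL_ESC: image.charAt(0) is the backslash, charAt(1) the escape code
        char c = image.charAt(1);
        switch (c) {
            case 'b': out.append('\b'); break; // backspace, (char) 8
            case 'f': out.append('\f'); break; // form feed, (char) 12
            case 'n': out.append('\n'); break; // line feed, (char) 10
            case 'r': out.append('\r'); break; // carriage return, (char) 13
            case 't': out.append('\t'); break; // tab, (char) 9
            default:  out.append(c);           // \" \\ \/ map to themselves
        }
    }

    /** Decodes a sequence of token images into the string value. */
    static String decode(String... images) {
        StringBuilder out = new StringBuilder();
        for (String image : images) {
            appendDecoded(image, out);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Images for: h, i, backslash-n, backslash-quote
        String value = decode("h", "i", "\\n", "\\\"");
        System.out.println(value.length());
    }
}
```

Because the surrounding quotes are separate QUOTE and ENDQUOTE tokens, the decoded value never contains them, which addresses the question of where to strip the quotes.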

HTH.

Gillan answered 17/8, 2012 at 18:35 Comment(2)
Thanks for the great solution. It's been 10 years since your answer, but could you explain why you need the t.image.length() < 6 condition, and why exactly 6? – Salpingitis
@YuriiMelnychuk, unfortunately, I don't remember. Sorry. – Gillan
