Parsing latex-like language in Java

I'm trying to write a parser in Java for a simple language similar to Latex, i.e. it contains lots of unstructured text with a couple of \commands[with]{some}{parameters} in between. Escape sequences like \\ also have to be taken into account.

I've tried to generate a parser for that with JavaCC, but it looks as if compiler-compilers like JavaCC were only suitable for highly structured code (typical for general-purpose programming languages), not for messy Latex-like markup. So far, it seems I have to go low level and write my own finite state machine.

So my question is, what's the easiest way to parse input that is mostly unstructured, with only a few Latex-like commands in between?

EDIT: Going low level with a finite state machine is difficult because the Latex commands can be nested, e.g. \cmd1{\cmd2{\cmd3{...}}}

Westerfield answered 16/8, 2010 at 16:7 Comment(1)
The canonical resource is Learning to write a compiler. Your problem may well be small enough that a hand-tooled recursive descent approach makes sense. Also, I think you may be conflating lexing and parsing, which could make this seem harder than it is. – Godfrey

You can define a grammar to accept the Latex input, using just characters as tokens in the worst case. JavaCC should be just fine for this purpose.

The good thing about a grammar and a parser generator is that it can parse things that FSAs have trouble with, especially nested structures.

A first cut at your grammar could be (I'm not sure this is valid JavaCC, but it is reasonable EBNF):

 Latex = item* ;
 item = command | rawtext ;
 command = commandname arguments ;
 commandname = '\' letter ( letter | digit )* ;  -- might pick this up as lexeme
 letter = 'a' | 'b' | ... | 'z' ;
 digit= '0' | ...  | '9' ;
 arguments =  epsilon |  '{' item* '}' ;
 rawtext = ( letter | digit | whitespace | punctuationminusbackslash )+ ; -- might pick this up as lexeme
 whitespace = ' ' | '\t' | '\n' | '\r' ;
 punctuationminusbackslash = '!' | ... | '^' ;
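
If a parser generator feels like overkill, the same grammar can also be followed by hand. Below is a rough recursive-descent sketch in Java, not a definitive implementation. The names LatexLikeParser and Node are made up for illustration. It also makes a few assumptions that go beyond the grammar above: a command may take several {...} groups in a row, a backslash followed by a non-letter (such as \\ or \{) is treated as an escape for that character, [optional] arguments are not handled, and a bare '{' outside a command is reported as an error.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written recursive-descent parser for a LaTeX-like language (illustrative names). */
class LatexLikeParser {
    /** Either raw text (name == null) or a command with zero or more argument groups. */
    static class Node {
        final String name;                                  // command name, null for raw text
        final String text;                                  // raw text, null for a command
        final List<List<Node>> args = new ArrayList<>();    // one item list per {...} group
        Node(String name, String text) { this.name = name; this.text = text; }
    }

    private final String src;
    private int pos = 0;

    private LatexLikeParser(String src) { this.src = src; }

    /** Parses the whole input; throws IllegalArgumentException on malformed input. */
    static List<Node> parse(String src) {
        LatexLikeParser p = new LatexLikeParser(src);
        List<Node> items = p.items();
        if (p.pos < src.length()) {                         // items() stopped at a stray '}'
            throw new IllegalArgumentException("unmatched '}' at " + p.pos);
        }
        return items;
    }

    // item* : reads until end of input or until a '}' that the caller must consume
    private List<Node> items() {
        List<Node> out = new ArrayList<>();
        StringBuilder raw = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '}') {
            char c = src.charAt(pos);
            if (c == '{') throw new IllegalArgumentException("unexpected '{' at " + pos);
            if (c != '\\') { raw.append(c); pos++; continue; }
            if (pos + 1 >= src.length()) throw new IllegalArgumentException("dangling backslash");
            char next = src.charAt(pos + 1);
            if (!isLetter(next)) {                          // escape sequence: keep the escaped char
                raw.append(next);
                pos += 2;
                continue;
            }
            flush(raw, out);
            out.add(command());
        }
        flush(raw, out);
        return out;
    }

    // command: '\' letter (letter | digit)* followed by any number of '{' item* '}' groups
    private Node command() {
        int start = ++pos;                                  // skip the backslash
        while (pos < src.length() && isNameChar(src.charAt(pos))) pos++;
        Node cmd = new Node(src.substring(start, pos), null);
        while (pos < src.length() && src.charAt(pos) == '{') {
            pos++;                                          // consume '{'
            cmd.args.add(items());                          // recursion handles nesting
            if (pos >= src.length()) throw new IllegalArgumentException("missing '}'");
            pos++;                                          // consume '}'
        }
        return cmd;
    }

    private static void flush(StringBuilder raw, List<Node> out) {
        if (raw.length() > 0) {
            out.add(new Node(null, raw.toString()));
            raw.setLength(0);
        }
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    private static boolean isNameChar(char c) {
        return isLetter(c) || (c >= '0' && c <= '9');
    }
}
```

Calling LatexLikeParser.parse on the text \cmd1{\cmd2{x}} should give a single command node named cmd1 whose one argument holds the nested cmd2 node. The nesting that is awkward for a plain state machine falls out of the recursive call in command().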
Going answered 19/8, 2010 at 16:37 Comment(2)
Yes, this looks like a valid solution. But I'm wondering if splitting up the text into single-character tokens is bad performance-wise... – Westerfield
@python dude: unless your latex files are huge, I doubt this matters much. What you asked for was the "easiest" way to do this, and this is it! If you want to make it faster, you can implement some of the nonterminals (rawtext, etc.) as more traditional lexemes. I've modified the grammar slightly to make that easier. – Going
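
To illustrate what treating rawtext and command names as lexemes could look like, here is a minimal tokenizer sketch using java.util.regex. The class LatexLikeLexer, the Kind enum and the token categories are assumptions made for this example, not part of JavaCC or of the grammar as posted. It assumes that braces always delimit arguments and that a backslash followed by any non-letter character is an escape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based tokenizer that emits whole runs of text as single tokens (illustrative names). */
class LatexLikeLexer {
    enum Kind { COMMAND, ESCAPE, LBRACE, RBRACE, TEXT }

    static class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    // One alternative per lexeme. Order matters: COMMAND is tried before ESCAPE,
    // so a backslash followed by a letter is always read as a command name.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<command>\\\\[A-Za-z][A-Za-z0-9]*)"
        + "|(?<escape>\\\\.)"
        + "|(?<lbrace>\\{)"
        + "|(?<rbrace>\\})"
        + "|(?<text>[^\\\\{}]+)",
        Pattern.DOTALL);

    /** Splits the input into tokens; throws IllegalArgumentException on a trailing backslash. */
    static List<Token> tokenize(String src) {
        List<Token> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        while (pos < src.length()) {
            m.region(pos, src.length());
            if (!m.lookingAt()) {               // only a lone backslash at end of input fails
                throw new IllegalArgumentException("cannot tokenize at offset " + pos);
            }
            Kind kind = m.group("command") != null ? Kind.COMMAND
                      : m.group("escape")  != null ? Kind.ESCAPE
                      : m.group("lbrace")  != null ? Kind.LBRACE
                      : m.group("rbrace")  != null ? Kind.RBRACE
                      : Kind.TEXT;
            out.add(new Token(kind, m.group()));
            pos = m.end();
        }
        return out;
    }
}
```

With this split, the parser only ever sees a handful of tokens per paragraph instead of one token per character, which addresses the performance worry without changing the shape of the grammar.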
