Class hierarchy of tokens and checking their type in the parser
Asked Answered
W

1

11

I'm attempting to write a reusable parsing library (for fun).

I wrote a Lexer class which generates a sequence of Tokens. Token is a base class for a hierarchy of subclasses, each representing different token type, with its own specific properties. For example, there is a subclass LiteralNumber (deriving from Literal and through it from Token), which has its own specific methods for dealing with numeric value of its lexeme. Methods for dealing with lexemes in general (retrieving their character string representation, position in the source etc.) are in the base class, Token, because they're general to all token types. Users of this class hierarchy can derive their own classes for specific token types not predicted by me.

Now I have a Parser class which reads the stream of tokens and tries to match them with its syntax definition. For example, it has a method matchExpression, which in turn calls matchTerm and this one calls matchFactor, which has to test if the current token is Literal or Name (both derived from Token base class).

The problem is:
I need to check now what is the type of the current token in the stream and whether it matches the syntax or not. If not, throw an EParseError exception. If yes, act accordingly to get its value in the expression, generate machine code, or do whatever the parser needs to do when the syntax matches.

But I've read much about that checking the type in runtime, and deciding from it, is a bad design™, and it should be refactored as polymorphic virtual methods. Of course, I agree with that.

So my first attempt was to put some type virtual method in the Token base class, which would be overrided by the derived classes and return some enum with type id.

But I already see a shortcomings of this approach: Users deriving from Token their own classes of tokens won't be able to add additional id's to the enum, which is in the library source! :-/ And the goal was to allow them for extending the hierarchy for new types of tokens when they'll need it.

I could also return some string from the type method, which would allow for easy defining new types.

But still, in both these cases, the information about base types is lost (only leaf type is returned from type method) and the Parser class wouldn't be able to detect the Literal derived type when someone would derive from it and override the type to return something other than "Literal".

And of course the Parser class, which also is meant for extending by users (that is, writing their own parsers, recognizing their own tokens and syntax) doesn't know what descendants of the Token class will be there in the future.

Many FAQs and books on design recommend in this scenario to take the behavior from the code which needs to decide by type, and put it into the virtual method overriden in derived classes. But I cannot imagine how could I put this behavior into the Token descendants, because it's not their busines, for example, to generate machine code, or evaluate expressions. Moreover, there are parts of the syntax which need to match more than one token, so there is no one particular token which I could put that behavior into. It's rather the responsibility of particular syntax rules, which could match more than one token as their terminal symbols.

Any ideas how to improve this design?

Wintery answered 9/9, 2011 at 13:13 Comment(8)
+1, I ask myself the same question every time I write a parser (and I’ve written several).Pharr
I would cite google style guide :"Do not hand-implement an RTTI-like workaround. The arguments against RTTI apply just as much to workarounds like class hierarchies with type tags." Personally I don't agree that runtime type checking is always bad thing.Velocipede
I know the rationale behind prefering RTTI in place of hand-crafted type tags. That's exactly the problem I've described in my question above (though probably not verbose enough). I'm looking for a way to replace this type-tag approach with something better and more flexible, which is already in the language. But I also heard about discrepancies of using these built-in RTTI mechanisms (unportability, performance loss etc.) so I'm curious if it's any much better.Wintery
Don't reinvent a square wheel, check boost::spirit for example.Dikdik
@Gene: Except for learning or when you can make it better and more appropriate for your own usage (e.g. I've skipped spirit because of large compile times for a simple LISP-like grammar with basic diagnostic output; LLVM was the most appropriate, but I've skipped that, too, as it injects a huge dependency into my project)Hemline
@phresnel -- if compile time is a problem, check AXE then, compilation is very fast and it's even easier to use; but in any case using polymorphism (or worse explicit type checking) is a terrible idea for a parser.Dikdik
@Gene: That's why I wrote "(for fun)" at the very beginning of my question. I'd use Spirit if it wouldn't be only for fun and for learning purposes. But thanks for advice. You write that "using polymorphism (or worse explicit type checking) is a terrible idea for a parser" (which I also know), but you didn't wrote which is BETTER. So you didn't really answered my question, only trying to be a smart-ass, which doesn't help me at all.Wintery
@Gene: As far as I see, AXE is neither libre software, nor cross platform, which is unfortunate as I think that C++0x suits well for such stuff. Also: Type checking has not really to do with polymorphism, and I think that if you use polymorphism right, it can be an awesome tool for parser authoring, as can be functional paradigms or aspect oriented ones. Even boost::spirit is polymorphic, and I think AXE is, too.Hemline
S
4

RTTI is well supported by all major C++ compilers. This includes at least GCC, Intel and MSVC. The portability concerns are really a thing of the past.

If it is the syntax you don't like then here is a nice solution to pretty up RTTI:

class Base {
public:
  // Shared virtual functions
  // ...

  template <typename T>
  T *instance() {return dynamic_cast<T *>(this);}
};

class Derived : public Base {
  // ...
};

// Somewhere in your code
Base *x = f();

if (x->instance<Derived>()) ;// Do something

// or
Derived *d = x->instance<Derived>();

A common alternative to RTTI for a parser AST using virtual function overloading, without maintaining your own type enumeration, is to use the visitor pattern but in my experience that quickly becomes a PITA. You still have to maintain the visitor class but this can be sub-classed and extended. You will end up with a lot of boilerplate code all for the sake of avoiding RTTI.

Another option is just to create virtual functions for the syntactic types you are interested in. Such as isNumeric() which returns false in the Token base class but is overridden ONLY in Numeric classes to return true. If you provide default implementations for your virtual functions and let the subclasses override only when they need to then much of your problems will disappear.

RTTI is not as bad TM as it once was. Check the dates on the articles you are reading. One could also argue that pointers are a very bad idea but then you end up with languages like Java.

Stuppy answered 9/9, 2011 at 21:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.