It's true that you cannot edit the definition of yyFlexLexer
, since FlexLexer.h
is effectively a system-wide header file. But you can certainly subclass it, which will provide most of what you need.
Subclassing yyFlexLexer
Flex allows you to use %option yyclass
(or the --yyclass
command-line option) to specify the name of a subclass, which will be used instead of yyFlexLexer
to define yylex
. Subclassing yyFlexLexer
allows you to include your own header which defines your subclass' members and maybe even additional functions, as well as its constructors; in short, if your intention was simply to fill in a std::vector<alpha_token_t>
with the successive tokens, you could easily do that by defining AlphaLexer
as a subclass of yyFlexLexer
, with an instance member called tokens
(or, perhaps, with accessor functions).
You can also add member functions to your new class, which might provide what you need those additional arguments for.
The thing which is not quite so straightforward, although it could easily be accomplished using the YY_DECL
macro in the C interface, is to change the name and prototype of the scanning function generated by flex. It can be done (see below) but it is not clear that it is actually supported. In any case, it is possibly less important in the case of C++.
Aside from a small wrinkle created by the curious organization of Flex's C++ classes [Note 1], subclassing the lexer class is simple. You need to derive your class from yyFlexLexer
[Note 2], which is declared in FlexLexer.h
, and you need to tell Flex what the name of your class is, either by using %option yyclass
in your Flex file, or by specifying the name on the command line with --yyclass
.
yyFlexLexer
includes the various methods for manipulating input buffers, as well as all the mutable state for the lexical scanner used by the standard skeleton. (Much of this is actually derived from the base class FlexLexer
.) It also includes a virtual yylex
method with prototype
virtual int yylex();
When you subclass yyFlexLexer
, yyFlexLexer::yylex()
is defined to signal an error by calling yyFlexLexer::LexerError(const char*)
and the generated scanner is defined as the override in the class defined as yyclass
. (If you don't subclass, the generated scanner is yyFlexLexer::yylex()
.)
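To make the dispatch concrete, here is a minimal sketch that does not use flex at all. StandInLexer is a hypothetical stand-in for yyFlexLexer (the real class is declared in FlexLexer.h and has many more members), and the override in MyScanner plays the role of the scanner flex would generate. The real LexerError prints the message and exits by default; the stand-in throws instead, purely so that the behaviour can be checked.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for yyFlexLexer, so the sketch compiles without flex.
class StandInLexer {
public:
    virtual ~StandInLexer() = default;

    // With yyclass in effect, the base-class yylex() only reports an error.
    virtual int yylex() {
        LexerError("base yylex() invoked but a subclass was requested");
        return 0;
    }

protected:
    // Assumption: throwing replaces the real print-and-exit behaviour.
    virtual void LexerError(const char* msg) { throw std::runtime_error(msg); }
};

// Plays the role of the class named by %option yyclass.
class MyScanner : public StandInLexer {
public:
    // In a real project flex generates this body from the rules.
    int yylex() override { return 0; }  // 0 conventionally means end of input
};

// Calls yylex() through the base interface, as client code holding a
// base-class reference would.
int scan_once(StandInLexer& lexer) { return lexer.yylex(); }

// Returns true if calling yylex() on a plain base object reports an error.
bool base_yylex_fails() {
    StandInLexer base;
    try {
        base.yylex();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Small self-check exercising both paths.
void check_dispatch() {
    MyScanner scanner;
    assert(scan_once(scanner) == 0);  // reaches the subclass override
    assert(base_yylex_fails());       // the base version only signals an error
}
```

The point of the sketch is only the virtual dispatch: code that holds a base-class reference still ends up in the subclass scanner.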
The one wrinkle is the way you need to declare your subclass. Normally, you would do that in a header file like this:
File: myscanner.h (Don't use this version)
#pragma once
// DON'T DO THIS; IT WON'T WORK (flex 2.6)
#include <FlexLexer.h>
class MyScanner : public yyFlexLexer {
// whatever
};
You would then #include "myscanner.h"
in any file which needed to use the scanner, including the generated scanner itself.
Unfortunately, that won't work because it will result in FlexLexer.h
being included twice in the generated scanner; FlexLexer.h
does not have an include guard in the normal sense of the word because it is designed to be included multiple times in order to support the prefix
option. So you need to define two header files:
File: myscanner-internal.h
#pragma once
// This file depends on FlexLexer.h having already been included
// in the translation unit. Don't use it other than in the scanner
// definition.
class MyScanner : public yyFlexLexer {
// whatever
};
File: myscanner.h
#pragma once
#include <FlexLexer.h>
#include "myscanner-internal.h"
Then you use #include "myscanner.h"
in every file which needs to know about the scanner except the scanner definition itself. In your myscanner.ll
file, you will #include "myscanner-internal.h"
, which works because Flex has already included FlexLexer.h
before it inserts the prologue C++ code from your scanner definition.
Changing the yylex prototype
You can't really change the prototype (or name) of yylex
, because it is declared in FlexLexer.h
and, as mentioned above, defined to signal an error. You can, however, redefine YY_DECL
to create a new scanner interface. To do so, you must first #undef
the existing YY_DECL
definition, at least in your scanner definition, because a scanner with %option yyclass="MyScanner" already contains the line
#define YY_DECL int MyScanner::yylex()
That would make your myscanner-internal.h
file look like this:
#pragma once
// This file depends on FlexLexer.h having already been included
// in the translation unit. Don't use it other than in the scanner
// definition.
#include <vector>
#include "alpha_token.h"
#undef YY_DECL
#define YY_DECL int MyScanner::alpha_yylex(std::vector<alpha_token_t>& tokens)
class MyScanner : public yyFlexLexer {
public:
int alpha_yylex(std::vector<alpha_token_t>& tokens);
// whatever else you need
};
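The macro mechanism can be seen in isolation in the following sketch, which compiles without flex. It assumes, as described above, that the generated scanner defines YY_DECL before the prologue is inserted and later expands the macro as the header of the scanning function. Here MyScanner has no base class and alpha_token_t is a made-up one-field struct, both only to keep the example self-contained; the function body stands in for the code flex would generate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token type; the real alpha_token_t comes from alpha_token.h.
struct alpha_token_t {
    std::string text;
};

// Simplified MyScanner without the yyFlexLexer base class.
class MyScanner {
public:
    int alpha_yylex(std::vector<alpha_token_t>& tokens);
};

// What the generated scanner would already contain with yyclass="MyScanner".
#define YY_DECL int MyScanner::yylex()

// What the prologue does: discard that definition and supply a new one.
#undef YY_DECL
#define YY_DECL int MyScanner::alpha_yylex(std::vector<alpha_token_t>& tokens)

// The scanning function is defined by expanding the macro, so the new name
// and parameter list take effect here.
YY_DECL
{
    tokens.push_back(alpha_token_t{"example"});  // placeholder for a rule action
    return 0;  // 0 conventionally means end of input
}

// Quick self-check: the renamed function fills the caller's vector.
void check_renamed_scanner() {
    std::vector<alpha_token_t> tokens;
    MyScanner scanner;
    assert(scanner.alpha_yylex(tokens) == 0);
    assert(tokens.size() == 1);
}
```

Without the #undef, the second #define would be a conflicting redefinition of the macro, which is why the existing definition has to be removed first.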
The fact that the MyScanner
object still has a (not very functional) yylex
method might not be a problem. There are some undocumented interfaces in FlexLexer
which call yylex()
, but those don't matter if you don't use them. (They're not all that useful, anyway.) But you should at least be aware that the interface exists.
In any case, I don't see the point of renaming yylex
(but perhaps you have a different aesthetic sense). It's already effectively namespaced by being a member of a specific class (MyScanner
, above), so yylex
doesn't really create any confusion.
In the particular case of the std::vector<alpha_token_t>&
argument, it seems to me that a cleaner solution would be to store the reference as a member variable of the MyScanner
class, initialized by the constructor (or to store a pointer instead, if you want to set it later with an accessor method). Unless you actually use different vectors at different points in the lexical analysis -- not evident in the example code in your question -- there's no point burdening every call site with the need to pass the vector into the yylex
call. Since lexer actions are compiled inside yylex
, which is a member function of MyScanner
, instance variables -- even private instance variables -- are usable in the lexer actions. Of course, that's not the only use case for extra yylex
arguments, but it's a pretty common one.
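A minimal sketch of that arrangement follows. StandInLexer is a hypothetical substitute for yyFlexLexer and alpha_token_t is a made-up struct, so that the example compiles without flex; the body of yylex stands in for generated code in which a rule action appends to the member.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical token type; the question's real alpha_token_t may differ.
struct alpha_token_t {
    int kind;
    std::string text;
};

// Hypothetical stand-in for yyFlexLexer, so the sketch compiles without flex.
class StandInLexer {
public:
    virtual ~StandInLexer() = default;
    virtual int yylex() = 0;
};

class MyScanner : public StandInLexer {
public:
    // The vector is bound once, at construction.
    explicit MyScanner(std::vector<alpha_token_t>& tokens) : tokens_(tokens) {}

    // Stand-in for the generated scanner. Rule actions are compiled inside
    // this member function, so they can use tokens_ even though it is private.
    int yylex() override {
        tokens_.push_back(alpha_token_t{1, "identifier"});
        return 0;  // 0 conventionally means end of input
    }

private:
    std::vector<alpha_token_t>& tokens_;
};

// Call sites no longer pass the vector on every call.
std::vector<alpha_token_t> scan_all() {
    std::vector<alpha_token_t> tokens;
    MyScanner scanner(tokens);
    while (scanner.yylex() != 0) {
        // keep scanning until end of input
    }
    assert(!tokens.empty());
    return tokens;
}
```

In a real scanner the base class would be yyFlexLexer, and the constructor would typically also forward the input stream to the base-class constructor.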
Notes
1. "The C++ interface is a mess," according to a comment in the generated code.
2. Using %option prefix
, you can change yy
to something else if you want to. This is a feature which is supposedly intended to allow you to include multiple lexical scanners in the same project. However, if you're planning on subclassing, the base classes for all these lexical scanners will be identical (other than their names). Thus, there is little or no point having different base classes. Renaming the scanner class using %option prefix
is less flexible and no more efficient than subclassing, and it creates an additional header complication. (See this older answer for details.) So I'd recommend sticking with subclassing.