parsing identifiers except keywords

I am struggling to write an identifier parser that parses an alphanumeric string which is not a keyword. The keywords are all in a table:

struct keywords_t : x3::symbols<x3::unused_type> {
    keywords_t() {
        add("for", x3::unused)
                ("in", x3::unused)
                ("while", x3::unused);
    }
} const keywords;

The parser for an identifier should be this:

auto const identifier_def =       
            x3::lexeme[
                (x3::alpha | '_') >> *(x3::alnum | '_')
            ];

Now I am trying to combine these so that the identifier parser fails when it is given a keyword. I tried it like this:

auto const identifier_def =       
                x3::lexeme[
                    (x3::alpha | '_') >> *(x3::alnum | '_')
                ]-keywords;

and this:

auto const identifier_def =       
                x3::lexeme[
                    (x3::alpha | '_') >> *(x3::alnum | '_') - keywords
                ];

It works on most inputs, but if a string starts with a keyword, such as int, whilefoo or forbar, the parser fails to parse it. How can I get this parser right?

Etamine answered 26/6, 2016 at 13:59 Comment(2)
You may want to look at LLVM's libtooling: clang.llvm.org/docs/LibTooling.html (comment by Oringa)
I also expected such semantics of operator -, but it is rather different. There is some related discussion here. (comment by Ornithorhynchus)

Your problem is caused by the semantics of the difference operator in Spirit. When you have a - b, Spirit does the following:

  • check whether b matches:
    • if it does, a - b fails and nothing is parsed.
    • if b fails then it checks whether a matches:
      • if a fails, a - b fails and nothing is parsed.
      • if a succeeds, a - b succeeds and parses whatever a parses.
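
The steps above can be modelled in a few lines of standard C++. This is only a sketch of the semantics, not Spirit's implementation: Parser, keyword, identifier, is_ident_char and difference are hypothetical names, and a parser is assumed to be a function that returns the number of characters it consumed, or nothing on failure.

```cpp
#include <cctype>
#include <cstddef>
#include <functional>
#include <initializer_list>
#include <optional>
#include <string_view>

// Hypothetical model: a parser returns how many characters it consumed from
// the front of the input, or std::nullopt if it fails.
using Parser = std::function<std::optional<std::size_t>(std::string_view)>;

// Matches "for", "in" or "while" as a PREFIX of the input.
inline std::optional<std::size_t> keyword(std::string_view in) {
    for (std::string_view kw : {"for", "in", "while"}) {
        if (in.substr(0, kw.size()) == kw) return kw.size();
    }
    return std::nullopt;
}

// True if c may appear after the first character of an identifier.
inline bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Models (alpha | '_') >> *(alnum | '_').
inline std::optional<std::size_t> identifier(std::string_view in) {
    if (in.empty()) return std::nullopt;
    if (!std::isalpha(static_cast<unsigned char>(in[0])) && in[0] != '_') {
        return std::nullopt;
    }
    std::size_t n = 1;
    while (n < in.size() && is_ident_char(in[n])) ++n;
    return n;
}

// Models a - b: try b first; if b matches, fail without consuming anything.
// Otherwise the result is whatever a does.
inline Parser difference(Parser a, Parser b) {
    return [a, b](std::string_view in) -> std::optional<std::size_t> {
        if (b(in)) return std::nullopt;
        return a(in);
    };
}
```

In this model, difference(identifier, keyword) rejects "fortran" as well as "for", because keyword only has to match a prefix of the input for the whole difference to fail.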

In your case (unchecked_identifier - keyword), keyword matches as soon as the identifier starts with a keyword, so your parser fails. You need to replace keyword with something that matches only a distinct keyword, that is, a keyword that is not followed by another identifier character. The not predicate (!) can help with that.

auto const distinct_keyword = x3::lexeme[ keywords >> !(x3::alnum | '_') ];
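
To see what the not predicate changes, here is the same check written as a plain C++ function, without Spirit. It is a rough sketch only: starts_with_distinct_keyword and is_ident_char are hypothetical names, and the keyword list is copied from the question.

```cpp
#include <cctype>
#include <initializer_list>
#include <string_view>

// True if c could continue an identifier (alnum or '_').
inline bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Mimics keyword >> !(alnum | '_'): succeeds only if the input starts with a
// keyword AND the character after it (if any) cannot continue an identifier.
inline bool starts_with_distinct_keyword(std::string_view in) {
    for (std::string_view kw : {"for", "in", "while"}) {
        if (in.substr(0, kw.size()) != kw) continue;  // not this keyword
        // Lookahead only: the following character is inspected, not consumed.
        if (in.size() == kw.size() || !is_ident_char(in[kw.size()])) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions the function accepts "for" and "while (x)" but not "fortran" or "whilefoo", which is the behaviour the identifier rule needs from the right-hand side of the difference.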

Full Sample (Running on Coliru):

//#define BOOST_SPIRIT_X3_DEBUG
#include <iostream>
#include <string>
#include <boost/spirit/home/x3.hpp>

namespace parser {
    namespace x3 = boost::spirit::x3;

    struct keywords_t : x3::symbols<x3::unused_type> {
        keywords_t() {
            add("for", x3::unused)
                    ("in", x3::unused)
                    ("while", x3::unused);
        }
    } const keywords;

    x3::rule<struct identifier_tag,std::string>  const identifier ("identifier");

    auto const distinct_keyword = x3::lexeme[ keywords >> !(x3::alnum | '_') ];
    auto const unchecked_identifier = x3::lexeme[(x3::alpha | x3::char_('_')) >> *(x3::alnum | x3::char_('_'))];


    auto const identifier_def = unchecked_identifier - distinct_keyword;

    //This should also work:
    //auto const identifier_def = !distinct_keyword >> unchecked_identifier;


    BOOST_SPIRIT_DEFINE(identifier);

    bool is_identifier(const std::string& input)
    {
        auto iter = std::begin(input), end= std::end(input);

        bool result = x3::phrase_parse(iter,end,identifier,x3::space);

        return result && iter==end;
    }
}



int main() {

    std::cout << parser::is_identifier("fortran") << std::endl;
    std::cout << parser::is_identifier("for") << std::endl;
    std::cout << parser::is_identifier("integer") << std::endl;
    std::cout << parser::is_identifier("in") << std::endl;
    std::cout << parser::is_identifier("whileechoyote") << std::endl;
    std::cout << parser::is_identifier("while") << std::endl;
}
Tacmahack answered 26/6, 2016 at 13:59 Comment(0)

The problem is that this runs without a lexer. That is, if you write

keyword >> *char_

and feed it whilefoo, it will parse while as the keyword and foo as the *char_.

You can prevent that in two ways. Either require a space after the keyword, i.e.

auto keyword_rule = (keyword >> x3::space);
//or if you use phrase_parse
auto keyword_rule = x3::lexeme[keyword >> x3::space];

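The other way, which you described, is also possible: exclude the keyword explicitly before parsing the string (I'd do it that way). Note that the lookahead has to test for a whole keyword, i.e. one that is not followed by another identifier character; a bare !keyword would still reject whilefoo:

auto string = x3::lexeme[!(keyword >> !(x3::alnum | '_')) >> (x3::alpha | '_') >> *(x3::alnum | '_')];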

The problem with your definition is that it interprets the first few characters as the keyword and therefore refuses to parse the input at all. The 'x - y' operator means: parse x, but only if y does not match at the same position. So if you pass 'whilefoo', 'while' matches as the keyword and nothing is parsed.

Barfuss answered 26/6, 2016 at 16:5 Comment(0)
