Using boost::spirit::qi to parse numbers with separators

I am attempting to use boost::spirit::qi to do some parsing. It's actually going quite well, and I have successfully managed to parse numbers in various bases based on a suffix. Examples: 123, c12h, 777o, 110101b.

I then wanted to add the ability to allow a completely ignored separator character, to allow values like 123_456 or 1101_0011b to parse. I tried using the skip parser, but I highly suspect that I completely misunderstood how it was to be used. It compiles just fine, but my attempt to make it ignore the underscore does absolutely nothing at all. Any suggestions on how to make this do what I want would be appreciated. My test code is included below:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <cstdint>
#include <iostream>
#include <string>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
using qi::_val;
using qi::_1;
using qi::skip;
using qi::uint_parser;
using ascii::char_;

template <typename Iterator>
struct unsigned_parser : qi::grammar<Iterator, uint64_t()> {

    unsigned_parser() : unsigned_parser::base_type(start) {
        uint_parser<uint64_t, 10> dec_parser;
        uint_parser<uint64_t, 16> hex_parser;
        uint_parser<uint64_t, 8> oct_parser;
        uint_parser<uint64_t, 2> bin_parser;

        start = skip(char_('_'))[
            /* binary with suffix */
            (bin_parser[_val=_1] >> char_("bByY"))
            /* octal with suffix */
            | (oct_parser[_val=_1] >> char_("qQoO"))
            /* hexadecimal with suffix */
            | (hex_parser[_val=_1] >> char_("hHxX"))
            /* decimal with optional suffix */
            | (dec_parser[_val=_1] >> -char_("dDtT"))
            ];
    }

    qi::rule<Iterator, uint64_t()> start;
};

int main(int argv, const char *argc[]) {
    typedef std::string::const_iterator iter;
    unsigned_parser<iter> up;
    uint64_t val;
    if (argv != 2) {
        std::cerr << "Usage: " << argc[0] << " <input>" << std::endl;
        return 1;
    }
    std::string test(argc[1]);
    iter i = test.begin();
    iter end = test.end();
    bool rv = parse(i, end, up, val);
    if (rv && i == end) {
        std::cout << "Succeeded: " << val << std::endl;
        return 0;
    }
    if (rv) {
        std::cout << "Failed partial parse: " << val << std::endl;
        return 1;
    }
    std::cout << "Failed." << std::endl;
    return 1;
}
Petersham answered 18/3, 2015 at 21:14 Comment(0)

Aw. Nobody should have to bother with implementation details like Spirit parser contexts unless you're extending the library and implementing your own parser directives.

Until that time, phoenix::function<>, phoenix::bind or even BOOST_PHOENIX_ADAPT_FUNCTION should be plenty for anyone.

Here are three approaches to your question, none of which requires patching the library.

  1. Straightforward parsing Live On Coliru

    This could be viewed as the "naive" way of parsing the different styles of integers using just Qi and simple semantic actions:

    start = 
          eps [_val=0] >> +(char_("0-9a-fA-F") [ _val = _val*16 + _decode(_1) ] | '_')>>  char_("hHxX") /* hexadecimal with suffix */
        | eps [_val=0] >> +(char_("0-7")       [ _val = _val* 8 + _decode(_1) ] | '_')>>  char_("qQoO") /* octal       with suffix */
        | eps [_val=0] >> +(char_("01")        [ _val = _val* 2 + _decode(_1) ] | '_')>>  char_("bByY") /* binary      with suffix */
        | eps [_val=0] >> +(char_("0-9")       [ _val = _val*10 + _decode(_1) ] | '_')>> -char_("dDtT") /* decimal     with optional suffix */
        ;
    

    Of course, you will want to know what _decode looks like. Well you define it yourself:

    struct decode {
        template <typename> struct result { typedef int type; };
        template <typename Ch> int operator()(Ch ch) const {
            if (ch>='0' && ch<='9') return ch - '0';
            if (ch>='a' && ch<='z') return ch - 'a' + 10;
            if (ch>='A' && ch<='Z') return ch - 'A' + 10;
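            assert(false); // not a digit: the grammar never passes such characters
            return 0;      // keeps release (NDEBUG) builds well-defined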
        }
    };
    boost::phoenix::function<decode> _decode;
    
  2. Using BOOST_PHOENIX_ADAPT_FUNCTION macro Live On Coliru

    Instead of defining the function object, you can wrap a plain function with the macro:

    int decode(char ch) {
        if (ch>='0' && ch<='9') return ch - '0';
        if (ch>='a' && ch<='z') return ch - 'a' + 10;
        if (ch>='A' && ch<='Z') return ch - 'A' + 10;
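        assert(false); // not a digit: the grammar never passes such characters
        return 0;      // keeps release (NDEBUG) builds well-defined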
    }
    
    BOOST_PHOENIX_ADAPT_FUNCTION(int, _decode, decode, 1)
    
  3. Using std::strtoul Live On Coliru

    Of course, the above may be a tad "complex" because it requires you to deal with the nitty-gritty details of integer arithmetic and digit decoding.

    Also, the "naive" approach does some duplicate work in case the literal is a decimal value like "101_101". It will calculate the subresult for the hex, octal and binary branches before realizing it was a decimal.

    So we could change the order around, matching the raw text first and converting it only once the base is known:

    start = 
            (raw[+char_("_0-9a-fA-F")] >>  char_("hHxX")) [ _val = _strtoul(_1,16) ] /* hexadecimal with suffix */
          | (raw[+char_("_0-7")]       >>  char_("qQoO")) [ _val = _strtoul(_1, 8) ] /* octal       with suffix */
          | (raw[+char_("_01")]        >>  char_("bByY")) [ _val = _strtoul(_1, 2) ] /* binary      with suffix */
          | (raw[+char_("_0-9")]       >> -char_("dDtT")) [ _val = _strtoul(_1,10) ] /* decimal     with optional suffix */
          ;
    

    Again, you will be curious how _strtoul is implemented. It's a function that takes the synthesized attribute from raw (which is an iterator range) and the base, which is definitely known by then:

    struct strtoul_f {
        template <typename, typename> struct result { typedef uint64_t type; };
        template <typename Raw, typename Int> uint64_t operator()(Raw raw, Int base) const {
            std::string s(raw.begin(), raw.end());
            s.erase(std::remove(s.begin(), s.end(), '_'), s.end());
            char *f(&s[0]), *l(f+s.size());
            return std::strtoul(f, &l, base);
        }
    };
    boost::phoenix::function<strtoul_f> _strtoul;
    

    As you can see, the only complexity is removing the _ from the range first.
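
For readers who want to try the conversion step outside of Spirit, the strip-then-convert idea can be sketched in plain standard C++. This is an illustration only, not the answer's code: the name to_uint64 is hypothetical, and it uses std::strtoull instead of std::strtoul on the assumption that unsigned long may be only 32 bits wide on some platforms. It also adds the error checks that the short functor above leaves out.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the answer): convert a run of digits that may
// contain '_' separators. Returns false on empty input, on characters that are
// not digits of `base`, and on overflow of unsigned long long.
bool to_uint64(std::string digits, int base, std::uint64_t& out) {
    assert(base >= 2 && base <= 36);
    // Drop the separators first, as strtoul_f does.
    digits.erase(std::remove(digits.begin(), digits.end(), '_'), digits.end());
    // strtoull would accept leading whitespace or a sign; reject those here.
    // (A "0x" prefix is still accepted by strtoull when base is 16.)
    if (digits.empty() || !std::isalnum(static_cast<unsigned char>(digits[0])))
        return false;
    errno = 0;
    char* end = nullptr;
    unsigned long long v = std::strtoull(digits.c_str(), &end, base);
    if (errno == ERANGE) return false;                        // overflow
    if (end != digits.c_str() + digits.size()) return false;  // stray character
    out = static_cast<std::uint64_t>(v);
    return true;
}
```

Called as to_uint64("1_b", 16, v), this stores 27 in v. Returning a bool with an out-parameter keeps the sketch usable with older compilers; std::optional would be a natural alternative where C++17 is available.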

Crinite answered 22/3, 2015 at 13:51 Comment(1)
Added the version with BOOST_PHOENIX_ADAPT_FUNCTION (Live On Coliru). - Crinite

If you really want to do this the "nice" way, you'd have to hack it into extract_int in numeric_utils.hpp.

Even better, you'd want to make it a strategy class, much like the real_policies used by real_parser, because just mixing more branches into the existing general-purpose integer handling code complicates it and has the potential to slow down all integer parsing.

I have not done this. However, I do have a proof-of-concept approach here:

Mind you, this is not well tested and not fit for serious use for the reasons stated, but you can use it as inspiration. You might want to just duplicate the uint_parser directive as-a-whole and stick it in your Spirit Repository location.


The patch

  1. It's relatively straightforward. If you define ALLOW_SO_UNDERSCORE_HACK you will get the bypass for underscore inserted into the loop unrolling macros:

    #if defined(ALLOW_SO_UNDERSCORE_HACK)
    #   define SPIRIT_SO_SKIP_UNDERSCORE_HACK()                                   \
                    if ('_' == *it) {                                             \
                        ++it;                                                     \
                        continue;                                                 \
                    }
    #else
    #   define SPIRIT_SO_SKIP_UNDERSCORE_HACK()
    #endif
    

    The only real complexity there is from "seeing through" the optimizations made in that translation unit.

  2. There's a rather arbitrary choice to (dis)allow underscores among the leading zeros. I have opted to allow them:

    #if defined(ALLOW_SO_UNDERSCORE_HACK)
                    // skip leading zeros
                    for(;it != last;++it) {
                        if ('0' == *it && leading_zeros < MaxDigits) {
                            ++leading_zeros;
                            continue;
                        } else if ('_' == *it) {
                            continue;
                        }
                        break;
                    }
    #else
    
  3. Finally, underscores are not counted towards the MinDigits and MaxDigits limits.
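
The behaviour described in the three points above can be sketched outside of Spirit. The function below is a simplified illustration and not the code from numeric_utils.hpp: extract_digits, digit_value and their parameters are hypothetical names, there is no loop unrolling, and overflow checking is omitted for brevity. It skips underscores wherever they appear among the digits and counts only real digits against the limits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Value of a digit character, or -1 if it is not a digit in any base <= 36.
int digit_value(char ch) {
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'z') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'Z') return ch - 'A' + 10;
    return -1;
}

// Hypothetical sketch: consume digits of `base` starting at `pos`, skipping '_'.
// Only real digits count toward min_digits/max_digits. On success `pos` is left
// after the last consumed character and `out` holds the value; on failure both
// are left untouched. No overflow detection.
bool extract_digits(const std::string& in, std::size_t& pos, unsigned base,
                    std::size_t min_digits, std::size_t max_digits,
                    std::uint64_t& out) {
    assert(base >= 2 && base <= 36);
    std::size_t i = pos, count = 0;
    std::uint64_t val = 0;
    while (i < in.size() && count < max_digits) {
        if (in[i] == '_') { ++i; continue; }  // separator: skipped, not counted
        int d = digit_value(in[i]);
        if (d < 0 || static_cast<unsigned>(d) >= base) break;
        val = val * base + static_cast<unsigned>(d);
        ++count;
        ++i;
    }
    if (count < min_digits) return false;
    pos = i;
    out = val;
    return true;
}
```

With a limit of two digits, the input 1_2_3 yields 12 and stops after the 2, because the underscore between the digits did not use up any of the budget.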

DEMO

The following test program demonstrates things. Note the reordering of branches.

#include <boost/spirit/include/qi.hpp>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

template <typename Iterator>
struct unsigned_parser : qi::grammar<Iterator, uint64_t()> {

    unsigned_parser() : unsigned_parser::base_type(start) {
        using namespace qi;
        uint_parser<uint64_t, 10> dec_parser;
        uint_parser<uint64_t, 16> hex_parser;
        uint_parser<uint64_t, 8> oct_parser;
        uint_parser<uint64_t, 2> bin_parser;

        start = eps(false)
            | (hex_parser >> omit[ char_("hHxX")]) /* hexadecimal with suffix */
            | (oct_parser >> omit[ char_("qQoO")]) /* octal with suffix */
            | (bin_parser >> omit[ char_("bByY")]) /* binary with suffix */
            | (dec_parser >> omit[-char_("dDtT")]) /* decimal with optional suffix */
            ;
    }

    qi::rule<Iterator, uint64_t()> start;
};

int main(int argv, const char *argc[]) {
    typedef std::string::const_iterator iter;
    unsigned_parser<iter> up;

    for (auto const& test : std::vector<std::string>(argc+1, argc+argv)) {
        iter i = test.begin(), end = test.end();

        uint64_t val = 0; // defined value to print even if the parse fails
        bool rv = parse(i, end, up, val);

        std::cout << (rv?"Successful":"Failed") << " parse: '" << test << "' -> " << val << "\n";

        if (i != end)
            std::cout << " ** Remaining unparsed: '" << std::string(i,end) << "'\n";
    }
}

If you call it with command line arguments 123_456 123456 1_bh 0_010Q 1010_1010_0111_0111_b it will print:

Successful parse: '123_456' -> 123456
Successful parse: '123456' -> 123456
Successful parse: '1_bh' -> 27
Successful parse: '0_010Q' -> 8
Successful parse: '1010_1010_0111_0111_b' -> 43639
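
The values in this output can be cross-checked without Boost. The sketch below is not part of the answer: parse_literal and check_demo_values are hypothetical names, and the function inspects the suffix first instead of trying ordered alternatives as the grammar does, so it is only meant as an independent check of the arithmetic for complete literals. Overflow is not detected.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: parse a whole literal such as "1_bh" by looking at the
// suffix first, then accumulating the digits and ignoring '_' separators.
// Returns false on an empty or invalid literal.
bool parse_literal(const std::string& text, std::uint64_t& out) {
    if (text.empty()) return false;
    std::string digits = text;
    unsigned base = 10;
    bool has_suffix = true;
    switch (text.back()) {
        case 'h': case 'H': case 'x': case 'X': base = 16; break;
        case 'q': case 'Q': case 'o': case 'O': base = 8;  break;
        case 'b': case 'B': case 'y': case 'Y': base = 2;  break;
        case 'd': case 'D': case 't': case 'T': base = 10; break;
        default: has_suffix = false;  // plain decimal, no suffix
    }
    if (has_suffix) digits.pop_back();

    std::uint64_t val = 0;
    bool seen_digit = false;
    for (char ch : digits) {
        if (ch == '_') continue;  // separators carry no value
        unsigned d;
        if (ch >= '0' && ch <= '9') d = ch - '0';
        else if (ch >= 'a' && ch <= 'f') d = ch - 'a' + 10;
        else if (ch >= 'A' && ch <= 'F') d = ch - 'A' + 10;
        else return false;
        if (d >= base) return false;  // digit not valid in this base
        val = val * base + d;
        seen_digit = true;
    }
    if (!seen_digit) return false;
    out = val;
    return true;
}

// Usage: the inputs and values from the demo output.
void check_demo_values() {
    std::uint64_t v = 0;
    assert(parse_literal("123_456", v) && v == 123456);
    assert(parse_literal("123456", v) && v == 123456);
    assert(parse_literal("1_bh", v) && v == 27);
    assert(parse_literal("0_010Q", v) && v == 8);
    assert(parse_literal("1010_1010_0111_0111_b", v) && v == 43639);
}
```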

LISTING

Full patch (on boost-1.57.0 tag) for preservation on SO:

commit 24b16304f436bfd0f6e2041b2b7be0c8677c7e75
Author: Seth Heeren <[email protected]>
Date:   Thu Mar 19 01:44:55 2015 +0100

    https://mcmap.net/q/1481920/-using-boost-spirit-qi-to-parse-numbers-with-separators

    rough patch for exposition of my answer only

diff --git a/include/boost/spirit/home/qi/numeric/detail/numeric_utils.hpp b/include/boost/spirit/home/qi/numeric/detail/numeric_utils.hpp
index 5137f87..1ced164 100644
--- a/include/boost/spirit/home/qi/numeric/detail/numeric_utils.hpp
+++ b/include/boost/spirit/home/qi/numeric/detail/numeric_utils.hpp
@@ -262,10 +262,21 @@ namespace boost { namespace spirit { namespace qi { namespace detail
    ///////////////////////////////////////////////////////////////////////////
    //  extract_int: main code for extracting integers
    ///////////////////////////////////////////////////////////////////////////
+#if defined(ALLOW_SO_UNDERSCORE_HACK)
+#   define SPIRIT_SO_SKIP_UNDERSCORE_HACK()                                   \
+                if ('_' == *it) {                                             \
+                    ++it;                                                     \
+                    continue;                                                 \
+                }
+#else
+#   define SPIRIT_SO_SKIP_UNDERSCORE_HACK()
+#endif
+
#define SPIRIT_NUMERIC_INNER_LOOP(z, x, data)                                 \
        if (!check_max_digits<MaxDigits>::call(count + leading_zeros)         \
            || it == last)                                                    \
            break;                                                            \
+        SPIRIT_SO_SKIP_UNDERSCORE_HACK()                                      \
        ch = *it;                                                             \
        if (!radix_check::is_valid(ch) || !extractor::call(ch, count, val))   \
            break;                                                            \
@@ -301,12 +312,25 @@ namespace boost { namespace spirit { namespace qi { namespace detail
            std::size_t leading_zeros = 0;
            if (!Accumulate)
            {
+#if defined(ALLOW_SO_UNDERSCORE_HACK)
+                // skip leading zeros
+                for(;it != last;++it) {
+                    if ('0' == *it && leading_zeros < MaxDigits) {
+                        ++leading_zeros;
+                        continue;
+                    } else if ('_' == *it) {
+                        continue;
+                    }
+                    break;
+                }
+#else
                // skip leading zeros
                while (it != last && *it == '0' && leading_zeros < MaxDigits)
                {
                    ++it;
                    ++leading_zeros;
                }
+#endif
            }

            typedef typename
@@ -366,6 +390,7 @@ namespace boost { namespace spirit { namespace qi { namespace detail
#define SPIRIT_NUMERIC_INNER_LOOP(z, x, data)                                 \
        if (it == last)                                                       \
            break;                                                            \
+        SPIRIT_SO_SKIP_UNDERSCORE_HACK()                                      \
        ch = *it;                                                             \
        if (!radix_check::is_valid(ch))                                       \
            break;                                                            \
@@ -399,12 +424,25 @@ namespace boost { namespace spirit { namespace qi { namespace detail
            std::size_t count = 0;
            if (!Accumulate)
            {
+#if defined(ALLOW_SO_UNDERSCORE_HACK)
+                // skip leading zeros
+                for(;it != last;++it) {
+                    if ('0' == *it) {
+                        ++count;
+                        continue;
+                    } else if ('_' == *it) {
+                        continue;
+                    }
+                    break;
+                }
+#else
                // skip leading zeros
                while (it != last && *it == '0')
                {
                    ++it;
                    ++count;
                }
+#endif

                if (it == last)
                {
@@ -472,6 +510,7 @@ namespace boost { namespace spirit { namespace qi { namespace detail
    };

#undef SPIRIT_NUMERIC_INNER_LOOP
+#undef SPIRIT_SO_SKIP_UNDERSCORE_HACK

    ///////////////////////////////////////////////////////////////////////////
    // Cast an signed integer to an unsigned integer
Crinite answered 19/3, 2015 at 1:20 Comment(8)
Thank you for the very nifty work you have done here. I, unfortunately, can't use it unless I copy all the relevant headers (need to work with existing boost installs). But I really appreciate the effort you put in, and the understanding it has given me. - Petersham
You are aware that Spirit is header-only, right? Also, even if you went the "deep" way and created your own parser directive, you could opt to fit that in ~2 headers of your own. Even if you don't, why don't you just pre-include your version of that header? Include guards do the rest... - Crinite
Here's a live demo: numeric_utils.hpp (patched); Here's the program with the hack disabled: main.cpp and here it is with the hack enabled: main.cpp. Now if I can do this "with existing boost installs" on an online compiler service, I think it's fair to say you have this option too :) - Crinite
@Petersham What do you think? (feedback appreciated) - Crinite
I'm sorry. You obviously did a good amount of work to produce your (quite informative) answer, so the least I can do is give you my underlying reasons. 1) As you say, it's a header-only library, so I can just copy and modify a header. But the next time qi is updated in a future version of boost, I may need to copy and modify it again. And it might not be me that owns the code any more. 2) I don't like copying code that I don't understand well. It's fine code (I assume), but it is very dense, and quite opaque. More in my next comment... - Petersham
In point of fact, this is not the only problem I have run into attempting to use qi. In fact, I grew so frustrated with it, that I resolved to not use it at all. A day later, I calmed down, and am exploring it again, but I still have some problems with it. 1) The documentation is shallow. It is not bad. In fact, it's better than much boost documentation. But it is shallow. For example, I could not find out from the documentation that the third bool argument to an action function object was a pass variable. There are many important details left out. 2) The code is very opaque. More... - Petersham
When documentation fails, read the source. But the source is so heavily template-meta-programmed, it's difficult to find anything. I searched for a very long time before I was able to determine what file used the third argument of an action object. I do not fault the design; this is what makes qi flexible and fast. But it makes learning by inspection very, very difficult. In summary, I probably really need a book to understand it, or many questions on its mailing list. But I do thank you for your answer, and I am still playing with using spirit in my parser. - Petersham
Oh dear lord. Who told you that you needed all that :( Semantic actions use Boost Phoenix actors, which are specifically designed to make interacting with parser context user friendly. All that cruft is really library implementation detail. In fact you've inspired me to write another answer showing you the completely naive and slightly smarter Pure-Qi approaches. - Crinite
