Spirit X3: parser with internal state

I want to efficiently parse large CSV-like files whose column order I only learn at runtime. With Spirit Qi, I would parse each field with a lazy auxiliary parser that selects at runtime which column-specific parser to apply to each column. But X3 doesn't seem to have lazy (even though it's listed in the documentation). After reading recommendations here on SO, I've decided to write a custom parser.

It ended up being pretty nice, but now I've noticed I don't really need the pos variable to be exposed anywhere outside the custom parser itself. I tried moving it into the parser and started getting compiler errors stating that the column_value_parser object is read-only. Can I somehow put pos into the parser structure?

Simplified code that gets the compile-time error, with commented out parts of my working version:

#include <iostream>
#include <variant>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // size_t& pos;
    size_t pos;

    // column_value_parser(std::vector<column_variant>& columns, size_t& pos)
    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    //    , pos(pos)
        , pos(0)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& ctx, Other const& other, Attr& attr) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text& c) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer& c) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real& c) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main(int argc, char *argv[])
{
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    // Comes from external source.
    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    size_t pos = 0;

    boost::spirit::x3::parse(
        input.begin(), input.end(),
//         (column_value_parser(columns, pos) % ',') % boost::spirit::x3::eol);
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}

XY: My goal is to parse ~500 GB of pseudo-CSV files in a reasonable time on a machine with little RAM, convert them into a list of (roughly) [row-number, column-name, value], then put that into storage. The format is actually a little more complex than CSV: database dumps formatted in a… human-friendly way, with column values actually being several small sublanguages (e.g. dates or, uh, something similar to whole apache log lines stuffed into a single field), and I'm often extracting only one specific part of each column. Different files may have different columns, in a different order, which I can only learn by parsing yet another set of files containing the original queries. Thankfully, Spirit makes it a breeze…

Brad answered 12/6, 2018 at 16:31 Comment(3)
The elephant in the room: why not export in, say, XML or work on the database instead of a dump? – Pharaoh
@sehe, Oh, I wish it was done in a saner way… being just a side hobby project, I only convinced owners of that dataset to give me what they can easily inspect manually with little technical skills. It's still better than XLS, which they use on their own. – Brad
I'd very much parse into custom AST nodes first (e.g. in shared/mapped memory containers) and implement all logic from there. – Pharaoh

Three answers:

  1. The easiest fix is to make pos a mutable member
  2. The X3 hardcore answer is x3::with<>
  3. Functional composition

1. Making pos mutable

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;
    // mutable: parse() is const, yet it has to advance the column index
    mutable size_t pos = 0;

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx& /*ctx*/, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (column_value_parser(columns) % ',') % boost::spirit::x3::eol);
}
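
The only functional change from the question's code is the mutable keyword on pos. A minimal sketch of what that keyword does, stripped of all Spirit machinery (round_robin and next are hypothetical names used only for illustration, not part of X3 or of the answer's code):

```cpp
#include <cstddef>

// Hypothetical stand-in for column_value_parser: only the counter logic is kept.
struct round_robin {
    std::size_t size;             // number of columns to cycle through
    mutable std::size_t pos = 0;  // mutable exempts this member from const-ness

    explicit round_robin(std::size_t n) : size(n) {}

    // A const member function, like the parser's parse(). It returns the
    // current column index and advances to the next one. Without mutable
    // on pos, the assignment below is rejected because *this is const here.
    std::size_t next() const {
        std::size_t const current = pos;
        pos = (pos + 1) % size;
        return current;
    }
};
```

Calling next() four times on a round_robin const r(3) yields 0, 1, 2 and then 0 again. Keep in mind that every copy of such an object carries its own counter; as far as I know X3 expressions store their operands by value, so the counter that advances is the one inside the composed expression, not the one in the original object.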

2. x3::with<>

This is similar but with better (re)entrancy and encapsulation:

#include <iostream>
#include <string>
#include <variant>
#include <vector>

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/support.hpp>

namespace helpers {
    // https://bitbashing.io/std-visit.html
    template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
    template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
}

auto const unquoted_text_field = *(boost::spirit::x3::char_ - ',' - boost::spirit::x3::eol);

struct text { };
struct integer { };
struct real { };
struct skip { };
typedef std::variant<text, integer, real, skip> column_variant;

struct column_value_parser : boost::spirit::x3::parser<column_value_parser> {
    typedef boost::spirit::unused_type attribute_type;

    std::vector<column_variant>& columns;

    column_value_parser(std::vector<column_variant>& columns)
        : columns(columns)
    { }

    template<typename It, typename Ctx, typename Other, typename Attr>
    bool parse(It& f, It l, Ctx const& ctx, Other const& /*other*/, Attr& /*attr*/) const {
        auto const saved_f = f;
        bool successful = false;

        size_t& pos = boost::spirit::x3::get<pos_tag>(ctx).value;

        visit(
            helpers::overloaded {
                [&](skip const&) {
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::omit[unquoted_text_field]);
                },
                [&](text&) {
                    std::string value;
                    successful = boost::spirit::x3::parse(f, l, unquoted_text_field, value);
                    if(successful) {
                        std::cout << "Text: " << value << '\n';
                    }
                },
                [&](integer&) {
                    int value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::int_, value);
                    if(successful) {
                        std::cout << "Integer: " << value << '\n';
                    }
                },
                [&](real&) {
                    double value;
                    successful = boost::spirit::x3::parse(f, l, boost::spirit::x3::double_, value);
                    if(successful) {
                        std::cout << "Real: " << value << '\n';
                    }
                }
            },
            columns[pos]);

        if(successful) {
            pos = (pos + 1) % columns.size();
            return true;
        } else {
            f = saved_f;
            return false;
        }
    }

    template <typename T>
    struct Mutable { T mutable value; };
    struct pos_tag;

    auto invoke() const {
        return boost::spirit::x3::with<pos_tag>(Mutable<size_t>{}) [ *this ];
    }
};


int main() {
    std::string input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";

    std::vector<column_variant> columns = {text{}, integer{}, real{}, skip{}};
    column_value_parser p(columns);

    boost::spirit::x3::parse(
        input.begin(), input.end(),
        (p.invoke() % ',') % boost::spirit::x3::eol);
}
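
The idea behind this variant is that the parser object itself stays stateless and the counter travels with each individual parse. A rough plain-C++ sketch of that separation follows; it is not how x3::with is implemented, and parse_context and column_picker are hypothetical names chosen for the illustration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-parse state. It plays the role of the value that the
// answer attaches to the X3 context under pos_tag.
struct parse_context {
    std::size_t pos = 0;
};

// Stateless stand-in for the parser: it only refers to the column list.
struct column_picker {
    std::vector<std::string> const& names;

    // const member function: all mutation goes through the context argument.
    std::string const& next(parse_context& ctx) const {
        std::string const& name = names[ctx.pos];
        ctx.pos = (ctx.pos + 1) % names.size();
        return name;
    }
};
```

Two parse_context objects used with the same column_picker advance independently, which is the re-entrancy benefit mentioned above. In the answer's X3 code the context is taken as Ctx const&, which is presumably why the value is wrapped in a struct with a mutable member; this sketch passes the context by non-const reference instead.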

3. Functional Composition

Because it's so much easier in X3, my favourite is to just generate the parser on demand.

Without requirements, this is the simplest I'd propose:

#include <boost/spirit/home/x3.hpp>
namespace x3 = boost::spirit::x3;

namespace CSV {
    struct text    { };
    struct integer { };
    struct real    { };
    struct skip    { };

    auto const unquoted_text_field = *~x3::char_(",\n");
    static inline auto as_parser(skip)    { return x3::omit[unquoted_text_field]; }
    static inline auto as_parser(text)    { return unquoted_text_field;           }
    static inline auto as_parser(integer) { return x3::int_;                      }
    static inline auto as_parser(real)    { return x3::double_;                   }

    template <typename... Spec>
    static inline auto line_parser(Spec... spec) {
        auto delim = ',' | &(x3::eoi | x3::eol);
        return ((as_parser(spec) >> delim) >> ... >> x3::eps);
    }

    template <typename... Spec> static inline auto csv_parser(Spec... spec) {
        return line_parser(spec...) % x3::eol;
    }
}

#include <iostream>
#include <iomanip>
using namespace CSV;

int main() {
    std::string const input = "Hello,1,13.7,XXX\nWorld,2,1e3,YYY";
    auto f = begin(input), l = end(input);

    auto p = csv_parser(text{}, integer{}, real{}, skip{});

    if (parse(f, l, p)) {
        std::cout << "Parsed\n";
    } else {
        std::cout << "Failed\n";
    }

    if (f!=l) {
        std::cout << "Remaining: " << std::quoted(std::string(f,l)) << "\n";
    }
}

A version with debug information enabled:

<line>
  <try>Hello,1,13.7,XXX\nWor</try>
  <CSV::text>
    <try>Hello,1,13.7,XXX\nWor</try>
    <success>,1,13.7,XXX\nWorld,2,</success>
  </CSV::text>
  <CSV::integer>
    <try>1,13.7,XXX\nWorld,2,1</try>
    <success>,13.7,XXX\nWorld,2,1e</success>
  </CSV::integer>
  <CSV::real>
    <try>13.7,XXX\nWorld,2,1e3</try>
    <success>,XXX\nWorld,2,1e3,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>XXX\nWorld,2,1e3,YYY</try>
    <success>\nWorld,2,1e3,YYY</success>
  </CSV::skip>
  <success>\nWorld,2,1e3,YYY</success>
</line>
<line>
  <try>World,2,1e3,YYY</try>
  <CSV::text>
    <try>World,2,1e3,YYY</try>
    <success>,2,1e3,YYY</success>
  </CSV::text>
  <CSV::integer>
    <try>2,1e3,YYY</try>
    <success>,1e3,YYY</success>
  </CSV::integer>
  <CSV::real>
    <try>1e3,YYY</try>
    <success>,YYY</success>
  </CSV::real>
  <CSV::skip>
    <try>YYY</try>
    <success></success>
  </CSV::skip>
  <success></success>
</line>
Parsed
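
The functional-composition version above fixes the column types at compile time. When the column list only arrives at runtime, the same "build the row parser once, then reuse it" idea can be approximated with type erasure. The following is a sketch in plain standard C++ rather than X3, under the assumption that fields are comma-separated and unquoted; every name in the sketch namespace is hypothetical, and it illustrates structure, not performance:

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of the answer's code

struct text {}; struct integer {}; struct real {}; struct skip {};
using column_variant = std::variant<text, integer, real, skip>;

// One parsed field; std::monostate stands for a skipped column.
using value = std::variant<std::monostate, std::string, int, double>;
using field_parser = std::function<bool(std::string_view, value&)>;

// Converts the whole field with conv (std::stoi or std::stod) and rejects
// fields that are only partially consumed, such as "13.7" for an integer.
template <typename Conv>
bool convert(std::string_view field, Conv conv, value& out) {
    try {
        std::string const s(field);
        std::size_t used = 0;
        auto const v = conv(s, &used);
        if (used != s.size()) return false;
        out = v;
        return true;
    } catch (std::exception const&) {
        return false;  // std::stoi/std::stod throw on malformed or out-of-range input
    }
}

// Picks the field parser for one column; runs once per column, not per field.
field_parser make_field_parser(column_variant const& column) {
    return std::visit([](auto tag) -> field_parser {
        using T = decltype(tag);
        if constexpr (std::is_same_v<T, text>) {
            return [](std::string_view f, value& out) { out = std::string(f); return true; };
        } else if constexpr (std::is_same_v<T, integer>) {
            return [](std::string_view f, value& out) {
                return convert(f, [](std::string const& s, std::size_t* n) { return std::stoi(s, n); }, out);
            };
        } else if constexpr (std::is_same_v<T, real>) {
            return [](std::string_view f, value& out) {
                return convert(f, [](std::string const& s, std::size_t* n) { return std::stod(s, n); }, out);
            };
        } else {  // skip
            return [](std::string_view, value& out) { out = std::monostate{}; return true; };
        }
    }, column);
}

std::vector<field_parser> make_row_parser(std::vector<column_variant> const& columns) {
    std::vector<field_parser> parsers;
    parsers.reserve(columns.size());
    for (auto const& c : columns) parsers.push_back(make_field_parser(c));
    return parsers;
}

// Parses one line (without its newline). No counter is needed: the position
// in the parser list is a local loop variable, so a failed row leaves no state behind.
bool parse_row(std::string_view line, std::vector<field_parser> const& parsers,
               std::vector<value>& out) {
    out.clear();
    for (std::size_t i = 0; i < parsers.size(); ++i) {
        std::size_t const comma = line.find(',');
        bool const last = (i + 1 == parsers.size());
        if (last != (comma == std::string_view::npos)) return false;  // wrong field count
        value v;
        if (!parsers[i](line.substr(0, comma), v)) return false;
        out.push_back(std::move(v));
        if (!last) line.remove_prefix(comma + 1);
    }
    return true;
}

}  // namespace sketch
```

A caller would build the list once with make_row_parser(columns) and then call parse_row for every line. The std::function indirection and the temporary std::string in convert cost something per field, so for the data sizes mentioned in the question this would need measuring before being relied on.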

Notes, Caveats:

  • With anything mutable, beware of side-effects. E.g. if you have a | b and a includes column_value_parser, the side-effect of incrementing pos will not be rolled back when a fails and b is matched instead.

    In short, this makes your parse function impure.
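
    The caveat can be reproduced without Spirit. The toy below is a sketch, not X3: counting_literal and sequence are hypothetical names, and sequence mimics a branch that restores the input position when it fails, which is what lets an alternative try its next branch from the same spot:

```cpp
#include <cstddef>
#include <string_view>

// Toy stand-in for a stateful parser: matches one literal and counts successes.
struct counting_literal {
    std::string_view word;
    mutable std::size_t matched = 0;  // the side effect, like pos in the answer

    bool parse(std::string_view& in) const {
        if (in.substr(0, word.size()) != word) return false;
        in.remove_prefix(word.size());
        ++matched;
        return true;
    }
};

// Toy sequence "a then b". On failure the input is restored, but any counter
// that a has already incremented stays incremented.
inline bool sequence(counting_literal const& a, counting_literal const& b,
                     std::string_view& in) {
    std::string_view const saved = in;
    if (a.parse(in) && b.parse(in)) return true;
    in = saved;    // the input position is rolled back ...
    return false;  // ... but a.matched is not
}
```

    Running sequence with the literals "ab" and "cd" on the input "abXX" fails and leaves the input untouched, yet the first literal's counter is already 1. With the column parser, the equivalent effect would be pos pointing at the wrong column for the branch that is tried next.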

Pharaoh answered 12/6, 2018 at 19:36 Comment(6)
I was thinking of making it mutable, but I assumed x3 constifies the parser for a good reason. If you say it's ok, then great! This is a hobby project, so hardcore methods are even better though (-: I'll wait for the 3rd choice, though I wonder: I hope building the parser on demand won't make it slower? – Brad
Have you seen the caveat? Yes, x3 constifies for good reasons. Mainly separation of state and static logic, meaning the compiler has a good handle on optimizing. Whether generating on demand makes it slower depends, exclusively, on when you generate and how often you re-use. – Pharaoh
Out of curiosity, and so any sample I concoct later can have real life applicability, what's the goal? You're now just printing and "throwing" away the attributes. I'm assuming there's a target data structure. – Pharaoh
Without requirements, this is the simplest I'd propose: wandbox.org/permlink/jXKEAskTctV2iDVT - In the absence of a known goal, I'll refrain from either parsing into adapted structs or parsing into generic containers based on truly dynamic CSV spec. – Pharaoh
Updated the answer. Also see e.g. this older CSV example in X3 vs. Qi. The context for that was this discussion. Note that the performance figures there are probably skewed. – Pharaoh
I've seen the caveat, and was worried there's more behind it; I don't expect this parser to backtrack assuming that I know upfront what field type to parse. I see that the third approach requires knowing types at compile time; but I assume concatenating parsers with > at runtime should be good as well? Got curious, I'll try it on my dataset. Added some description of my goals to the question. – Brad
