Boost Spirit (X3) symbol tables resulting in UTF8 strings

I'm trying to parse LaTeX escape codes (e.g. \alpha) into the corresponding Unicode (mathematical) characters (e.g. U+1D6FC).

Right now this means I am using this symbols parser (rule):

struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
  greek_lower_case_letters_()  // no class-name qualification inside the class body
  {
    add("alpha",   U'\u03B1');
  }
} greek_lower_case_letter;

This works fine, but it means I get a std::u32string as a result. I'd like an elegant way to keep the Unicode code points in the code, both for possible future automation and for maintainability. Is there a way to get this kind of parser to parse into a UTF-8 std::string?

I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization and all that).

I was hoping there was some elegant way, instead of jumping through a bunch of hoops to get this working (symbols appending strings to the result).

I do fear, though, that using the code point values while wanting UTF-8 will incur a runtime cost for the conversion (or is a constexpr UTF-32 -> UTF-8 conversion possible?).
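
To make the constexpr part of the question concrete, here is a minimal sketch of a compile-time UTF-32 to UTF-8 encoder. It assumes C++14 or later and does not use Boost. The names utf8_char, to_utf8 and as_string are hypothetical helpers invented for this illustration, and the encoder does not reject surrogates or values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical value type (not from Boost): the UTF-8 bytes of one code
// point plus how many of the four slots are used.
struct utf8_char
{
    char bytes[4] = {};
    std::size_t size = 0;
};

// Encode one code point as UTF-8. Sketch only: surrogates and values above
// U+10FFFF are not rejected.
constexpr utf8_char to_utf8(char32_t cp)
{
    utf8_char r{};
    if (cp < 0x80) {
        r.bytes[0] = static_cast<char>(cp);
        r.size = 1;
    } else if (cp < 0x800) {
        r.bytes[0] = static_cast<char>(0xC0 | (cp >> 6));
        r.bytes[1] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 2;
    } else if (cp < 0x10000) {
        r.bytes[0] = static_cast<char>(0xE0 | (cp >> 12));
        r.bytes[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.bytes[2] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 3;
    } else {
        r.bytes[0] = static_cast<char>(0xF0 | (cp >> 18));
        r.bytes[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        r.bytes[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        r.bytes[3] = static_cast<char>(0x80 | (cp & 0x3F));
        r.size = 4;
    }
    return r;
}

// Evaluated by the compiler; the code point stays visible in the source.
constexpr utf8_char alpha = to_utf8(U'\u03B1');
static_assert(alpha.size == 2, "U+03B1 takes two UTF-8 bytes");

// Runtime convenience: copy the encoded bytes into a std::string.
inline std::string as_string(const utf8_char& c)
{
    return std::string(c.bytes, c.size);
}
```

With something like this, the table entries could still be written as code points while the stored values are already UTF-8 bytes. Copying those bytes into the parser's attribute would still happen at runtime.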

Illuse answered 18/12, 2015 at 20:50 Comment(0)

The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in UTF-8 encoding:

  auto push_utf8 = [](auto& ctx)
  {
     typedef std::back_insert_iterator<std::string> insert_iter;
     insert_iter out_iter(_val(ctx));
     boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
     *utf8_iter++ = _attr(ctx);
  };

  // ...

  auto const escape =
         ('u' > hex4)           [push_utf8]
     |   char_("\"\\/bfnrt")    [push_esc]
     ;

This is used in their

typedef x3::rule<unicode_string_class, std::string> unicode_string_type;

Which, as you can see, builds the UTF-8 sequence into a std::string attribute.

For the full code, see: https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp
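
For readers without Boost at hand, the effect of the semantic action can be approximated in plain standard C++. The sketch below is not the cierelabs code and not boost::utf8_output_iterator. write_utf8, append_utf8 and demo_append are hypothetical names, and no validation of the code point is done. It only illustrates the idea of pushing the encoded bytes through an output iterator onto the end of an existing std::string.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical helper: write one code point as UTF-8 through any output
// iterator and return the advanced iterator.
template <typename OutIt>
OutIt write_utf8(char32_t cp, OutIt out)
{
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Append to an existing string, the way the semantic action appends to
// the rule's std::string attribute.
inline void append_utf8(std::string& s, char32_t cp)
{
    write_utf8(cp, std::back_inserter(s));
}

// Usage sketch: existing contents are kept, the new bytes are appended.
inline void demo_append()
{
    std::string s = "x = ";
    append_utf8(s, U'\u03B1');  // GREEK SMALL LETTER ALPHA
    assert(s == "x = \xCE\xB1");
}
```

Appending rather than assigning is the point here: each parsed symbol adds its bytes to whatever the attribute already holds.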

Elmore answered 18/12, 2015 at 20:54 Comment(5)
I decided to use std::string as the symbol key/value type, and I'm trying to get the char_ rule to work as a sequence using the repeat directive. A comparison of the UTF-8 and UTF-32 versions is here. I don't understand why the second version fails after the first \alpha. - Illuse
@Illuse I'll look at that later tonight. - Elmore
@Illuse Interestingly, in my tests the first version failed after the first 'a'. It has to do with attribute propagation: if the symbols parser yields the same type (std::string) as the enclosing rule, the value gets assigned instead of appended (I feel this is a bug). So, instead, I'd use std::vector<char> as the attribute, which works correctly. Here's some cleaned-up code: coliru.stacked-crooked.com/a/b9555dfd246b5252 (note that the reinterpret_cast<> business looked wrong, so I changed it). - Elmore
@Illuse Maybe you should post this as a separate question. I'll try to remember to ask on the mailing list about this behaviour. The live stream is here: livecoding.tv/video/… (first part missing due to technical problems) - Elmore
I ended up choosing a user-defined string literal that creates a std::array. It avoids this maybe-bug, is (in principle) a compile-time code point -> UTF-8 conversion, and can be extended to composed characters without much fuss. The code I ended up with (for now) is here. I'm going to parse this into some AST representation, from which I'll synthesize some limited form of Qt's supported HTML for starters. Thanks for the insight though. - Illuse
