How can I match UTF-8 Unicode characters using boost::spirit?
For example, I want to recognize all characters in this string:
$ echo "На берегу пустынных волн" | ./a.out
Н а б е р е г у п у с т ы н н ы х в о л н
When I try this simple boost::spirit program, it does not match the Unicode characters correctly:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

int main() {
    // Keep whitespace in the stream so the skipper can handle it
    std::cin.unsetf(std::ios::skipws);
    boost::spirit::istream_iterator begin(std::cin);
    boost::spirit::istream_iterator end;

    std::vector<char> letters;
    bool result = qi::phrase_parse(
        begin, end,   // input
        +qi::char_,   // match every character (one char, i.e. one byte, at a time)
        qi::space,    // skip whitespace
        letters);     // result

    BOOST_FOREACH(char letter, letters) {
        std::cout << letter << " ";
    }
    std::cout << std::endl;
}
It behaves like this:
$ echo "На берегу пустынных волн" | ./a.out | less
<D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0>
<B2> <D0> <BE> <D0> <BB> <D0> <BD>
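For context, the bytes above are the UTF-8 encoding of the input: each Cyrillic letter takes two bytes (Н is D0 9D), and a char-based parser such as qi::char_ consumes them one byte at a time. Below is a minimal sketch of the decoding step that turns those bytes into code points, using only the standard library. The name decode_utf8 is a hypothetical helper, not part of Boost, and the sketch assumes well-formed input (it does no validation):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): decode UTF-8 bytes into code points.
// Assumption: the input is well-formed UTF-8. Overlong forms, stray
// continuation bytes and truncated sequences are not detected.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

With this, decode_utf8("\xD0\x9D") yields the single code point 1053 (U+041D, Н).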
UPDATE:
Okay, I worked on this a bit more, and the following code is sort of working. It first converts the input into an iterator of 32-bit unicode characters (as recommended here):
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/support_istream_iterator.hpp>
#include <boost/foreach.hpp>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main() {
    // The literal holds UTF-8 bytes (assumes this source file is saved as UTF-8)
    std::string str = "На берегу пустынных волн";

    // Adapt the byte iterators so that dereferencing yields 32-bit code points
    boost::u8_to_u32_iterator<std::string::const_iterator>
        begin(str.begin()), end(str.end());

    typedef boost::uint32_t uchar; // a Unicode code point
    std::vector<uchar> letters;
    bool result = qi::phrase_parse(
        begin, end,                 // input
        +qi::standard_wide::char_,  // match every character
        qi::space,                  // skip whitespace
        letters);                   // result

    BOOST_FOREACH(uchar letter, letters) {
        std::cout << letter << " "; // prints the numeric code point
    }
    std::cout << std::endl;
}
The code prints the Unicode code points:
$ ./a.out
1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085
which seems to be correct, according to the official Unicode table.
Now, can anyone tell me how to print the actual characters instead, given this vector of Unicode code points?
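One possible approach, sketched here with the standard library only, is to re-encode each code point as UTF-8 and write the resulting bytes to the terminal. The names append_utf8 and to_utf8 are hypothetical helpers, not Boost APIs, and the sketch assumes every value is a valid code point (at most 0x10FFFF, no surrogates) and that the terminal expects UTF-8:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): append one code point as UTF-8.
// Assumption: cp is a valid Unicode scalar value; no range checks are done.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole vector of code points.
std::string to_utf8(const std::vector<std::uint32_t>& code_points) {
    std::string out;
    for (std::size_t i = 0; i < code_points.size(); ++i) {
        append_utf8(out, code_points[i]);
    }
    return out;
}
```

Writing to_utf8(letters) to std::cout should then show the text on a UTF-8 terminal. The header <boost/regex/pending/unicode_iterator.hpp> already included above also provides boost::u32_to_u8_iterator, which may serve as a ready-made alternative to the hand-written encoder.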
boost::spirit::unicode is used here (boost-spirit.com/dl_more/scheme/scheme_v0.2/sexpr.hpp), but I don't know what Spirit version this needs. Mine is from Boost 1.49, and it doesn't have boost::spirit::unicode. – Doxology
#define BOOST_SPIRIT_UNICODE – Prichard