In my application, I want a "pre-parsing" phase in which I adjust the token stream before a Qi parser sees it.
One way to do this would be a "lexer adaptor" that is constructed from a lexer and is itself a lexer, wrapping and modifying the behavior of the inner lexer. However, it would be simpler and easier to debug if I instead lexed the entire input stream with the inner lexer first, stored the results in a std::vector<token_type>, modified them as desired, and then passed the result to the parser. (In my application I don't think there would be any performance concern with this.)
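To make the intended pipeline concrete, here is a minimal sketch of "lex everything, adjust the buffer, then parse from iterators" in plain C++ with no Spirit involved. The names Token, TokId, lex_all, drop_blanks and parse_identifier are all hypothetical stand-ins, not part of any library:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for an attributed lexer token: an id plus the matched text.
enum class TokId { Alpha, Num, Blank, Other };

struct Token {
    TokId id;
    std::string text;
};

// Phase 1: lex the whole input eagerly into a buffer (one token per character).
std::vector<Token> lex_all(const std::string& input) {
    std::vector<Token> out;
    for (char c : input) {
        unsigned char u = static_cast<unsigned char>(c);
        TokId id = std::isalpha(u) ? TokId::Alpha
                 : std::isdigit(u) ? TokId::Num
                 : std::isspace(u) ? TokId::Blank
                                   : TokId::Other;
        out.push_back(Token{id, std::string(1, c)});
    }
    return out;
}

// Phase 2: the "pre-parsing" adjustment; this example simply drops blanks.
std::vector<Token> drop_blanks(const std::vector<Token>& in) {
    std::vector<Token> out;
    for (const Token& t : in) {
        if (t.id != TokId::Blank) out.push_back(t);
    }
    return out;
}

// Phase 3: a toy parser that only ever sees a pair of token iterators.
// It matches one Alpha token followed by any number of Alpha or Num tokens.
template <typename It>
bool parse_identifier(It& first, It last, std::string& out) {
    if (first == last || first->id != TokId::Alpha) return false;
    out.clear();
    while (first != last &&
           (first->id == TokId::Alpha || first->id == TokId::Num)) {
        out += first->text;
        ++first;
    }
    return true;
}
```

The parser in phase 3 never learns where the tokens came from, which is the property I would like to get from Qi as well.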
In an email exchange from a few years back, someone asked exactly this question and Hartmut said that it should be trivial: http://comments.gmane.org/gmane.comp.parsers.spirit.general/24899
However, I didn't find any code examples or instructions on how to do this beyond "look at the headers in spirit::lex and figure it out". That will likely occupy me for quite a while unless you, dear reader, can assist.
The specific question is: how can I make a "shim" lexer which wraps a pair of std::vector<token_type>::iterators and looks to spirit::qi just like a standard spirit::lex lexer?
Edit: To be clear, this is not a duplicate of this question: Using Boost.Spirit.Qi with custom lexer
My token_types are attributed, and the details of the extra things that Hartmut says I need to do are the substance of this question.
Edit: Okay, I made an SSCCE. This version does not have attributed lexer tokens, but even without that I can't get it to work yet, so it seems as good a starting point as any.
Highlights:
"Token buffer" type:
template<typename TokenType>
struct token_buffer {
std::vector<TokenType> tokens_;
token_buffer() = default;
bool operator()(const TokenType & t) {
tokens_.push_back(t);
return true;
}
void print(std::ostream & o) const { ... }
};
My first attempt at making a "buffer lexer" which looks like a lex::lexer to Qi, but in fact serves tokens from a buffer. This one derives from lex_basic (defined in the complete source below); I'm not sure if that's correct.
template<typename LexerType>
class buffer_lexer : public lex_basic<LexerType> {
public:
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator iterator_type;
private:
const buff_type & buff_;
public:
buffer_lexer(const buff_type & b) : lex_basic<LexerType>(), buff_(b) {}
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is needed
template<typename T>
iterator_type begin(T, T) { return begin(); }
};
My second attempt at making a buffer lexer. This one does not derive from lex_basic and instead tries to follow these instructions found in the header boost/spirit/home/lex/lexer/lexertl/lexer.hpp:
///////////////////////////////////////////////////////////////////////////
//
// Every lexer type to be used as a lexer for Spirit has to conform to
// the following public interface:
//
// typedefs:
// iterator_type The type of the iterator exposed by this lexer.
// token_type The type of the tokens returned from the exposed
// iterators.
//
// functions:
// default constructor
// Since lexers are instantiated as base classes
// only it might be a good idea to make this
// constructor protected.
// begin, end Return a pair of iterators, when dereferenced
// returning the sequence of tokens recognized in
// the input stream given as the parameters to the
// begin() function.
// add_token Should add the definition of a token to be
// recognized by this lexer.
// clear Should delete all current token definitions
// associated with the given state of this lexer
// object.
//
// template parameters:
// Iterator The type of the iterator used to access the
// underlying character stream.
// Token The type of the tokens to be returned from the
// exposed token iterator.
// Functor The type of the InputPolicy to use to instantiate
// the multi_pass iterator type to be used as the
// token iterator (returned from begin()/end()).
//
///////////////////////////////////////////////////////////////////////////
Here's the "buffer_lexer_raw" that I came up with:
template<typename Iterator,
typename TokenType,
typename Functor = lex::lexertl::functor<TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
typedef TokenType token_type;
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator iterator_type;
typedef typename boost::detail::iterator_traits<typename token_type::iterator_type>::value_type char_type;
private:
buff_type buff_;
public:
buffer_lexer_raw() {}
void set_buffer(const buff_type & b) { buff_ = b; }
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is needed
template<typename T>
iterator_type begin(T, T) { return begin(); }
std::size_t add_token(char_type const* state, char_type tokendef,
std::size_t token_id, char_type const* targetstate)
{
return 1;
}
void clear(char_type const* state) {}
};
The test code responds to a macro defined at the top of the file.
// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of our lex:: api calls
#define WHICH_LEXER_TYPE 0
The test code will:
1. Parse the input the "easy" way with lex::tokenize_and_parse, and dump the resulting AST.
2. Lex the input into a buffer, then parse from the buffered tokens with qi::parse. It will check that the resulting AST is the same as the AST generated the "easy" way.
Currently the #define WHICH_LEXER_TYPE 0 option compiles and works great for me with both gcc-4.8 and clang-3.6.
I can't actually get it to compile with the #define WHICH_LEXER_TYPE 1 or #define WHICH_LEXER_TYPE 2 options. With type 1, clang gives the following error message, which I don't have the foggiest idea about:
In file included from main.cpp:1:
In file included from /usr/include/boost/spirit/include/lex_lexertl.hpp:16:
In file included from /usr/include/boost/spirit/home/lex/lexer_lexertl.hpp:15:
In file included from /usr/include/boost/spirit/home/lex.hpp:13:
In file included from /usr/include/boost/spirit/home/lex/lexer.hpp:14:
In file included from /usr/include/boost/spirit/home/lex/lexer/token_def.hpp:21:
In file included from /usr/include/boost/spirit/home/lex/reference.hpp:16:
/usr/include/boost/spirit/home/qi/reference.hpp:43:30: error: no matching member function for call to 'parse'
return ref.get().parse(first, last, context, skipper, attr);
~~~~~~~~~~^~~~~
/usr/include/boost/spirit/home/qi/parse.hpp:86:42: note: in instantiation of function template specialization 'boost::spirit::qi::reference<const
boost::spirit::qi::rule<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > >, ast::Body (),
boost::spirit::locals<std::basic_string<char>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
boost::spirit::unused_type, boost::spirit::unused_type> >::parse<__gnu_cxx::__normal_iterator<const
boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >, boost::spirit::context<boost::fusion::cons<ast::Body &, boost::fusion::nil>,
boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, boost::spirit::unused_type,
ast::Body>' requested here
return compile<qi::domain>(expr).parse(first, last, context, unused, attr);
^
main.cpp:414:12: note: in instantiation of function template specialization 'boost::spirit::qi::parse<__gnu_cxx::__normal_iterator<const
boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >,
basic_grammar<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > >, ast::Body>' requested here
if (!qi::parse(it, fin, bgram, tree2)) {
^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:273:14: note: candidate function [with Context = boost::spirit::context<boost::fusion::cons<ast::Body &,
boost::fusion::nil>, boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, Skipper =
boost::spirit::unused_type, Attribute = ast::Body] not viable: no known conversion from '__gnu_cxx::__normal_iterator<const
boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >' to
'boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *,
std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > &' for 1st argument
bool parse(Iterator& first, Iterator const& last
^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:319:14: note: candidate function template not viable: requires 6 arguments, but 5 were provided
bool parse(Iterator& first, Iterator const& last
^
1 error generated.
The "2" option gives essentially the same error message. gcc doesn't seem to give a better error message.
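For what it's worth, the last "candidate function ... not viable" note suggests that the rule's parse() wants the lexer's own iterator type as its first argument, while qi::parse was handed a std::vector const_iterator. Below is a minimal sketch of that kind of mismatch in plain C++; toy_rule, lexer_iterator and is_parsable_with are hypothetical names made up for illustration and are not Spirit types:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for the iterator type that a lexer exposes.
struct lexer_iterator {
    const int* p;
};

// Toy rule: like a qi::rule, it is parameterised on one iterator type, and
// its parse() takes exactly that type by reference.
template <typename Iterator>
struct toy_rule {
    bool parse(Iterator& first, Iterator const& last) const {
        (void)first;
        (void)last;
        return true;
    }
};

// Detects whether Rule::parse can be called with a pair of It iterators.
template <typename Rule, typename It, typename = void>
struct is_parsable_with : std::false_type {};

template <typename Rule, typename It>
struct is_parsable_with<
    Rule, It,
    decltype(void(std::declval<const Rule&>().parse(
        std::declval<It&>(), std::declval<It const&>())))>
    : std::true_type {};
```

With these definitions, is_parsable_with<toy_rule<lexer_iterator>, std::vector<int>::const_iterator>::value is false, while a toy_rule instantiated on the vector iterator accepts it. If the analogy holds, the grammar would have to be instantiated with whatever iterator type the buffer lexer actually hands out.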
Here's the complete source code:
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/variant/get.hpp>
#include <boost/variant/variant.hpp>
#include <boost/variant/recursive_variant.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <iostream>
#include <string>
#include <vector>
typedef unsigned int uint;
namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace mpl = boost::mpl;
// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of
// our lex:: api calls
#define WHICH_LEXER_TYPE 0
//// Lexer definition
enum tokenids {
LCARET = lex::min_token_id + 10,
RCARET,
BSLASH,
LBRACE,
RBRACE,
LPAREN,
RPAREN,
EQUALS,
USCORE,
ALPHA,
NUM,
EOL,
BLANK,
IDANY
};
#define TOKEN_CASE(X) \
case X: return #X
const char *token_id_string(size_t id) {
switch (id) {
TOKEN_CASE(LCARET);
TOKEN_CASE(RCARET);
TOKEN_CASE(BSLASH);
TOKEN_CASE(LBRACE);
TOKEN_CASE(RBRACE);
TOKEN_CASE(LPAREN);
TOKEN_CASE(RPAREN);
TOKEN_CASE(EQUALS);
TOKEN_CASE(USCORE);
TOKEN_CASE(ALPHA);
TOKEN_CASE(NUM);
TOKEN_CASE(EOL);
TOKEN_CASE(BLANK);
TOKEN_CASE(IDANY);
default:
return "Unknown token";
}
}
template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
lex_basic() {
this->self.add
('<', LCARET)
('>', RCARET)
('/', BSLASH)
('{', LBRACE)
('}', RBRACE)
('(', LPAREN)
(')', RPAREN)
('=', EQUALS)
('_', USCORE)
("[A-Za-z]", ALPHA)
("[0-9]", NUM)
('\n', EOL)
("[ \\t\\r]", BLANK)
(".", IDANY);
}
};
typedef std::string::const_iterator str_it;
// the token type needs to know the iterator type of the underlying
// input and the set of used token value types
typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;
template <typename TokenType> struct token_buffer {
std::vector<TokenType> tokens_;
token_buffer() = default;
bool operator()(const TokenType &t) {
tokens_.push_back(t);
return true;
}
void print(std::ostream &o) const {
o << "tokens_.size() == " << tokens_.size() << std::endl;
for (size_t i = 0; i < tokens_.size(); ++i) {
const TokenType &t = tokens_[i];
o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
<< "\" [";
const auto &v = t.value();
if (t.id() == EOL) {
o << "\\n";
} else {
o << v;
}
o << "]" << std::endl;
}
}
};
/***
* Lexers which serve tokens from a buffer
*/
// Two versions of the same thing, one deriving from lex::lexer, one not
template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
public:
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator iterator_type;
private:
const buff_type &buff_;
public:
buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is
// needed
template <typename T> iterator_type begin(T, T) { return begin(); }
};
template <typename Iterator, typename TokenType,
typename Functor = lex::lexertl::functor<
TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
typedef TokenType token_type;
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator iterator_type;
typedef typename boost::detail::iterator_traits<
typename token_type::iterator_type>::value_type char_type;
private:
buff_type buff_;
public:
buffer_lexer_raw() {}
void set_buffer(const buff_type &b) { buff_ = b; }
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is
// needed
template <typename T> iterator_type begin(T, T) { return begin(); }
std::size_t add_token(char_type const *state, char_type tokendef,
std::size_t token_id, char_type const *targetstate) {
return 1;
}
void clear(char_type const *state) {}
};
/***
* AST
*/
namespace ast {
typedef std::string Str;
struct BraceExpr;
typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;
struct BraceExpr {
std::vector<BraceExprArg> args;
};
typedef std::pair<Str, Str> Pair;
struct Body;
typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;
struct Body {
Str key;
std::vector<Node> nodes;
};
} // end namespace ast
BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
(std::vector<ast::BraceExprArg>, args))
BOOST_FUSION_ADAPT_STRUCT(ast::Body,
(ast::Str, key)(std::vector<ast::Node>, nodes))
namespace ast {
// Stream ops
class printer : public boost::static_visitor<> {
std::ostream &ss_;
uint indent_;
std::string indent(uint extra = 0) const {
return std::string(indent_ + extra, ' ');
}
std::string indent_plus_tab() const { return indent(tab_width); }
public:
static constexpr uint tab_width = 4;
explicit printer(std::ostream &s, uint indent = 0)
: ss_(s), indent_(indent) {}
void operator()(const Str &s) const { ss_ << s; }
void operator()(const BraceExpr &b) const {
ss_ << "{";
for (size_t i = 0; i < b.args.size(); ++i) {
if (i) {
ss_ << " ";
}
boost::apply_visitor(*this, b.args[i]);
}
ss_ << "}";
}
void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }
void operator()(const Body &b) const {
ss_ << indent() << "<" << b.key << ">\n";
printer p{ss_, indent_ + tab_width};
for (const auto &n : b.nodes) {
ss_ << indent_plus_tab();
boost::apply_visitor(p, n);
ss_ << "\n";
}
ss_ << indent() << "</" << b.key << ">";
}
};
std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
printer p{ss};
p(b);
return ss;
}
std::ostream &operator<<(std::ostream &ss, const Pair &p) {
printer pr{ss};
pr(p);
return ss;
}
std::ostream &operator<<(std::ostream &ss, const Body &b) {
printer p{ss};
p(b);
return ss;
}
// Equality ops
bool operator==(const Pair &p1, const Pair &p2) {
return p1.first == p2.first && p1.second == p2.second;
}
bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
return b1.args == b2.args;
}
bool operator==(const Body &b1, const Body &b2) {
return b1.key == b2.key && b1.nodes == b2.nodes;
}
bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
return !(b1 == b2);
}
bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
} // end namespace ast
/***
* Grammar
*/
template <typename Iterator>
struct basic_grammar
: qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
qi::rule<Iterator, ast::Node()> node;
qi::rule<Iterator, ast::Pair()> pair;
qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
qi::rule<Iterator, ast::BraceExpr()> brace_expr;
qi::rule<Iterator, ast::Str()> identifier;
qi::rule<Iterator, ast::Str()> str;
qi::rule<Iterator, ast::Str()> open_tag;
qi::rule<Iterator /*, ast::Str()*/> close_tag;
qi::rule<Iterator> lbrace;
qi::rule<Iterator> rbrace;
qi::rule<Iterator> equals;
qi::rule<Iterator> ws;
template <typename TokenDef>
basic_grammar(const TokenDef &tok)
: basic_grammar::base_type(body, "body") {
using namespace qi;
ws %= token(BLANK) | token(EOL);
lbrace %= token(LBRACE);
rbrace %= token(RBRACE);
equals %= token(EQUALS);
identifier %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
str %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
token(RPAREN) | token(ALPHA) | token(NUM) | token(USCORE) |
token(EQUALS) | token(BLANK) | token(IDANY));
open_tag %= omit[token(LCARET)] >> identifier >>
omit[token(RCARET)]; // tok.open_tag;
close_tag %= omit[token(LCARET) >> token(BSLASH)] >> identifier >>
omit[token(RCARET)]; // tok.close_tag;
pair = skip(boost::proto::deep_copy(ws))[identifier >> equals >> str];
body = skip(boost::proto::deep_copy(ws))[open_tag >> *node >> close_tag];
node = brace_expr | body | pair;
brace_expr_arg = brace_expr | identifier;
brace_expr =
skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
}
};
/***
* Usage / Tests
*/
// use actor_lexer<> here if your token definitions have semantic
// actions
typedef lex::lexertl::lexer<token_type> lexer_type;
// this is the iterator exposed by the lexer, we use this for parsing
typedef lexer_type::iterator_type iterator_type;
token_buffer<token_type> test_lexer(const std::string &input,
bool silent = false) {
str_it s = input.begin();
str_it end = input.end();
// create a lexer instance
lex_basic<lexer_type> lex;
token_buffer<token_type> buff;
if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
if (!silent) {
std::cout << "\nTokenizing failed!" << std::endl;
}
} else {
if (!silent) {
std::cout << "\nTokenizing succeeded!" << std::endl;
}
}
if (!silent) {
buff.print(std::cout);
}
return buff;
}
void test_grammar(const std::string &input) {
lex_basic<lexer_type> lex;
basic_grammar<iterator_type> gram{lex};
ast::Body tree;
{
str_it s = input.begin();
str_it end = input.end();
if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
std::cout << "\nParsing failed!" << std::endl;
} else {
std::cout << "\nParsing succeeded!" << std::endl;
}
std::cout << tree << std::endl;
}
// Now try to do it in two steps, with buffered lexer
auto buff = test_lexer(input, true); // get buffer, silence output
#if WHICH_LEXER_TYPE == 1
buffer_lexer<lexer_type> blex{buff.tokens_};
#else
#if WHICH_LEXER_TYPE == 2
buffer_lexer_raw<str_it, token_type> blex;
blex.set_buffer(buff.tokens_);
#else
lex_basic<lexer_type> blex;
#endif
#endif
basic_grammar<iterator_type> bgram{blex};
ast::Body tree2;
{
#if (WHICH_LEXER_TYPE == 1) || (WHICH_LEXER_TYPE == 2)
auto it = blex.begin();
#else
str_it s = input.begin();
str_it end = input.end();
auto it = blex.begin(s, end);
#endif
auto fin = blex.end();
if (!qi::parse(it, fin, bgram, tree2)) {
std::cout << "\nBuffered parsing failed!" << std::endl;
} else {
std::cout << "\nBuffered parsing succeeded!" << std::endl;
}
}
std::cout << tree2 << std::endl;
if (tree != tree2) {
std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
<< std::endl;
}
}
int main() {
std::string input{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"</asdf>\n"};
test_lexer(input);
// Use lexer and grammar at once as demonstrated in tutorials
std::string input2 = "<asdf></asdf>";
test_grammar(input2);
test_grammar(input);
std::string input3{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"<jkl>\n"
"baz = gaz\n"
"{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
"</jkl>\n"
"</asdf>\n"};
test_grammar(input3);
return 0;
}
I too thought multi-pass was to blame, but after much fiddling, I was able to get it to work with 2 easy fixes ¹
template <typename Iterator, typename TokenType,
typename Functor = lex::lexertl::functor<
TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
typedef TokenType token_type;
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator vec_iterator_type;
public:
struct iterator_type : vec_iterator_type {
typedef vec_iterator_type base_iterator_type;
using vec_iterator_type::vec_iterator_type;
};
typedef char char_type;
This ensures that the nested iterator_type itself has a base_iterator_type typedef. This appears to be required somewhere down in the bowels of the library (likely due to assumptions about token iterators).
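The idea can be tried in isolation. The following is a minimal sketch, not Spirit code: Token, token_iterator and has_base_iterator_type are hypothetical names, and the sketch assumes that std::vector's const_iterator is a class type (true for the common standard libraries, though not guaranteed by the standard):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical token stand-in; the real code uses lex::lexertl::token<...>.
struct Token {
    int id;
};

typedef std::vector<Token>::const_iterator vec_iterator_type;

// Thin wrapper: behaves like the vector iterator, but also advertises a
// nested base_iterator_type. Assumes vec_iterator_type is a class type.
struct token_iterator : vec_iterator_type {
    typedef vec_iterator_type base_iterator_type;
    token_iterator() = default;
    // Converting constructor so that buff.cbegin() can be returned directly.
    token_iterator(vec_iterator_type it) : vec_iterator_type(it) {}
};

// Helper that maps any well-formed type to void.
template <typename T>
struct make_void {
    typedef void type;
};

// Trait that checks for the nested typedef, roughly the kind of requirement
// generic library code can place on an iterator.
template <typename T, typename = void>
struct has_base_iterator_type : std::false_type {};

template <typename T>
struct has_base_iterator_type<
    T, typename make_void<typename T::base_iterator_type>::type>
    : std::true_type {};
```

This sketch spells out a converting constructor instead of inheriting the base constructors, because copy constructors are not inherited and the wrapper has to be creatable from a plain vector iterator.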
The second fix is where the grammar is actually instantiated: don't use the "plain" iterator, but the one we just defined:
basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};
Fully working listing:
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/variant/get.hpp>
#include <boost/variant/variant.hpp>
#include <boost/variant/recursive_variant.hpp>
#include <boost/preprocessor/stringize.hpp>
#include <iostream>
#include <string>
#include <vector>
typedef unsigned int uint;
namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace mpl = boost::mpl;
//// Lexer definition
enum tokenids {
LCARET = lex::min_token_id + 10,
RCARET,
BSLASH,
LBRACE,
RBRACE,
LPAREN,
RPAREN,
EQUALS,
USCORE,
ALPHA,
NUM,
EOL,
BLANK,
IDANY
};
#define TOKEN_CASE(X) \
case X: return #X
const char *token_id_string(size_t id) {
switch (id) {
TOKEN_CASE(LCARET);
TOKEN_CASE(RCARET);
TOKEN_CASE(BSLASH);
TOKEN_CASE(LBRACE);
TOKEN_CASE(RBRACE);
TOKEN_CASE(LPAREN);
TOKEN_CASE(RPAREN);
TOKEN_CASE(EQUALS);
TOKEN_CASE(USCORE);
TOKEN_CASE(ALPHA);
TOKEN_CASE(NUM);
TOKEN_CASE(EOL);
TOKEN_CASE(BLANK);
TOKEN_CASE(IDANY);
default:
return "Unknown token";
}
}
template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
lex_basic() {
this->self.add
('<', LCARET)
('>', RCARET)
('/', BSLASH)
('{', LBRACE)
('}', RBRACE)
('(', LPAREN)
(')', RPAREN)
('=', EQUALS)
('_', USCORE)
("[A-Za-z]", ALPHA)
("[0-9]", NUM)
('\n', EOL)
("[ \\t\\r]", BLANK)
(".", IDANY);
}
};
typedef std::string::const_iterator str_it;
// the token type needs to know the iterator type of the underlying
// input and the set of used token value types
typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;
template <typename TokenType> struct token_buffer {
std::vector<TokenType> tokens_;
token_buffer() = default;
bool operator()(const TokenType &t) {
tokens_.push_back(t);
return true;
}
void print(std::ostream &o) const {
o << "tokens_.size() == " << tokens_.size() << std::endl;
for (size_t i = 0; i < tokens_.size(); ++i) {
const TokenType &t = tokens_[i];
o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
<< "\" [";
const auto &v = t.value();
if (t.id() == EOL) {
o << "\\n";
} else {
o << v;
}
o << "]" << std::endl;
}
}
};
/***
* Lexers which serve tokens from a buffer
*/
// Two versions of the same thing, one deriving from lex::lexer, one not
template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
public:
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator iterator_type;
private:
const buff_type &buff_;
public:
buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is
// needed
template <typename T> iterator_type begin(T, T) { return begin(); }
};
template <typename Iterator, typename TokenType,
typename Functor = lex::lexertl::functor<
TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
typedef TokenType token_type;
typedef std::vector<token_type> buff_type;
typedef typename buff_type::const_iterator vec_iterator_type;
public:
struct iterator_type : vec_iterator_type {
typedef vec_iterator_type base_iterator_type;
using vec_iterator_type::vec_iterator_type;
};
typedef char char_type;
private:
buff_type buff_;
public:
buffer_lexer_raw() {}
void set_buffer(const buff_type &b) { buff_ = b; }
iterator_type begin() const { return buff_.begin(); }
iterator_type end() const { return buff_.end(); }
// for consistency with regular lexer `begin` signature, not sure if this is
// needed
template <typename T> iterator_type begin(T, T) { return begin(); }
std::size_t add_token(char_type const*, char_type, std::size_t, char_type const*) {
return 1;
}
void clear(char_type const *) {}
};
/***
* AST
*/
namespace ast {
typedef std::string Str;
struct BraceExpr;
typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;
struct BraceExpr {
std::vector<BraceExprArg> args;
};
typedef std::pair<Str, Str> Pair;
struct Body;
typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;
struct Body {
Str key;
std::vector<Node> nodes;
};
} // end namespace ast
BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
(std::vector<ast::BraceExprArg>, args))
BOOST_FUSION_ADAPT_STRUCT(ast::Body,
(ast::Str, key)(std::vector<ast::Node>, nodes))
namespace ast {
// Stream ops
class printer : public boost::static_visitor<> {
std::ostream &ss_;
uint indent_;
std::string indent(uint extra = 0) const { return std::string(indent_ + extra, ' '); }
std::string indent_plus_tab() const { return indent(tab_width); }
public:
static constexpr uint tab_width = 4;
explicit printer(std::ostream &s, uint indent = 0)
: ss_(s), indent_(indent) {}
void operator()(const Str &s) const { ss_ << s; }
void operator()(const BraceExpr &b) const {
ss_ << "{";
for (size_t i = 0; i < b.args.size(); ++i) {
if (i) {
ss_ << " ";
}
boost::apply_visitor(*this, b.args[i]);
}
ss_ << "}";
}
void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }
void operator()(const Body &b) const {
ss_ << indent() << "<" << b.key << ">\n";
printer p{ss_, indent_ + tab_width};
for (const auto &n : b.nodes) {
ss_ << indent_plus_tab();
boost::apply_visitor(p, n);
ss_ << "\n";
}
ss_ << indent() << "</" << b.key << ">";
}
};
std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
printer p{ss};
p(b);
return ss;
}
std::ostream &operator<<(std::ostream &ss, const Pair &p) {
printer pr{ss};
pr(p);
return ss;
}
std::ostream &operator<<(std::ostream &ss, const Body &b) {
printer p{ss};
p(b);
return ss;
}
// Equality ops
bool operator==(const Pair &p1, const Pair &p2) {
return p1.first == p2.first && p1.second == p2.second;
}
bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
return b1.args == b2.args;
}
bool operator==(const Body &b1, const Body &b2) {
return b1.key == b2.key && b1.nodes == b2.nodes;
}
bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
return !(b1 == b2);
}
bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
} // end namespace ast
/***
* Grammar
*/
template <typename Iterator>
struct basic_grammar : qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
qi::rule<Iterator, ast::Node()> node;
qi::rule<Iterator, ast::Pair()> pair;
qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
qi::rule<Iterator, ast::BraceExpr()> brace_expr;
qi::rule<Iterator, ast::Str()> identifier;
qi::rule<Iterator, ast::Str()> str;
qi::rule<Iterator, ast::Str()> open_tag;
qi::rule<Iterator /*, ast::Str()*/> close_tag;
qi::rule<Iterator> lbrace;
qi::rule<Iterator> rbrace;
qi::rule<Iterator> equals;
qi::rule<Iterator> ws;
template <typename TokenDef>
basic_grammar(const TokenDef &tok) : basic_grammar::base_type(body, "body") {
using namespace qi;
ws %= token(BLANK) | token(EOL);
lbrace %= token(LBRACE);
rbrace %= token(RBRACE);
equals %= token(EQUALS);
identifier %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
str %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
token(RPAREN) | token(ALPHA) | token(NUM) | token(USCORE) |
token(EQUALS) | token(BLANK) | token(IDANY));
open_tag %= omit[token(LCARET)] >> identifier >> omit[token(RCARET)]; // tok.open_tag;
close_tag %= omit[token(LCARET) >> token(BSLASH)] >> identifier >> omit[token(RCARET)]; // tok.close_tag;
// TODO FIXME: the deep_copy should not be required here
/// bla_12 = somevalue
pair = skip(boost::proto::deep_copy(ws)) [ identifier >> equals >> str ] ;
/// <bla><sub>{some}{braced{expres}}sions</sub><pair1>key1=value</pair1></bla>
body = skip(boost::proto::deep_copy(ws)) [ open_tag >> *node >> close_tag ] ;
///
node = brace_expr | body | pair;
brace_expr_arg = brace_expr | identifier;
/// {{{bla}some{other}nested{id{entifier}s}}and such}
brace_expr = skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
}
};
/***
* Usage / Tests
*/
// use actor_lexer<> here if your token definitions have semantic
// actions
typedef lex::lexertl::lexer<token_type> lexer_type;
// this is the iterator exposed by the lexer, we use this for parsing
typedef lexer_type::iterator_type iterator_type;
token_buffer<token_type> test_lexer(const std::string &input,
bool silent = false) {
str_it s = input.begin();
str_it end = input.end();
// create a lexer instance
lex_basic<lexer_type> lex;
token_buffer<token_type> buff;
if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
if (!silent) {
std::cout << "\nTokenizing failed!" << std::endl;
}
} else {
if (!silent) {
std::cout << "\nTokenizing succeeded!" << std::endl;
}
}
if (!silent) {
buff.print(std::cout);
}
return buff;
}
void test_grammar(const std::string &input) {
lex_basic<lexer_type> lex;
basic_grammar<iterator_type> gram{lex};
ast::Body tree;
{
str_it s = input.begin();
str_it end = input.end();
if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
std::cout << "\nParsing failed!" << std::endl;
} else {
std::cout << "\nParsing succeeded!" << std::endl;
}
std::cout << tree << std::endl;
}
// Now try to do it in two steps, with buffered lexer
auto buff = test_lexer(input, true); // get buffer, silence output
typedef buffer_lexer_raw<str_it, token_type> concrete_lexer_type;
concrete_lexer_type blex;
blex.set_buffer(buff.tokens_);
basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};
ast::Body tree2;
{
auto it = blex.begin();
auto fin = blex.end();
if (!qi::parse(it, fin, bgram, tree2)) {
std::cout << "\nBuffered parsing failed!" << std::endl;
} else {
std::cout << "\nBuffered parsing succeeded!" << std::endl;
}
}
std::cout << tree2 << std::endl;
if (tree != tree2) {
std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
<< std::endl;
}
}
int main() {
std::string const input{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"</asdf>\n"};
test_lexer(input);
// Use lexer and grammar at once as demonstrated in tutorials
std::string const input2 = "<asdf></asdf>";
test_grammar(input2);
test_grammar(input);
std::string const input3{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"<jkl>\n"
"baz = gaz\n"
"{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
"</jkl>\n"
"</asdf>\n"};
test_grammar(input3);
}
Printing:
Tokenizing succeeded!
tokens_.size() == 53
[0]: -LCARET- "65546" [<]
[1]: -ALPHA- "65555" [a]
[2]: -ALPHA- "65555" [s]
[3]: -ALPHA- "65555" [d]
[4]: -ALPHA- "65555" [f]
[5]: -RCARET- "65547" [>]
[6]: -EOL- "65557" [\n]
[7]: -ALPHA- "65555" [f]
[8]: -ALPHA- "65555" [o]
[9]: -ALPHA- "65555" [o]
[10]: -BLANK- "65558" [ ]
[11]: -EQUALS- "65553" [=]
[12]: -BLANK- "65558" [ ]
[13]: -ALPHA- "65555" [b]
[14]: -ALPHA- "65555" [a]
[15]: -ALPHA- "65555" [r]
[16]: -EOL- "65557" [\n]
[17]: -LBRACE- "65549" [{]
[18]: -ALPHA- "65555" [F]
[19]: -BLANK- "65558" [ ]
[20]: -ALPHA- "65555" [f]
[21]: -ALPHA- "65555" [o]
[22]: -ALPHA- "65555" [o]
[23]: -RBRACE- "65550" [}]
[24]: -EOL- "65557" [\n]
[25]: -LBRACE- "65549" [{]
[26]: -ALPHA- "65555" [G]
[27]: -BLANK- "65558" [ ]
[28]: -LBRACE- "65549" [{]
[29]: -ALPHA- "65555" [F]
[30]: -BLANK- "65558" [ ]
[31]: -ALPHA- "65555" [f]
[32]: -ALPHA- "65555" [o]
[33]: -ALPHA- "65555" [o]
[34]: -RBRACE- "65550" [}]
[35]: -BLANK- "65558" [ ]
[36]: -LBRACE- "65549" [{]
[37]: -ALPHA- "65555" [H]
[38]: -BLANK- "65558" [ ]
[39]: -ALPHA- "65555" [b]
[40]: -ALPHA- "65555" [a]
[41]: -ALPHA- "65555" [r]
[42]: -RBRACE- "65550" [}]
[43]: -RBRACE- "65550" [}]
[44]: -EOL- "65557" [\n]
[45]: -LCARET- "65546" [<]
[46]: -BSLASH- "65548" [/]
[47]: -ALPHA- "65555" [a]
[48]: -ALPHA- "65555" [s]
[49]: -ALPHA- "65555" [d]
[50]: -ALPHA- "65555" [f]
[51]: -RCARET- "65547" [>]
[52]: -EOL- "65557" [\n]
Parsing succeeded!
<asdf>
</asdf>
Buffered parsing succeeded!
<asdf>
</asdf>
Parsing succeeded!
<asdf>
foo = bar
{F foo}
{G {F foo} {H bar}}
</asdf>
Buffered parsing succeeded!
<asdf>
foo = bar
{F foo}
{G {F foo} {H bar}}
</asdf>
Parsing succeeded!
<asdf>
foo = bar
{F foo}
{G {F foo} {H bar}}
<jkl>
baz = gaz
{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
</jkl>
</asdf>
Buffered parsing succeeded!
<asdf>
foo = bar
{F foo}
{G {F foo} {H bar}}
<jkl>
baz = gaz
{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
</jkl>
</asdf>
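For readers who want the shape of the two-step flow without the Spirit machinery, here is a minimal standard-library-only sketch of the same idea: lex everything into a std::vector first, adjust the buffer, then hand a pair of vector iterators to a parser. It does not use Boost.Spirit at all; Token, TokenId, lex_all, strip_whitespace, parse_brace_expr and parse_buffered are hypothetical names invented for this illustration, and the toy grammar covers only the brace expressions from the test inputs.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token ids, loosely mirroring a few of those printed above.
enum class TokenId { LBrace, RBrace, Alpha, Blank, Eol, Other };

// Hypothetical token: an id plus the single character it matched.
struct Token {
    TokenId id;
    char text;
};

// Phase 1: lex the whole input eagerly into a buffer.
std::vector<Token> lex_all(const std::string &input) {
    std::vector<Token> out;
    out.reserve(input.size());
    for (char c : input) {
        TokenId id = TokenId::Other;
        if (c == '{') id = TokenId::LBrace;
        else if (c == '}') id = TokenId::RBrace;
        else if (c == '\n') id = TokenId::Eol;
        else if (c == ' ' || c == '\t') id = TokenId::Blank;
        else if (std::isalpha(static_cast<unsigned char>(c))) id = TokenId::Alpha;
        out.push_back({id, c});
    }
    return out;
}

// Phase 2: adjust the buffered token stream before any parser sees it.
// Here the adjustment is simply dropping whitespace tokens.
void strip_whitespace(std::vector<Token> &tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const Token &t) {
                                    return t.id == TokenId::Blank ||
                                           t.id == TokenId::Eol;
                                }),
                 tokens.end());
}

using TokenIt = std::vector<Token>::const_iterator;

// Phase 3: a toy recursive-descent parser over the token iterators.
// Grammar: brace_expr := LBrace (brace_expr | Alpha)* RBrace
bool parse_brace_expr(TokenIt &it, TokenIt end) {
    if (it == end || it->id != TokenId::LBrace) return false;
    ++it;  // consume LBrace
    while (it != end && it->id != TokenId::RBrace) {
        if (it->id == TokenId::Alpha) {
            ++it;
        } else if (!parse_brace_expr(it, end)) {
            return false;
        }
    }
    if (it == end) return false;  // missing closing brace
    ++it;                         // consume RBrace
    return true;
}

// Run all three phases; succeed only if the whole buffer was consumed.
bool parse_buffered(const std::string &input) {
    std::vector<Token> tokens = lex_all(input);
    strip_whitespace(tokens);
    TokenIt it = tokens.cbegin();
    return parse_brace_expr(it, tokens.cend()) && it == tokens.cend();
}
```

Calling parse_buffered("{G {F foo} {H bar}}\n") runs all three phases in order. The point of the design is that the middle phase is an ordinary operation on a std::vector, so it can be inspected and debugged independently of both the lexer and the parser.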
¹ Based on the buffer_lexer_raw approach.