c++ c++11 boost boost-spirit-qi boost-spirit-lex

How to use boost::spirit::qi with a std::vector<token_type> instead of std::string


In an application, I basically want to have a "pre-parsing" phase where I adjust the token stream before a Qi parser can see it.

One way to do this would be to have some kind of "lexer adaptor" which is constructed from a lexer and is itself a lexer, which wraps and modifies the behavior of the inner lexer. However it would be simpler and easier to debug if instead I just lex the entire input stream with the inner lexer first and store the results in a std::vector<token_type>, then modify as desired, then pass the result to the parser. (In my application I don't think that there would even be any performance concern with this.)
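A minimal, library-free sketch of that two-phase idea, assuming a hypothetical `Token` struct and toy lexing rules (none of these names come from Spirit): lex everything into a `std::vector` first, adjust the buffer, and only then hand a pair of iterators to the consumer.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for an attributed lexer token (not a Spirit type).
enum TokenId { WORD, NUMBER, BLANK, OTHER };

struct Token {
    TokenId id;
    std::string text;
};

// Toy classification of a single character.
TokenId classify(char c) {
    const unsigned char u = static_cast<unsigned char>(c);
    if (std::isalpha(u)) return WORD;
    if (std::isdigit(u)) return NUMBER;
    if (std::isspace(u)) return BLANK;
    return OTHER;
}

// Phase 1: lex the whole input eagerly into a buffer.
std::vector<Token> lex_all(const std::string &input) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < input.size()) {
        const TokenId id = classify(input[i]);
        std::size_t j = i + 1;
        // Runs of letters, digits, or blanks form one token;
        // any other character is a token on its own.
        while (id != OTHER && j < input.size() && classify(input[j]) == id) ++j;
        out.push_back(Token{id, input.substr(i, j - i)});
        i = j;
    }
    return out;
}

// Phase 2: adjust the token stream before the parser sees it.
// Here the adjustment is simply dropping whitespace tokens.
void drop_blanks(std::vector<Token> &tokens) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [](const Token &t) { return t.id == BLANK; }),
                 tokens.end());
}

// Phase 3: the consumer only ever sees a pair of token iterators.
template <typename Iterator>
std::size_t count_words(Iterator first, Iterator last) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if (first->id == WORD) ++n;
    return n;
}
```

In this sketch `count_words(tokens.begin(), tokens.end())` plays the role the Qi parser would play in the real application; the open question is how to make Qi accept such a pair of buffer iterators.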

In an email exchange from a few years back, someone described exactly this question and Hartmut said that it should be trivial. http://comments.gmane.org/gmane.comp.parsers.spirit.general/24899

However, I didn't find any code examples or instructions on how to do this beyond "look at the headers in spirit::lex and figure it out." That will likely occupy me for quite a while unless you, dear reader, can assist.

The specific question is: how can I make a "shim" lexer which wraps a pair of std::vector<token_type>::iterators and looks to spirit::qi just like a standard spirit::lex lexer?

Edit: To be clear, this is not a duplicate of the question "Using Boost.Spirit.Qi with custom lexer". My token_types are attributed, and the details of the extra things that Hartmut says I need to do are the substance of this question.


Edit: Okay, I made an SSCCE. This version does not have attributed lexer tokens, but even without them I can't get it to work yet, so it seems like a good enough SSCCE to start from.

Highlights:

"Token buffer" type:

template<typename TokenType>
struct token_buffer {
    std::vector<TokenType> tokens_;

    token_buffer() = default;

    bool operator()(TokenType t) {
        tokens_.push_back(t);
        return true;
    }

    void print(std::ostream & o) const { ... }
};

My first attempt at making a "buffer lexer" which looks like a lex::lexer to Qi, but in fact serves tokens from a buffer. This one derives from lex_basic (defined in the complete source code below); I'm not sure if that's correct.

template<typename LexerType>
class buffer_lexer : public lex_basic<LexerType> {
public:
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator iterator_type;

private:
    const buff_type & buff_;

public:
    buffer_lexer(const buff_type & b) : lex_basic<LexerType>(), buff_(b) {}

    iterator_type begin() const { return buff_.begin(); }
    iterator_type end() const { return buff_.end(); }

    // for consistency with regular lexer `begin` signature, not sure if this is needed
    template<typename T>
    iterator_type begin(T, T) { return begin(); }
};

My second attempt at making a buffer lexer. This one does not derive from lex_basic and instead tries to follow these instructions found in the header boost/spirit/home/lex/lexer/lexertl/lexer.hpp:

///////////////////////////////////////////////////////////////////////////
//
//  Every lexer type to be used as a lexer for Spirit has to conform to
//  the following public interface:
//
//    typedefs:
//        iterator_type   The type of the iterator exposed by this lexer.
//        token_type      The type of the tokens returned from the exposed
//                        iterators.
//
//    functions:
//        default constructor
//                        Since lexers are instantiated as base classes
//                        only it might be a good idea to make this
//                        constructor protected.
//        begin, end      Return a pair of iterators, when dereferenced
//                        returning the sequence of tokens recognized in
//                        the input stream given as the parameters to the
//                        begin() function.
//        add_token       Should add the definition of a token to be
//                        recognized by this lexer.
//        clear           Should delete all current token definitions
//                        associated with the given state of this lexer
//                        object.
//
//    template parameters:
//        Iterator        The type of the iterator used to access the
//                        underlying character stream.
//        Token           The type of the tokens to be returned from the
//                        exposed token iterator.
//        Functor         The type of the InputPolicy to use to instantiate
//                        the multi_pass iterator type to be used as the
//                        token iterator (returned from begin()/end()).
//
///////////////////////////////////////////////////////////////////////////

Here's the "buffer_lexer_raw" that I came up with:

template<typename Iterator,
     typename TokenType,
     typename Functor = lex::lexertl::functor<TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
    typedef TokenType token_type;
    typedef std::vector<token_type> buff_type;
    typedef typename buff_type::const_iterator iterator_type;

    typedef typename boost::detail::iterator_traits<typename token_type::iterator_type>::value_type char_type;

private:
    buff_type buff_;

public:
    buffer_lexer_raw() {}

    void set_buffer(const buff_type & b) { buff_ = b; }

    iterator_type begin() const { return buff_.begin(); }
    iterator_type end() const { return buff_.end(); }

    // for consistency with regular lexer `begin` signature, not sure if this is needed
    template<typename T>
    iterator_type begin(T, T) { return begin(); }

    std::size_t add_token(char_type const* state, char_type tokendef,
            std::size_t token_id, char_type const* targetstate)
    {
        return 1;
    }

    void clear(char_type const* state) {}
};

The test code responds to a macro defined at the top of the file.

// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of our lex:: api calls
#define WHICH_LEXER_TYPE 0

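The test code will tokenize the input into a token_buffer and print it, parse the input in one step with lex::tokenize_and_parse, parse it again in two steps using the lexer type selected by the macro, and finally compare the two resulting trees.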

Currently the #define WHICH_LEXER_TYPE 0 option compiles and works great for me with both gcc-4.8 and clang-3.6.

I can't actually get it to compile with the #define WHICH_LEXER_TYPE 1 or #define WHICH_LEXER_TYPE 2 options. With type 1, clang gives the following error message which I don't have the foggiest idea about:

In file included from main.cpp:1:
In file included from /usr/include/boost/spirit/include/lex_lexertl.hpp:16:
In file included from /usr/include/boost/spirit/home/lex/lexer_lexertl.hpp:15:
In file included from /usr/include/boost/spirit/home/lex.hpp:13:
In file included from /usr/include/boost/spirit/home/lex/lexer.hpp:14:
In file included from /usr/include/boost/spirit/home/lex/lexer/token_def.hpp:21:
In file included from /usr/include/boost/spirit/home/lex/reference.hpp:16:
/usr/include/boost/spirit/home/qi/reference.hpp:43:30: error: no matching member function for call to 'parse'
            return ref.get().parse(first, last, context, skipper, attr);
                   ~~~~~~~~~~^~~~~
/usr/include/boost/spirit/home/qi/parse.hpp:86:42: note: in instantiation of function template specialization 'boost::spirit::qi::reference<const
      boost::spirit::qi::rule<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
      char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > >, ast::Body (),
      boost::spirit::locals<std::basic_string<char>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      boost::spirit::unused_type, boost::spirit::unused_type> >::parse<__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >, boost::spirit::context<boost::fusion::cons<ast::Body &, boost::fusion::nil>,
      boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, boost::spirit::unused_type,
      ast::Body>' requested here
        return compile<qi::domain>(expr).parse(first, last, context, unused, attr);
                                         ^
main.cpp:414:12: note: in instantiation of function template specialization 'boost::spirit::qi::parse<__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >,
      basic_grammar<boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const
      char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > >, ast::Body>' requested here
                if (!qi::parse(it, fin, bgram, tree2)) {
                         ^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:273:14: note: candidate function [with Context = boost::spirit::context<boost::fusion::cons<ast::Body &,
      boost::fusion::nil>, boost::spirit::locals<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na> >, Skipper =
      boost::spirit::unused_type, Attribute = ast::Body] not viable: no known conversion from '__gnu_cxx::__normal_iterator<const
      boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
      mpl_::bool_<true>, unsigned long> *, std::vector<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >,
      boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>,
      std::allocator<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, boost::mpl::vector<char, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long> > > >' to
      'boost::spirit::lex::lexertl::iterator<boost::spirit::lex::lexertl::functor<boost::spirit::lex::lexertl::token<__gnu_cxx::__normal_iterator<const char *,
      std::basic_string<char> >, boost::mpl::vector<char, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
      mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, mpl_::bool_<true>, unsigned long>, lexertl::detail::data,
      __gnu_cxx::__normal_iterator<const char *, std::basic_string<char> >, mpl_::bool_<false>, mpl_::bool_<true> > > &' for 1st argument
        bool parse(Iterator& first, Iterator const& last
             ^
/usr/include/boost/spirit/home/qi/nonterminal/rule.hpp:319:14: note: candidate function template not viable: requires 6 arguments, but 5 were provided
        bool parse(Iterator& first, Iterator const& last
             ^
1 error generated.

The "2" option gives essentially the same error message. gcc doesn't seem to give a better error message.

Here's the complete source code:

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>

#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/variant/get.hpp>
#include <boost/variant/variant.hpp>
#include <boost/variant/recursive_variant.hpp>
#include <boost/preprocessor/stringize.hpp>

#include <iostream>
#include <vector>
#include <string>

typedef unsigned int uint;

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace mpl = boost::mpl;

// Use the type "buffer_lexer" which derives from lex_basic<Lexer>
//#define WHICH_LEXER_TYPE 1
// Use the type "buffer_lexer_raw" which does not derive from anything
//#define WHICH_LEXER_TYPE 2
// Use the "placebo" lexer, which is just lex_basic<Lexer>, as a sanity test of
// our lex:: api calls
#define WHICH_LEXER_TYPE 0

//// Lexer definition

enum tokenids {
  LCARET = lex::min_token_id + 10,
  RCARET,
  BSLASH,
  LBRACE,
  RBRACE,
  LPAREN,
  RPAREN,
  EQUALS,
  USCORE,
  ALPHA,
  NUM,
  EOL,
  BLANK,
  IDANY
};

#define TOKEN_CASE(X)                                                          \
  case X: return #X

const char *token_id_string(size_t id) {
  switch (id) {
    TOKEN_CASE(LCARET);
    TOKEN_CASE(RCARET);
    TOKEN_CASE(BSLASH);
    TOKEN_CASE(LBRACE);
    TOKEN_CASE(RBRACE);
    TOKEN_CASE(LPAREN);
    TOKEN_CASE(RPAREN);
    TOKEN_CASE(EQUALS);
    TOKEN_CASE(USCORE);
    TOKEN_CASE(ALPHA);
    TOKEN_CASE(NUM);
    TOKEN_CASE(EOL);
    TOKEN_CASE(BLANK);
    TOKEN_CASE(IDANY);
  default:
    return "Unknown token";
  }
}

template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
  lex_basic() {
    this->self.add
        ('<', LCARET)
        ('>', RCARET)
        ('/', BSLASH)
        ('{', LBRACE)
        ('}', RBRACE)
        ('(', LPAREN)
        (')', RPAREN)
        ('=', EQUALS)
        ('_', USCORE)
        ("[A-Za-z]", ALPHA)
        ("[0-9]", NUM)
        ('\n', EOL)
        ("[ \\t\\r]", BLANK)
        (".", IDANY);
  }
};

typedef std::string::const_iterator str_it;
// the token type needs to know the iterator type of the underlying
// input and the set of used token value types
typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;

template <typename TokenType> struct token_buffer {
  std::vector<TokenType> tokens_;

  token_buffer() = default;

  bool operator()(TokenType t) {
    tokens_.push_back(t);
    return true;
  }

  void print(std::ostream &o) const {
    o << "tokens_.size() == " << tokens_.size() << std::endl;
    for (size_t i = 0; i < tokens_.size(); ++i) {
      const TokenType &t = tokens_[i];

      o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
        << "\" [";

      const auto &v = t.value();
      if (t.id() == EOL) {
        o << "\\n";
      } else {
        o << v;
      }
      o << "]" << std::endl;
    }
  }
};

/***
 * Lexers which serve tokens from a buffer
 */

// Two versions of the same thing, one deriving from lex::lexer, one not
template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
public:
  typedef std::vector<token_type> buff_type;
  typedef typename buff_type::const_iterator iterator_type;

private:
  const buff_type &buff_;

public:
  buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}

  iterator_type begin() const { return buff_.begin(); }
  iterator_type end() const { return buff_.end(); }

  // for consistency with regular lexer `begin` signature, not sure if this is
  // needed
  template <typename T> iterator_type begin(T, T) { return begin(); }
};

template <typename Iterator, typename TokenType,
          typename Functor = lex::lexertl::functor<
          TokenType, lex::lexertl::detail::data, Iterator>>
class buffer_lexer_raw {
  typedef TokenType token_type;
  typedef std::vector<token_type> buff_type;
  typedef typename buff_type::const_iterator iterator_type;

  typedef typename boost::detail::iterator_traits<
      typename token_type::iterator_type>::value_type char_type;

private:
  buff_type buff_;

public:
  buffer_lexer_raw() {}

  void set_buffer(const buff_type &b) { buff_ = b; }

  iterator_type begin() const { return buff_.begin(); }
  iterator_type end() const { return buff_.end(); }

  // for consistency with regular lexer `begin` signature, not sure if this is
  // needed
  template <typename T> iterator_type begin(T, T) { return begin(); }

  std::size_t add_token(char_type const *state, char_type tokendef,
        std::size_t token_id, char_type const *targetstate) {
    return 1;
  }

  void clear(char_type const *state) {}
};

/***
 * AST
 */

namespace ast {
typedef std::string Str;

struct BraceExpr;

typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;

struct BraceExpr {
  std::vector<BraceExprArg> args;
};

typedef std::pair<Str, Str> Pair;

struct Body;

typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;

struct Body {
  Str key;
  std::vector<Node> nodes;
};
} // end namespace ast

BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
          (std::vector<ast::BraceExprArg>, args))
BOOST_FUSION_ADAPT_STRUCT(ast::Body,
          (ast::Str, key)(std::vector<ast::Node>, nodes))

namespace ast {
// Stream ops
class printer : public boost::static_visitor<> {
  std::ostream &ss_;
  uint indent_;
  std::string indent(uint extra = 0) const {
    return std::string(indent_ + extra, ' ');
  }
  std::string indent_plus_tab() const { return indent(tab_width); }

public:
  static constexpr uint tab_width = 4;

  explicit printer(std::ostream &s, uint indent = 0)
      : ss_(s), indent_(indent) {}

  void operator()(const Str &s) const { ss_ << s; }
  void operator()(const BraceExpr &b) const {
    ss_ << "{";
    for (size_t i = 0; i < b.args.size(); ++i) {
      if (i) {
        ss_ << " ";
      }
      boost::apply_visitor(*this, b.args[i]);
    }
    ss_ << "}";
  }
  void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }

  void operator()(const Body &b) const {
    ss_ << indent() << "<" << b.key << ">\n";
    printer p{ss_, indent_ + tab_width};
    for (const auto &n : b.nodes) {
      ss_ << indent_plus_tab();
      boost::apply_visitor(p, n);
      ss_ << "\n";
    }
    ss_ << indent() << "</" << b.key << ">";
  }
};

std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
  printer p{ss};
  p(b);
  return ss;
}

std::ostream &operator<<(std::ostream &ss, const Pair &p) {
  printer pr{ss};
  pr(p);
  return ss;
}

std::ostream &operator<<(std::ostream &ss, const Body &b) {
  printer p{ss};
  p(b);
  return ss;
}

// Equality ops
bool operator==(const Pair &p1, const Pair &p2) {
  return p1.first == p2.first && p1.second == p2.second;
}
bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
  return b1.args == b2.args;
}
bool operator==(const Body &b1, const Body &b2) {
  return b1.key == b2.key && b1.nodes == b2.nodes;
}
bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
  return !(b1 == b2);
}
bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
} // end namespace ast

/***
 * Grammar
 */

template <typename Iterator>
struct basic_grammar
    : qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
  qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
  qi::rule<Iterator, ast::Node()> node;
  qi::rule<Iterator, ast::Pair()> pair;
  qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
  qi::rule<Iterator, ast::BraceExpr()> brace_expr;
  qi::rule<Iterator, ast::Str()> identifier;
  qi::rule<Iterator, ast::Str()> str;
  qi::rule<Iterator, ast::Str()> open_tag;
  qi::rule<Iterator /*, ast::Str()*/> close_tag;
  qi::rule<Iterator> lbrace;
  qi::rule<Iterator> rbrace;
  qi::rule<Iterator> equals;

  qi::rule<Iterator> ws;

  template <typename TokenDef>
  basic_grammar(const TokenDef &tok)
      : basic_grammar::base_type(body, "body") {
    using namespace qi;

    ws %= token(BLANK) | token(EOL);
    lbrace %= token(LBRACE);
    rbrace %= token(RBRACE);
    equals %= token(EQUALS);
    identifier %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
    str %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
         token(RPAREN) | token(ALPHA) | token(NUM) | token(USCORE) |
         token(EQUALS) | token(BLANK) | token(IDANY));
    open_tag %= omit[token(LCARET)] >> identifier >>
        omit[token(RCARET)]; // tok.open_tag;
    close_tag %= omit[token(LCARET) >> token(BSLASH)] >> identifier >>
         omit[token(RCARET)]; // tok.close_tag;

    pair = skip(boost::proto::deep_copy(ws))[identifier >> equals >> str];

    body = skip(boost::proto::deep_copy(ws))[open_tag >> *node >> close_tag];
    node = brace_expr | body | pair;

    brace_expr_arg = brace_expr | identifier;
    brace_expr =
        skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
  }
};

/***
 * Usage / Tests
 */

// use actor_lexer<> here if your token definitions have semantic
// actions
typedef lex::lexertl::lexer<token_type> lexer_type;

// this is the iterator exposed by the lexer, we use this for parsing
typedef lexer_type::iterator_type iterator_type;

token_buffer<token_type> test_lexer(const std::string &input,
        bool silent = false) {
  str_it s = input.begin();
  str_it end = input.end();

  // create a lexer instance
  lex_basic<lexer_type> lex;

  token_buffer<token_type> buff;
  if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
    if (!silent) {
      std::cout << "\nTokenizing failed!" << std::endl;
    }
  } else {
    if (!silent) {
      std::cout << "\nTokenizing succeeded!" << std::endl;
    }
  }

  if (!silent) {
    buff.print(std::cout);
  }
  return buff;
}

void test_grammar(const std::string &input) {
  lex_basic<lexer_type> lex;
  basic_grammar<iterator_type> gram{lex};
  ast::Body tree;

  {
    str_it s = input.begin();
    str_it end = input.end();

    if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
      std::cout << "\nParsing failed!" << std::endl;
    } else {
      std::cout << "\nParsing succeeded!" << std::endl;
    }

    std::cout << tree << std::endl;
  }

  // Now try to do it in two steps, with buffered lexer
  auto buff = test_lexer(input, true); // get buffer, silence output

#if WHICH_LEXER_TYPE == 1
  buffer_lexer<lexer_type> blex{buff.tokens_};
#else
#if WHICH_LEXER_TYPE == 2
  buffer_lexer_raw<str_it, token_type> blex;
  blex.set_buffer(buff.tokens_);
#else
  lex_basic<lexer_type> blex;
#endif
#endif

  basic_grammar<iterator_type> bgram{blex};
  ast::Body tree2;

  {
#if (WHICH_LEXER_TYPE == 1) || (WHICH_LEXER_TYPE == 2)
    auto it = blex.begin();
#else
    str_it s = input.begin();
    str_it end = input.end();
    auto it = blex.begin(s, end);
#endif

    auto fin = blex.end();

    if (!qi::parse(it, fin, bgram, tree2)) {
      std::cout << "\nBuffered parsing failed!" << std::endl;
    } else {
      std::cout << "\nBuffered parsing succeeded!" << std::endl;
    }
  }

  std::cout << tree2 << std::endl;

  if (tree != tree2) {
    std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
          << std::endl;
  }
}

int main() {
  std::string input{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"</asdf>\n"};

  test_lexer(input);

  // Use lexer and grammar at once as demonstrated in tutorials

  std::string input2 = "<asdf></asdf>";
  test_grammar(input2);

  test_grammar(input);

  std::string input3{""
"<asdf>\n"
"foo = bar\n"
"{F foo}\n"
"{G {F foo} {H bar}}\n"
"<jkl>\n"
"baz = gaz\n"
"{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
"</jkl>\n"
"</asdf>\n"};

  test_grammar(input3);

  return 0;
}

Solution

  • I too thought multi-pass was to blame, but after much fiddling, I was able to get it to work with 2 easy fixes ¹

    template <typename Iterator, typename TokenType,
              typename Functor = lex::lexertl::functor<
              TokenType, lex::lexertl::detail::data, Iterator>>
    class buffer_lexer_raw {
        typedef TokenType token_type;
        typedef std::vector<token_type> buff_type;
        typedef typename buff_type::const_iterator vec_iterator_type;
      public:
    
        struct iterator_type : vec_iterator_type {
            typedef vec_iterator_type base_iterator_type;
            using vec_iterator_type::vec_iterator_type;
        };
    
        typedef char char_type;
    

    This ensures that the nested iterator_type itself has a nested base_iterator_type typedef. This appears to be required somewhere down in the bowels of the library (likely due to assumptions about token iterators).
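As a library-free illustration of that first fix, here is a minimal sketch of an iterator wrapper that adds the nested typedef. The names `Token`, `token_iterator` and `sum_ids` are hypothetical, and the sketch assumes `std::vector<Token>::const_iterator` is a class type that can be derived from (true for the common standard library implementations, but not guaranteed by the standard). It uses explicit constructors rather than inherited ones so that conversion from the plain vector iterator does not depend on the rules for inheriting constructors.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical token type, standing in for lex::lexertl::token<...>.
struct Token {
    int id;
};

typedef std::vector<Token> buff_type;
typedef buff_type::const_iterator vec_iterator_type;

// Thin wrapper: behaves like the vector iterator, but also exposes
// a nested base_iterator_type typedef.
struct token_iterator : vec_iterator_type {
    typedef vec_iterator_type base_iterator_type;

    token_iterator() : vec_iterator_type() {}
    // Implicit conversion from the plain vector iterator, so that
    // begin()/end() of a buffer can be returned as token_iterator.
    token_iterator(vec_iterator_type it) : vec_iterator_type(it) {}
};

static_assert(std::is_same<token_iterator::base_iterator_type,
                           vec_iterator_type>::value,
              "wrapper must expose the underlying iterator type");

// Generic code can keep using the usual iterator operations.
int sum_ids(token_iterator first, token_iterator last) {
    int sum = 0;
    for (; first != last; ++first)
        sum += first->id;
    return sum;
}
```

Increment, dereference and comparison all come from the base class, so the wrapper only adds the typedef and the converting constructor.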

    The second fix is at the point where the grammar is actually instantiated: don't use the "plain" lexer iterator, but the one we just defined:

    basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};
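The reason this matters can be shown without Spirit. In the sketch below, `mini_rule` is a hypothetical miniature of a rule whose `parse` member accepts only the iterator type it was instantiated with, and `stream_token_iterator` is a hypothetical stand-in for the iterator a regular lexer exposes. Instantiating the rule with one iterator type and calling it with buffer iterators fails in the same way as the "no known conversion ... for 1st argument" note in the error message above.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical token type, standing in for the real lexer token.
struct Token {
    int id;
};

// Hypothetical miniature of a rule: parse() takes exactly the
// Iterator type the rule was instantiated with.
template <typename Iterator>
struct mini_rule {
    int wanted_id;

    bool parse(Iterator &first, Iterator const &last) const {
        if (first != last && first->id == wanted_id) {
            ++first;  // consume the matching token
            return true;
        }
        return false;
    }
};

// Iterator over the buffered tokens.
typedef std::vector<Token>::const_iterator buffer_iterator;

// Hypothetical stand-in for the iterator type of a regular lexer.
struct stream_token_iterator {};

// A rule instantiated for stream_token_iterator could not be fed
// buffer iterators: there is no conversion between the two types.
static_assert(!std::is_convertible<buffer_iterator,
                                   stream_token_iterator>::value,
              "buffer iterators do not convert to lexer iterators");

// Instantiating the rule with the buffer's own iterator type works.
bool parse_from_buffer(const std::vector<Token> &buf, int wanted_id) {
    mini_rule<buffer_iterator> rule = {wanted_id};
    buffer_iterator first = buf.begin();
    buffer_iterator last = buf.end();
    return rule.parse(first, last);
}
```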
    

    Fully working listing:

    #include <boost/spirit/include/lex_lexertl.hpp>
    #include <boost/spirit/include/qi.hpp>
    
    #include <boost/fusion/include/adapt_struct.hpp>
    #include <boost/fusion/include/std_pair.hpp>
    #include <boost/variant/get.hpp>
    #include <boost/variant/variant.hpp>
    #include <boost/variant/recursive_variant.hpp>
    #include <boost/preprocessor/stringize.hpp>
    
    #include <iostream>
    #include <vector>
    #include <string>
    
    typedef unsigned int uint;
    
    namespace lex = boost::spirit::lex;
    namespace qi  = boost::spirit::qi;
    namespace mpl = boost::mpl;
    
    //// Lexer definition
    
    enum tokenids {
      LCARET = lex::min_token_id + 10,
      RCARET,
      BSLASH,
      LBRACE,
      RBRACE,
      LPAREN,
      RPAREN,
      EQUALS,
      USCORE,
      ALPHA,
      NUM,
      EOL,
      BLANK,
      IDANY
    };
    
    #define TOKEN_CASE(X)                                                          \
      case X: return #X
    
    const char *token_id_string(size_t id) {
      switch (id) {
        TOKEN_CASE(LCARET);
        TOKEN_CASE(RCARET);
        TOKEN_CASE(BSLASH);
        TOKEN_CASE(LBRACE);
        TOKEN_CASE(RBRACE);
        TOKEN_CASE(LPAREN);
        TOKEN_CASE(RPAREN);
        TOKEN_CASE(EQUALS);
        TOKEN_CASE(USCORE);
        TOKEN_CASE(ALPHA);
        TOKEN_CASE(NUM);
        TOKEN_CASE(EOL);
        TOKEN_CASE(BLANK);
        TOKEN_CASE(IDANY);
      default:
        return "Unknown token";
      }
    }
    
    template <typename Lexer> struct lex_basic : lex::lexer<Lexer> {
      lex_basic() {
        this->self.add
            ('<', LCARET)
            ('>', RCARET)
            ('/', BSLASH)
            ('{', LBRACE)
            ('}', RBRACE)
            ('(', LPAREN)
            (')', RPAREN)
            ('=', EQUALS)
            ('_', USCORE)
            ("[A-Za-z]", ALPHA)
            ("[0-9]", NUM)
            ('\n', EOL)
            ("[ \\t\\r]", BLANK)
            (".", IDANY);
      }
    };
    
    typedef std::string::const_iterator str_it;
    // the token type needs to know the iterator type of the underlying
    // input and the set of used token value types
    typedef lex::lexertl::token<str_it, mpl::vector<char>> token_type;
    
    template <typename TokenType> struct token_buffer {
      std::vector<TokenType> tokens_;
    
      token_buffer() = default;
    
      bool operator()(TokenType t) {
        tokens_.push_back(t);
        return true;
      }
    
      void print(std::ostream &o) const {
        o << "tokens_.size() == " << tokens_.size() << std::endl;
        for (size_t i = 0; i < tokens_.size(); ++i) {
          const TokenType &t = tokens_[i];
    
          o << "[" << i << "]: -" << token_id_string(t.id()) << "- \"" << t
            << "\" [";
    
          const auto &v = t.value();
          if (t.id() == EOL) {
            o << "\\n";
          } else {
            o << v;
          }
          o << "]" << std::endl;
        }
      }
    };
    
    /***
     * Lexers which serve tokens from a buffer
     */
    
    // Two versions of the same thing, one deriving from lex::lexer, one not
    template <typename LexerType> class buffer_lexer : public lex_basic<LexerType> {
    public:
      typedef std::vector<token_type> buff_type;
      typedef typename buff_type::const_iterator iterator_type;
    
    private:
      const buff_type &buff_;
    
    public:
      buffer_lexer(const buff_type &b) : lex_basic<LexerType>(), buff_(b) {}
    
      iterator_type begin() const { return buff_.begin(); }
      iterator_type end()   const { return buff_.end(); }
    
      // for consistency with regular lexer `begin` signature, not sure if this is
      // needed
      template <typename T> iterator_type begin(T, T) { return begin(); }
    };
    
    template <typename Iterator, typename TokenType,
              typename Functor = lex::lexertl::functor<
              TokenType, lex::lexertl::detail::data, Iterator>>
    class buffer_lexer_raw {
        typedef TokenType token_type;
        typedef std::vector<token_type> buff_type;
        typedef typename buff_type::const_iterator vec_iterator_type;
      public:
    
        struct iterator_type : vec_iterator_type {
            typedef vec_iterator_type base_iterator_type;
            using vec_iterator_type::vec_iterator_type;
        };
    
        typedef char char_type;
    
    private:
        buff_type buff_;
    
    public:
        buffer_lexer_raw() {}
    
        void set_buffer(const buff_type &b) { buff_ = b; }
    
        iterator_type begin() const { return buff_.begin(); } 
        iterator_type end()   const { return buff_.end();   } 
    
        // for consistency with regular lexer `begin` signature, not sure if this is
        // needed
        template <typename T> iterator_type begin(T, T) { return begin(); }
    
        std::size_t add_token(char_type const*, char_type, std::size_t, char_type const*) {
            return 1;
        }
    
        void clear(char_type const *) {}
    };
    
    /***
     * AST
     */
    
    namespace ast {
        typedef std::string Str;
    
        struct BraceExpr;
    
        typedef boost::variant<Str, boost::recursive_wrapper<BraceExpr>> BraceExprArg;
    
        struct BraceExpr {
            std::vector<BraceExprArg> args;
        };
    
        typedef std::pair<Str, Str> Pair;
    
        struct Body;
    
        typedef boost::variant<Pair, BraceExpr, boost::recursive_wrapper<Body>> Node;
    
        struct Body {
            Str key;
            std::vector<Node> nodes;
        };
    } // end namespace ast
    
    BOOST_FUSION_ADAPT_STRUCT(ast::BraceExpr,
              (std::vector<ast::BraceExprArg>, args))
    BOOST_FUSION_ADAPT_STRUCT(ast::Body,
              (ast::Str, key)(std::vector<ast::Node>, nodes))
    
    namespace ast {
        // Stream ops
        class printer : public boost::static_visitor<> {
            std::ostream &ss_;
    
            unsigned indent_; // `uint` is a non-standard typedef; `unsigned` is portable
            std::string indent(unsigned extra = 0) const { return std::string(indent_ + extra, ' '); }
            std::string indent_plus_tab() const { return indent(tab_width); }
    
          public:
            static constexpr unsigned tab_width = 4;
    
            explicit printer(std::ostream &s, unsigned indent = 0)
                : ss_(s), indent_(indent) {}
    
            void operator()(const Str &s) const { ss_ << s; }
            void operator()(const BraceExpr &b) const {
                ss_ << "{";
                for (size_t i = 0; i < b.args.size(); ++i) {
                    if (i) {
                        ss_ << " ";
                    }
                    boost::apply_visitor(*this, b.args[i]);
                }
                ss_ << "}";
            }
            void operator()(const Pair &p) const { ss_ << p.first << " = " << p.second; }
    
            void operator()(const Body &b) const {
                ss_ << indent() << "<" << b.key << ">\n";
                printer p{ss_, indent_ + tab_width};
                for (const auto &n : b.nodes) {
                    ss_ << indent_plus_tab();
                    boost::apply_visitor(p, n);
                    ss_ << "\n";
                }
                ss_ << indent() << "</" << b.key << ">";
            }
        };
    
        std::ostream &operator<<(std::ostream &ss, const BraceExpr &b) {
            printer p{ss};
            p(b);
            return ss;
        }
    
        std::ostream &operator<<(std::ostream &ss, const Pair &p) {
            printer pr{ss};
            pr(p);
            return ss;
        }
    
        std::ostream &operator<<(std::ostream &ss, const Body &b) {
            printer p{ss};
            p(b);
            return ss;
        }
    
        // Equality ops
        bool operator==(const Pair &p1, const Pair &p2) {
            return p1.first == p2.first && p1.second == p2.second;
        }
        bool operator==(const BraceExpr &b1, const BraceExpr &b2) {
            return b1.args == b2.args;
        }
        bool operator==(const Body &b1, const Body &b2) {
            return b1.key == b2.key && b1.nodes == b2.nodes;
        }
        bool operator!=(const Pair &p1, const Pair &p2) { return !(p1 == p2); }
        bool operator!=(const BraceExpr &b1, const BraceExpr &b2) {
            return !(b1 == b2);
        }
        bool operator!=(const Body &b1, const Body &b2) { return !(b1 == b2); }
    } // end namespace ast
    
    /***
     * Grammar
     */
    
    template <typename Iterator>
    struct basic_grammar : qi::grammar<Iterator, ast::Body(), qi::locals<ast::Str>> {
        qi::rule<Iterator, ast::Body(), qi::locals<ast::Str>> body;
        qi::rule<Iterator, ast::Node()>         node;
        qi::rule<Iterator, ast::Pair()>         pair;
        qi::rule<Iterator, ast::BraceExprArg()> brace_expr_arg;
        qi::rule<Iterator, ast::BraceExpr()>    brace_expr;
        qi::rule<Iterator, ast::Str()>          identifier;
        qi::rule<Iterator, ast::Str()>          str;
        qi::rule<Iterator, ast::Str()>          open_tag;
        qi::rule<Iterator  /*, ast::Str()*/>    close_tag;
        qi::rule<Iterator> lbrace;
        qi::rule<Iterator> rbrace;
        qi::rule<Iterator> equals;
    
        qi::rule<Iterator> ws;
    
        template <typename TokenDef>
        basic_grammar(const TokenDef &tok) : basic_grammar::base_type(body, "body") {
            using namespace qi;
    
            ws            %= token(BLANK) | token(EOL);
            lbrace        %= token(LBRACE);
            rbrace        %= token(RBRACE);
            equals        %= token(EQUALS);
            identifier    %= token(ALPHA) >> *(token(ALPHA) | token(NUM) | token(USCORE));
            str           %= *(token(LCARET) | token(RCARET) | token(BSLASH) | token(LPAREN) |
                    token(RPAREN) | token(ALPHA)  | token(NUM)    | token(USCORE) |
                    token(EQUALS) | token(BLANK)  | token(IDANY));
    
            open_tag      %= omit[token(LCARET)]                   >> identifier >> omit[token(RCARET)]; // tok.open_tag;
            close_tag     %= omit[token(LCARET)  >> token(BSLASH)] >> identifier >> omit[token(RCARET)]; // tok.close_tag;
    
            // TODO FIXME: the deep_copy should not be required here
            /// bla_12 = somevalue
    
            pair           = skip(boost::proto::deep_copy(ws)) [ identifier >> equals >> str    ] ;
    
            /// <bla><sub>{some}{braced{expres}}sions</sub><pair1>key1=value</pair1></bla>
            body           = skip(boost::proto::deep_copy(ws)) [ open_tag >> *node >> close_tag ] ;
            /// 
            node           = brace_expr | body | pair;
    
            brace_expr_arg = brace_expr | identifier;
    
            /// {{{bla}some{other}nested{id{entifier}s}}and such}
            brace_expr     = skip(boost::proto::deep_copy(ws))[lbrace >> *brace_expr_arg >> rbrace];
        }
    };
    
    /***
     * Usage / Tests
     */
    
    // use actor_lexer<> here if your token definitions have semantic
    // actions
    typedef lex::lexertl::lexer<token_type> lexer_type;
    
    // this is the iterator exposed by the lexer, we use this for parsing
    typedef lexer_type::iterator_type iterator_type;
    
    token_buffer<token_type> test_lexer(const std::string &input, 
            bool silent = false) {
        str_it s   = input.begin();
        str_it end = input.end();
    
        // create a lexer instance
        lex_basic<lexer_type> lex;
    
        token_buffer<token_type> buff;
        if (!lex::tokenize(s, end, lex, [&](token_type t) { return buff(t); })) {
            if (!silent) {
                std::cout << "\nTokenizing failed!" << std::endl;
            }
        } else {
            if (!silent) {
                std::cout << "\nTokenizing succeeded!" << std::endl;
            }
        }
    
        if (!silent) {
            buff.print(std::cout);
        }
        return buff;
    }
    
    void test_grammar(const std::string &input) {
        lex_basic<lexer_type> lex;
        basic_grammar<iterator_type> gram{lex};
        ast::Body tree;
    
        {
            str_it s = input.begin();
            str_it end = input.end();
    
            if (!lex::tokenize_and_parse(s, end, lex, gram, tree)) {
                std::cout << "\nParsing failed!" << std::endl;
            } else {
                std::cout << "\nParsing succeeded!" << std::endl;
            }
    
            std::cout << tree << std::endl;
        }
    
        // Now try to do it in two steps, with buffered lexer
        auto buff = test_lexer(input, true); // get buffer, silence output
    
        typedef buffer_lexer_raw<str_it, token_type> concrete_lexer_type;
    
        concrete_lexer_type blex;
        blex.set_buffer(buff.tokens_);
    
    
        basic_grammar<concrete_lexer_type::iterator_type> bgram{blex};
        ast::Body tree2;
    
        {
            auto it = blex.begin();
            auto fin = blex.end();
    
            if (!qi::parse(it, fin, bgram, tree2)) {
                std::cout << "\nBuffered parsing failed!" << std::endl;
            } else {
                std::cout << "\nBuffered parsing succeeded!" << std::endl;
            }
        }
    
        std::cout << tree2 << std::endl;
    
        if (tree != tree2) {
            std::cout << "\nRegular parsing vs. buffered parsing mismatch!"
                << std::endl;
        }
    }
    
    int main() {
        std::string const input{""
            "<asdf>\n"
            "foo = bar\n"
            "{F foo}\n"
            "{G {F foo} {H bar}}\n"
            "</asdf>\n"};
    
        test_lexer(input);
    
        // Use lexer and grammar at once as demonstrated in tutorials
    
        std::string const input2 = "<asdf></asdf>";
        test_grammar(input2);
    
        test_grammar(input);
    
        std::string const input3{""
            "<asdf>\n"
            "foo = bar\n"
            "{F foo}\n"
            "{G {F foo} {H bar}}\n"
            "<jkl>\n"
            "baz = gaz\n"
            "{H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}\n"
            "</jkl>\n"
            "</asdf>\n"};
    
        test_grammar(input3);
    }
    

    Printing:

    Tokenizing succeeded!
    tokens_.size() == 53
    [0]: -LCARET- "65546" [<]
    [1]: -ALPHA- "65555" [a]
    [2]: -ALPHA- "65555" [s]
    [3]: -ALPHA- "65555" [d]
    [4]: -ALPHA- "65555" [f]
    [5]: -RCARET- "65547" [>]
    [6]: -EOL- "65557" [\n]
    [7]: -ALPHA- "65555" [f]
    [8]: -ALPHA- "65555" [o]
    [9]: -ALPHA- "65555" [o]
    [10]: -BLANK- "65558" [ ]
    [11]: -EQUALS- "65553" [=]
    [12]: -BLANK- "65558" [ ]
    [13]: -ALPHA- "65555" [b]
    [14]: -ALPHA- "65555" [a]
    [15]: -ALPHA- "65555" [r]
    [16]: -EOL- "65557" [\n]
    [17]: -LBRACE- "65549" [{]
    [18]: -ALPHA- "65555" [F]
    [19]: -BLANK- "65558" [ ]
    [20]: -ALPHA- "65555" [f]
    [21]: -ALPHA- "65555" [o]
    [22]: -ALPHA- "65555" [o]
    [23]: -RBRACE- "65550" [}]
    [24]: -EOL- "65557" [\n]
    [25]: -LBRACE- "65549" [{]
    [26]: -ALPHA- "65555" [G]
    [27]: -BLANK- "65558" [ ]
    [28]: -LBRACE- "65549" [{]
    [29]: -ALPHA- "65555" [F]
    [30]: -BLANK- "65558" [ ]
    [31]: -ALPHA- "65555" [f]
    [32]: -ALPHA- "65555" [o]
    [33]: -ALPHA- "65555" [o]
    [34]: -RBRACE- "65550" [}]
    [35]: -BLANK- "65558" [ ]
    [36]: -LBRACE- "65549" [{]
    [37]: -ALPHA- "65555" [H]
    [38]: -BLANK- "65558" [ ]
    [39]: -ALPHA- "65555" [b]
    [40]: -ALPHA- "65555" [a]
    [41]: -ALPHA- "65555" [r]
    [42]: -RBRACE- "65550" [}]
    [43]: -RBRACE- "65550" [}]
    [44]: -EOL- "65557" [\n]
    [45]: -LCARET- "65546" [<]
    [46]: -BSLASH- "65548" [/]
    [47]: -ALPHA- "65555" [a]
    [48]: -ALPHA- "65555" [s]
    [49]: -ALPHA- "65555" [d]
    [50]: -ALPHA- "65555" [f]
    [51]: -RCARET- "65547" [>]
    [52]: -EOL- "65557" [\n]
    
    Parsing succeeded!
    <asdf>
    </asdf>
    
    Buffered parsing succeeded!
    <asdf>
    </asdf>
    
    Parsing succeeded!
    <asdf>
        foo = bar
        {F foo}
        {G {F foo} {H bar}}
    </asdf>
    
    Buffered parsing succeeded!
    <asdf>
        foo = bar
        {F foo}
        {G {F foo} {H bar}}
    </asdf>
    
    Parsing succeeded!
    <asdf>
        foo = bar
        {F foo}
        {G {F foo} {H bar}}
            <jkl>
            baz = gaz
            {H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
        </jkl>
    </asdf>
    
    Buffered parsing succeeded!
    <asdf>
        foo = bar
        {F foo}
        {G {F foo} {H bar}}
            <jkl>
            baz = gaz
            {H {H H} {{{H} {G} {F foo}} {B ar}} {Q i}}
        </jkl>
    </asdf>
    

    ¹ Based on the buffer_lexer_raw approach.
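
The listing above buffers the tokens and then parses from the buffer, but it leaves the "modify as desired" step from the question implicit. Here is a minimal sketch of what that step might look like on a plain `std::vector`. It assumes a stand-in `simple_token` type (hypothetical, not Spirit's `token_type`) that only mimics `id()` and `value()`; the `token_ids` enum and the helper names `drop_tokens` and `merge_runs` are likewise made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for the answer's token ids and token_type.
// They are NOT the Spirit types; they only mimic id() and value().
enum token_ids : std::size_t { ALPHA = 1, BLANK, EOL, EQUALS };

struct simple_token {
    std::size_t id_;
    std::string text_;

    std::size_t id() const { return id_; }
    const std::string &value() const { return text_; }
};

// Pre-parsing adjustment 1: remove every token with the given id,
// e.g. whitespace that the grammar should never see.
inline void drop_tokens(std::vector<simple_token> &buff, std::size_t unwanted) {
    buff.erase(std::remove_if(buff.begin(), buff.end(),
                              [unwanted](const simple_token &t) {
                                  return t.id() == unwanted;
                              }),
               buff.end());
}

// Pre-parsing adjustment 2: collapse each run of adjacent tokens with
// the given id into a single token whose text is the concatenation.
inline std::vector<simple_token> merge_runs(const std::vector<simple_token> &in,
                                            std::size_t id) {
    std::vector<simple_token> out;
    for (const simple_token &t : in) {
        if (t.id() == id && !out.empty() && out.back().id() == id) {
            out.back().text_ += t.text_; // extend the current run
        } else {
            out.push_back(t);            // start a new token
        }
    }
    return out;
}
```

With the real types, calls like these would presumably operate on `buff.tokens_` between `test_lexer(...)` and `blex.set_buffer(...)`, comparing `t.id()` against the lexer's own ids. Merging token values depends on how the real token stores its value, which this sketch does not cover.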