c++parsingboost-spiritqi

Boost Spirit: slow parsing optimization


I'm new to Spirit and to Boost in general. I'm trying to parse a section of a VRML file that looks like this:

point
            [
            #coordinates written in meters.
            -3.425386e-001 -1.681608e-001 0.000000e+000,
            -3.425386e-001 -1.642545e-001 0.000000e+000,
            -3.425386e-001 -1.603483e-001 0.000000e+000,

The comment that starts with # is optional.

I've written a grammar, that works fine, but the parsing process is taking to long. I would like to optimize it to run much faster. My code looks like this:

struct Point
{
    double a;
    double b;
    double c;

    Point() : a(0.0), b(0.0), c(0.0){}
};

BOOST_FUSION_ADAPT_STRUCT
(
    Point,
    (double, a) 
    (double, b)
    (double, c)
)

namespace qi   = boost::spirit::qi;
namespace repo = boost::spirit::repository;

template <typename Iterator>
struct PointParser : 
public qi::grammar<Iterator, std::vector<Point>(), qi::space_type>
{
   PointParser() : PointParser::base_type(start, "PointGrammar")
   {
      singlePoint = qi::double_>>qi::double_>>qi::double_>>*qi::lit(",");
      comment = qi::lit("#")>>*(qi::char_("a-zA-Z.") - qi::eol);
      prefix = repo::seek[qi::lexeme[qi::skip[qi::lit("point")>>qi::lit("[")>>*comment]]];
      start %= prefix>>qi::repeat[singlePoint];     

      //BOOST_SPIRIT_DEBUG_NODES((prefix)(comment)(singlePoint)(start));
   }

   qi::rule<Iterator, Point(), qi::space_type>              singlePoint;
   qi::rule<Iterator, qi::space_type>                       comment;
   qi::rule<Iterator, qi::space_type>                       prefix;
   qi::rule<Iterator, std::vector<Point>(), qi::space_type> start;
};

The section that I intend to parse, is located at the middle of the input text, so I need to skip the portion of the text in order to get to it. I implemented it using repo::seek. Is this the best way?

I run the parser in the following way:

std::vector<Point> points;
typedef PointParser<std::string::const_iterator> pointParser;   
pointParser g2; 

auto start = ch::high_resolution_clock::now();  
bool r = phrase_parse(Data.begin(), Data.end(), g2, qi::space, points); 
auto end = ch::high_resolution_clock::now();

auto duration = ch::duration_cast<boost::chrono::milliseconds>(end - start).count();

To parse about 80k entries in the input text, takes about 2.5 seconds, which is pretty slow for my needs. My question is there a way to write a parsing rules in more optimized way to make it (much) faster? How can I improve this implementation in general?

I'm new to Spirit, so some explanation will be much appreciated.


Solution

  • I've hooked your grammar into a Nonius benchmark and generated uniformly random input data of ~85k lines (download: http://stackoverflow-sehe.s3.amazonaws.com/input.txt, 7.4 MB).

    When reading the file up-front I consistently get a time of ~36ms to parse the whole bunch.

    clock resolution: mean is 17.616 ns (40960002 iterations)
    
    benchmarking sample
    collecting 100 samples, 1 iterations each, in estimated 3.82932 s
    mean: 36.0971 ms, lb 35.9127 ms, ub 36.4456 ms, ci 0.95
    std dev: 1252.71 μs, lb 762.716 μs, ub 2.003 ms, ci 0.95
    found 6 outliers among 100 samples (6%)
    variance is moderately inflated by outliers
    

    Code: see below.


    Notes:

    Live On Coliru

    #include <boost/fusion/adapted/struct.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/repository/include/qi_seek.hpp>
    
    namespace qi   = boost::spirit::qi;
    namespace repo = boost::spirit::repository;
    
    struct Point { double a = 0, b = 0, c = 0; };
    
    BOOST_FUSION_ADAPT_STRUCT(Point, a, b, c)
    
    template <typename Iterator>
    struct PointParser : public qi::grammar<Iterator, std::vector<Point>(), qi::space_type>
    {
        PointParser() : PointParser::base_type(start, "PointGrammar")
        {
            singlePoint = qi::double_ >> qi::double_ >> qi::double_ >> *qi::lit(',');
    
            comment     = '#' >> *(qi::char_ - qi::eol);
    
            prefix      = repo::seek[
                qi::lit("point") >> '[' >> *comment
                ];
    
            //prefix      = repo::seek[qi::lexeme[qi::skip[qi::lit("point")>>qi::lit("[")>>*comment]]];
    
            start      %= prefix >> *singlePoint;
    
            //BOOST_SPIRIT_DEBUG_NODES((prefix)(comment)(singlePoint)(start));
        }
    
      private:
        qi::rule<Iterator, Point(), qi::space_type>              singlePoint;
        qi::rule<Iterator, std::vector<Point>(), qi::space_type> start;
        qi::rule<Iterator, qi::space_type>                       prefix;
        // implicit lexeme:
        qi::rule<Iterator>  comment;
    };
    
    #include <nonius/benchmark.h++>
    #include <nonius/main.h++>
    #include <boost/iostreams/device/mapped_file.hpp>
    
    static boost::iostreams::mapped_file_source src("input.txt");
    
    NONIUS_BENCHMARK("sample", [](nonius::chronometer cm) {
        std::vector<Point> points;
    
        using It = char const*;
        PointParser<It> g2;
    
        cm.measure([&](int) {
            It f = src.begin(), l = src.end();
            return phrase_parse(f, l, g2, qi::space, points);
            bool ok =  phrase_parse(f, l, g2, qi::space, points);
            if (ok)
                std::cout << "Parsed " << points.size() << " points\n";
            else
                std::cout << "Parsed failed\n";
    
            if (f!=l)
                std::cout << "Remaining unparsed input: '" << std::string(f,std::min(f+30, l)) << "'\n";
    
            assert(ok);
        });
    })
    

    Graph:

    enter image description here

    Another run output, live: