performanceparsingoptimizationclojureinstaparse

Unbelievably bad parse time


I need to analyze Elisp (Emacs Lisp) code so I wrote a parser for it using Instaparse. I expected it to be slow but doing 1k lines per second is way too slow to be right even on a calculator (or my pretty old i7). Can it be that bad or do I do something extremely wrong?

It's unambiguous and I tried to keep look ahead/behinds at minimum, unfortunately Elisp is very liberal with what constitutes as a symbol so I had to add some ahead/behinds there to differentiate numbers and symbols. Also I tried to deffer this by parsing symbols, numbers and keywords as "ident" it only gave me back like 30% of time. From my tests, it looks like Instaparse struggles a lot with recursive rules and lisps have highly recursive nature so maybe I didn't mess it up - it's just that slow...

The parser:

(ns slowparse
  (:require [clojure.string :as str]
            [instaparse.combinators :as c]
            [instaparse.core :as insta]))

(def grammar
  "Elisp grammar."
  "<root> = any +

  <any> = sexp | keyword | number | symbol | prefix | string | vector |
          comment | whitespace | char | Epsilon

  comment = comment-tok #'(?:[^\\n]*|$)'

  string = <str-l-tok> #'(?:(?:\\\\\\\\)|(?:\\\\\")|[^\"])*' <str-r-tok>

  char = <char-tok> #'(?:(?:\\\\(?:C|M)-)|(?:\\\\))?(?:.|\\s)'

  <whitespace> = <#'\\s+'>

  sexp   = sexp-l-tok any + sexp-r-tok

  vector = vec-l-tok any + vec-r-tok

  <prefix>   = quote | template | spread | hole

  <prfxbl> = sexp | symbol | keyword | number | prefix | vector

  quote    = quote-tok prfxbl
  template = tmpl-tok prfxbl
  hole     = hole-tok ! spread-tok prfxbl
  spread   = hole-tok spread-tok prfxbl

  <sexp-l-tok>      = <'('>
  <sexp-r-tok>      = <')'>

  <vec-l-tok>       = <'['>
  <vec-r-tok>       = <']'>

  <str-l-tok>       = <'\"'>
  <str-r-tok>       = <'\"'>

  <quote-tok>       = '#' ? <\"'\">

  <tmpl-tok>        = <'`'>

  <num-b-x-tok>     = '#'

  <hole-tok>        = <','>

  <spread-tok>      = <'@'>

  <comment-tok>     = <';'>

  <char-tok>        = '?'

  <kv-tok>          = <':'>

  symbol    = ! ( number | kv-tok | comment-tok | num-b-x-tok | char-tok )
               ident

  keyword = kv-tok ident

  number    = num-b10 | num-bx
  <num-b10> = #'[-+]?(?:(?:[\\d]*\\.[\\d]+)|(?:[\\d]+\\.[\\d]*)|(?:[\\d]+))' &
              ( ! ident )
  <num-bx>  = #'(?i)#(?:b|o|x|(?:\\d+r))[-+]?[a-z0-9]+'")

(def ident
  {:ident
   (let [esc-ch (str/join ["\\[" "\\]" "\\(" "\\)" "\"" "\\s" "'" "," "`" ";"])
         tmpl "(?:(?:\\\\[{{ec}}])|[^{{ec}}])+"]
     (->> esc-ch (str/replace tmpl "{{ec}}") c/regexp c/hide-tag))})

(insta/defparser ^{:doc "Elisp parser."} elisp-parser
  (merge ident (c/ebnf grammar))
  :start :root)

(def test-text (slurp "/tmp/foo.el"))

(time (insta/parse elisp-parser test-text))

Solution

  • As @akond suggested I ported the grammar to ANTLR (using https://github.com/aphyr/clj-antlr). It parses 1k lines in 20ms or less... Yeah looks like Instaparse is really slow.

    Didn't have to change much, but Instaparse definitely feels a lot nicer to write rules for. It has simple ordering and look ahead/behind, standard regex, easy way to hide junk.

    ANTLR grammar:

    (ns fastparse
      (:require [clj-antlr.core :as antlr]))
    
    (def grammar
      "Elisp grammar."
      "grammar EmacsLisp ;
    
       source: any* EOF ;
    
       any: list | keyword | number | symbol | prefix | string | vector | char |
            whitespace | comment;
    
       vector: '[' any* ']' ;
    
       list: '(' any* ')' ;
    
       prefix: quote | template | spread | hole ;
    
       quote: '#' ? '\\'' any ;
    
       template: '`' any ;
    
       spread: ',@' any ;
    
       hole: ',' any ;
    
       number: NUMB10 | NUMBX ;
    
       char: CHAR ;
    
       string: STRING ;
    
       keyword: KEYWORD ;
    
       symbol: IDENT ;
    
       whitespace: WS ;
    
       comment: COMLINE ;
    
       CHAR: '?' ( ( '\\\\' ( 'C' | 'M' ) '-' ) | '\\\\' )? . ;
    
       STRING: '\"' ( '\\\\\\\\' | '\\\\\"' | . )*? '\"' ;
    
       NUMB10: [+-] ? ( ( D* '.' D+ ) | ( D+ '.' D* ) | D+ ) ;
    
       NUMBX: '#' ( 'b' | 'o' | 'x' | ( D+ 'r' ) ) [-+]? ( A | D )+ ;
    
       fragment
       D: '0'..'9' ;
    
       fragment
       A: 'a'..'z' ;
    
       KEYWORD: ':' IDENT ;
    
       IDENT: ( ( '\\\\' [\\\\[\\]() \\n\\t\\r\"',`;] )+? |
                ( ~[[\\]() \\n\\t\\r\"',`;] )+? )+ ;
    
       COMLINE: ';' ~[\\n\\r]* ;
    
       WS: [ \\n\\t\\r]+ ;")
    
    (def elisp-str->edn (antlr/parser grammar))
    
    (def text (slurp "/tmp/foo.el"))
    
    (time (elisp-str->edn text))