I need to analyze Elisp (Emacs Lisp) code so I wrote a parser for it using Instaparse. I expected it to be slow but doing 1k lines per second is way too slow to be right even on a calculator (or my pretty old i7). Can it be that bad or do I do something extremely wrong?
It's unambiguous and I tried to keep look ahead/behinds at minimum, unfortunately Elisp is very liberal with what constitutes as a symbol so I had to add some ahead/behinds there to differentiate numbers and symbols. Also I tried to deffer this by parsing symbols, numbers and keywords as "ident" it only gave me back like 30% of time. From my tests, it looks like Instaparse struggles a lot with recursive rules and lisps have highly recursive nature so maybe I didn't mess it up - it's just that slow...
The parser:
(ns slowparse
(:require [clojure.string :as str]
[instaparse.combinators :as c]
[instaparse.core :as insta]))
(def grammar
"Elisp grammar."
"<root> = any +
<any> = sexp | keyword | number | symbol | prefix | string | vector |
comment | whitespace | char | Epsilon
comment = comment-tok #'(?:[^\\n]*|$)'
string = <str-l-tok> #'(?:(?:\\\\\\\\)|(?:\\\\\")|[^\"])*' <str-r-tok>
char = <char-tok> #'(?:(?:\\\\(?:C|M)-)|(?:\\\\))?(?:.|\\s)'
<whitespace> = <#'\\s+'>
sexp = sexp-l-tok any + sexp-r-tok
vector = vec-l-tok any + vec-r-tok
<prefix> = quote | template | spread | hole
<prfxbl> = sexp | symbol | keyword | number | prefix | vector
quote = quote-tok prfxbl
template = tmpl-tok prfxbl
hole = hole-tok ! spread-tok prfxbl
spread = hole-tok spread-tok prfxbl
<sexp-l-tok> = <'('>
<sexp-r-tok> = <')'>
<vec-l-tok> = <'['>
<vec-r-tok> = <']'>
<str-l-tok> = <'\"'>
<str-r-tok> = <'\"'>
<quote-tok> = '#' ? <\"'\">
<tmpl-tok> = <'`'>
<num-b-x-tok> = '#'
<hole-tok> = <','>
<spread-tok> = <'@'>
<comment-tok> = <';'>
<char-tok> = '?'
<kv-tok> = <':'>
symbol = ! ( number | kv-tok | comment-tok | num-b-x-tok | char-tok )
ident
keyword = kv-tok ident
number = num-b10 | num-bx
<num-b10> = #'[-+]?(?:(?:[\\d]*\\.[\\d]+)|(?:[\\d]+\\.[\\d]*)|(?:[\\d]+))' &
( ! ident )
<num-bx> = #'(?i)#(?:b|o|x|(?:\\d+r))[-+]?[a-z0-9]+'")
(def ident
{:ident
(let [esc-ch (str/join ["\\[" "\\]" "\\(" "\\)" "\"" "\\s" "'" "," "`" ";"])
tmpl "(?:(?:\\\\[{{ec}}])|[^{{ec}}])+"]
(->> esc-ch (str/replace tmpl "{{ec}}") c/regexp c/hide-tag))})
(insta/defparser ^{:doc "Elisp parser."} elisp-parser
(merge ident (c/ebnf grammar))
:start :root)
(def test-text (slurp "/tmp/foo.el"))
(time (insta/parse elisp-parser test-text))
As @akond suggested I ported the grammar to ANTLR (using https://github.com/aphyr/clj-antlr). It parses 1k lines in 20ms or less... Yeah looks like Instaparse is really slow.
Didn't have to change much, but Instaparse definitely feels a lot nicer to write rules for. It has simple ordering and look ahead/behind, standard regex, easy way to hide junk.
ANTLR grammar:
(ns fastparse
(:require [clj-antlr.core :as antlr]))
(def grammar
"Elisp grammar."
"grammar EmacsLisp ;
source: any* EOF ;
any: list | keyword | number | symbol | prefix | string | vector | char |
whitespace | comment;
vector: '[' any* ']' ;
list: '(' any* ')' ;
prefix: quote | template | spread | hole ;
quote: '#' ? '\\'' any ;
template: '`' any ;
spread: ',@' any ;
hole: ',' any ;
number: NUMB10 | NUMBX ;
char: CHAR ;
string: STRING ;
keyword: KEYWORD ;
symbol: IDENT ;
whitespace: WS ;
comment: COMLINE ;
CHAR: '?' ( ( '\\\\' ( 'C' | 'M' ) '-' ) | '\\\\' )? . ;
STRING: '\"' ( '\\\\\\\\' | '\\\\\"' | . )*? '\"' ;
NUMB10: [+-] ? ( ( D* '.' D+ ) | ( D+ '.' D* ) | D+ ) ;
NUMBX: '#' ( 'b' | 'o' | 'x' | ( D+ 'r' ) ) [-+]? ( A | D )+ ;
fragment
D: '0'..'9' ;
fragment
A: 'a'..'z' ;
KEYWORD: ':' IDENT ;
IDENT: ( ( '\\\\' [\\\\[\\]() \\n\\t\\r\"',`;] )+? |
( ~[[\\]() \\n\\t\\r\"',`;] )+? )+ ;
COMLINE: ';' ~[\\n\\r]* ;
WS: [ \\n\\t\\r]+ ;")
(def elisp-str->edn (antlr/parser grammar))
(def text (slurp "/tmp/foo.el"))
(time (elisp-str->edn text))