clojureinstaparse

How do we define a grammar for clojure code using instaparse?


I'm a newbie to parsing and wish to analyse some clojure code. I am hoping that someone can provide an example of how clojure code can be parsed using instaparse. I just need to do numbers, symbols, keywords, sexps, vectors and whitespace.

Some examples that I wish to parse:

(+ 1 2 
   (+ 3 4))

{:hello "there"
 :look '(i am 
           indented)}

Solution

  • Well there are two parts to your question. The first part is parsing the expression

    (+ 1 2 
       (+ 3 4))
    

    The second part is transforming the output to the result that you want. To get a good understanding of these principles, I highly recommend Udacity's Programming Languages course. Carin Meier's blog post is also quite helpful.

    The best way to understand how the parser will work, is to break it down into smaller parts. So in the first we'll just examine some parsing rules, and in the second part we'll build our sexps.

    1. A simple example

      You will first need to write a grammar that tells instaparse how to parse the given expression. We'll start by just parsing the number 1:

      (def parser
          (insta/parser
              "sexp = number
               number = #'[0-9]+'
              "))
      

      sexp describes the highest level grammar for the sexpression. Our grammar states that the sexp can only have a number. The next line states that the number can be any digit 0-9, and the + is similar to the regex + which means that it must have one number repeated any number of times. If we run our parser we get the following parse tree:

      (parser "1")     
      => [:sexp [:number "1"]]
      

      Ingoring Parenthesis

      We can ignore certain values by adding angled brackets < to our grammar. So if we want to parse "(1)" as simply 1 we can right our grammar as:

      (def parser
          (insta/parser
              "sexp = lparen number rparen
               <lparen> = <'('>
               <rparen> = <')'>
               number = #'[0-9]+'
              "))
      

      and if we run the parser again, it will ignore the left and right parenthesis:

      (parser "(1)")
      => [:sexp [:number "1"]]
      

      This will become helpful when we write the grammar for sexp below.

      Adding Spaces

      Now happens if we add spaces and run (parser "( 1 )")? Well we get an error:

      (parser "( 1 )")
      => Parse error at line 1, column 2:
         ( 1 )
          ^
         Expected: 
         #"[0-9]+"
      

      That's because we haven't defined the concept of space in our grammar! So we can add spaces as such:

      (def parser
          (insta/parser
              "sexp = lparen space number space rparen
               <lparen> = <'('>
               <rparen> = <')'>
               number = #'[0-9]+'
               <space>  = <#'[ ]*'> 
              "))
      

      Again the * is similar to the regex * and it means zero or more than one occurrence of a space. That means the following examples will all return the same result:

      (parser "(1)")         => [:sexp [:number "1"]]
      (parser "( 1 )")       => [:sexp [:number "1"]]
      (parser "(       1 )") => [:sexp [:number "1"]]
      
    2. Building the Sexp

      We're slowly going to build our grammar from the ground up. It might be useful to look at the final product here, just to give an overview of where we're headed.

      So, an sexp contains more than just numbers as defined by our simple grammar. One high level view we can have of sexp is to view them as an operation between two parenthesis. So basically as a ( operation ). We can write this directly into our grammar.

      (def parser
          (insta/parser
              "sexp = lparen operation rparen
               <lparen> = <'('>
               <rparen> = <')'>
               operation = ???
              "))
      

      As I stated above, the angled brackets < tell instaparse to ignore these values when it is making the parse tree. Now what is an operation? Well an operation consists of an operator, like +, and some arguments, like the numbers 1 and 2. So we can say write our grammar as:

      (def parser
          (insta/parser
              "sexp = lparen operation rparen
               <lparen> = <'('>
               <rparen> = <')'>
               operation = operator + args
               operator = '+'
               args = number
               number = #'[0-9]+'
              "))
      

      We stated only one possible operator, +, just to keep things simple. We have also included the number grammar rule from the simple example above. Our grammar, however, is very limited. The only valid sexp it can parse is (+1). That's because we haven't included the concept of spaces, and have stated that args can have only one number. So in this step we will do two things. We will add spaces, and we will state that args can have more than one number.

      (def parser
          (insta/parser
              "sexp = lparen operation rparen
               <lparen> = <'('>
               <rparen> = <')'>
               operation = operator + args
               operator = '+'
               args = snumber+
               <snumber> = space number
               <space>  = <#'[ ]*'> 
               number = #'[0-9]+'
              "))
      

      We added space by using the space grammar rule we defined in the simple example. We created a new snumber which is defined as space and a number, and added the + to snumber to state that it must appear once but it can repeat any number of times. So we can run our parser as so:

      (parser "(+ 1 2)")
      => [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]
      

      We can make our grammar more robust by having args reference back to sexp. That way we can have sexp in our sexp! We can do this by creating ssexp which adds a space to sexp and then add ssexp to args.

      (def parser
          (insta/parser
              "sexp = lparen operation rparen
               <lparen> = <'('>
               <rparen> = <')'>
               operation = operator + args
               operator = '+'
               args = snumber+ ssexp* 
               <ssexp>   = space sexp
               <snumber> = space number
               <space>  = <#'[ ]*'> 
               number = #'[0-9]+'
              "))
      

      Now we can run

      (parser "(+ 1 2 (+ 1 2))")
       =>   [:sexp
             [:operation
              [:operator "+"]
              [:args
               [:number "1"]
               [:number "2"]
               [:sexp
                [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]]]]
      
    3. Transformations

      This step can be done using any number of tools that work on trees, such enlive, zippers, match, and tree-seq. Instaparse, however, also includes its own useful function called insta\transform. We can build our transformations by replacing the keys in our parse tree by the valid clojure functions. For example, :number becomes read-string to turn our strings into valid numbers, :args becomes vector to build our arguments.

      So, we want to transform this:

       [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]
      

      Into this:

       (identity (apply + (vector (read-string "1") (read-string "2"))))
       => 3
      

      We can do that by defining our transformation options:

      (defn choose-op [op]
       (case op
          "+" +))
      (def transform-options
         {:number read-string
          :args vector
          :operator choose-op
          :operation apply
          :sexp identity
       })
      

      The only tricky thing here was adding the function choose-op. What we want, is to pass the function + to apply, but if we replace operator with + it will use + as a regular function. So it will transform our tree to this:

       ... (apply (+ (vector ...
      

      But by using choose-op it will pass + as an argument to apply as such:

       ... (apply + (vector ...
      

    Conclusion

    We can now run our little interpreter by putting the parser and transformer together:

    (defn lisp [input]
       (->> (parser input) (insta/transform transform-options)))
    
    (lisp "(+ 1 2)")
       => 3
    
    (lisp "(+ 1 2(+ 3 4))")
       => 10
    

    You can find the final code used in this tutorial here.

    Hopefully, this short introduction is enough to get going on your own projects. You can new lines by declaring a grammar for \n and you can even choose to not ignore spaces in your parse tree by removing the angled brackets <. That might be helpful given that you're trying to keep the indentation. Hope this helps, If not just write a comment!