Well there are two parts to your question. The first part is parsing the expression
The second part is transforming the output to the result that you want. To get a good understanding of these principles, I highly recommend Udacity's Programming Languages course. Carin Meier's blog post is also quite helpful.
The best way to understand how the parser will work, is to break it down into smaller parts. So in the first we'll just examine some parsing rules, and in the second part we'll build our sexps.
A simple example
You will first need to write a grammar that tells instaparse how to parse the given expression. We'll start by just parsing the number 1
:
(def parser
(insta/parser
"sexp = number
number = #'[0-9]+'
"))
sexp describes the highest level grammar for the sexpression. Our grammar states that the sexp can only have a number. The next line states that the number can be any digit 0-9, and the +
is similar to the regex +
which means that it must have one number repeated any number of times. If we run our parser we get the following parse tree:
(parser "1")
=> [:sexp [:number "1"]]
Ingoring Parenthesis
We can ignore certain values by adding angled brackets <
to our grammar. So if we want to parse "(1)"
as simply 1
we can right our grammar as:
(def parser
(insta/parser
"sexp = lparen number rparen
<lparen> = <'('>
<rparen> = <')'>
number = #'[0-9]+'
"))
and if we run the parser again, it will ignore the left and right parenthesis:
(parser "(1)")
=> [:sexp [:number "1"]]
This will become helpful when we write the grammar for sexp below.
Adding Spaces
Now happens if we add spaces and run (parser "( 1 )")
? Well we get an error:
(parser "( 1 )")
=> Parse error at line 1, column 2:
( 1 )
^
Expected:
#"[0-9]+"
That's because we haven't defined the concept of space in our grammar! So we can add spaces as such:
(def parser
(insta/parser
"sexp = lparen space number space rparen
<lparen> = <'('>
<rparen> = <')'>
number = #'[0-9]+'
<space> = <#'[ ]*'>
"))
Again the *
is similar to the regex *
and it means zero or more than one occurrence of a space. That means the following examples will all return the same result:
(parser "(1)") => [:sexp [:number "1"]]
(parser "( 1 )") => [:sexp [:number "1"]]
(parser "( 1 )") => [:sexp [:number "1"]]
Building the Sexp
We're slowly going to build our grammar from the ground up. It might be useful to look at the final product here, just to give an overview of where we're headed.
So, an sexp contains more than just numbers as defined by our simple grammar. One high level view we can have of sexp is to view them as an operation between two parenthesis. So basically as a ( operation )
. We can write this directly into our grammar.
(def parser
(insta/parser
"sexp = lparen operation rparen
<lparen> = <'('>
<rparen> = <')'>
operation = ???
"))
As I stated above, the angled brackets <
tell instaparse to ignore these values when it is making the parse tree. Now what is an operation? Well an operation consists of an operator, like +
, and some arguments, like the numbers 1
and 2
. So we can say write our grammar as:
(def parser
(insta/parser
"sexp = lparen operation rparen
<lparen> = <'('>
<rparen> = <')'>
operation = operator + args
operator = '+'
args = number
number = #'[0-9]+'
"))
We stated only one possible operator, +
, just to keep things simple. We have also included the number grammar rule from the simple example above. Our grammar, however, is very limited. The only valid sexp it can parse is (+1)
. That's because we haven't included the concept of spaces, and have stated that args can have only one number. So in this step we will do two things. We will add spaces, and we will state that args can have more than one number.
(def parser
(insta/parser
"sexp = lparen operation rparen
<lparen> = <'('>
<rparen> = <')'>
operation = operator + args
operator = '+'
args = snumber+
<snumber> = space number
<space> = <#'[ ]*'>
number = #'[0-9]+'
"))
We added space
by using the space grammar rule we defined in the simple example. We created a new snumber
which is defined as space
and a number
, and added the +
to snumber to state that it must appear once but it can repeat any number of times. So we can run our parser as so:
(parser "(+ 1 2)")
=> [:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]
We can make our grammar more robust by having args
reference back to sexp
. That way we can have sexp in our sexp! We can do this by creating ssexp
which adds a space
to sexp
and then add ssexp
to args
.
(def parser
(insta/parser
"sexp = lparen operation rparen
<lparen> = <'('>
<rparen> = <')'>
operation = operator + args
operator = '+'
args = snumber+ ssexp*
<ssexp> = space sexp
<snumber> = space number
<space> = <#'[ ]*'>
number = #'[0-9]+'
"))
Now we can run
(parser "(+ 1 2 (+ 1 2))")
=> [:sexp
[:operation
[:operator "+"]
[:args
[:number "1"]
[:number "2"]
[:sexp
[:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]]]]
Transformations
This step can be done using any number of tools that work on trees, such enlive, zippers, match, and tree-seq. Instaparse, however, also includes its own useful function called insta\transform
. We can build our transformations by replacing the keys in our parse tree by the valid clojure functions. For example, :number
becomes read-string
to turn our strings into valid numbers, :args
becomes vector
to build our arguments.
So, we want to transform this:
[:sexp [:operation [:operator "+"] [:args [:number "1"] [:number "2"]]]]
Into this:
(identity (apply + (vector (read-string "1") (read-string "2"))))
=> 3
We can do that by defining our transformation options:
(defn choose-op [op]
(case op
"+" +))
(def transform-options
{:number read-string
:args vector
:operator choose-op
:operation apply
:sexp identity
})
The only tricky thing here was adding the function choose-op
. What we want, is to pass the function +
to apply
, but if we replace operator
with +
it will use +
as a regular function. So it will transform our tree to this:
... (apply (+ (vector ...
But by using choose-op
it will pass +
as an argument to apply
as such:
... (apply + (vector ...
We can now run our little interpreter by putting the parser and transformer together:
Hopefully, this short introduction is enough to get going on your own projects. You can new lines by declaring a grammar for \n
and you can even choose to not ignore spaces in your parse tree by removing the angled brackets <
. That might be helpful given that you're trying to keep the indentation. Hope this helps, If not just write a comment!