ASM development log 3: DSLs

Posted on by Idorobots

"Kajtek, you incredibly handsome stallion, what have you been doing these past few months?" - you might ask, concerned about the lack of ASM dev logs recently...

Well, I've mostly been studying various PL design quirks and prototyping neat features such as vau calculus flavoured fexprs or the following piece of what I consider art.

As mentioned before, I was working on an extensible reader that would support user-defined reader macros for easy DSL programming. It took me quite some time, effort and researching but eventually I came up with an awesome solution. It's worthy to mention here, that it's in no way a new solution. I prefer to figure stuff out myself and often times it turns out that similar concept already existed a few years before mine. Bummer u_u.

PEG reader generator

The idea is simple - modify the syntax of the hosting language using full capabilities of said language extending its functionality on the fly. As simple as it may sound it's not simple at all.

Most common parsing techniques such as using CFG and LL(k) or LR(k) parsers are not particularly well-suited for this task, because they require a separate phase of lexical analysis which, while sounding innocently, complicated the design of the reader sufficiently enough for me to drop the idea for several weeks. Long story short, I started looking for something different but equally powerful, such as hygienic macros or fexprs with further improvements, and eventually stumbled upon this.

PEG is a scannerless parsing technique designed as an alternative to CFG's. PEG's are expressive and packrat parsers generated for PEG's work in guaranteed linear time thanks to memoization, but what's most important - packrat parsers are dead simple to generate.

My second experimental (but most likely final) reader design is based on PEG parser generation - given a set of grammar rules and (optional) code transformations it generates immensely recursive (but not yet memoized) parser that can be easily extended with new rules thanks to (but not necessarily requiring) late binding.

Here's an example, which you can try out yourself by importing experimental.parser2:

(grammar ((Expression < (/ String List Atom)))
         ((String     <- (:"\"") "[^\"]*" (: "\"")))
         ((List       < (: "\\​(") (* Expression) (: "\\)"))
                        `($(car List)
                          $(cdr List)))
         ((Atom       <- (/ Number Symbol)))
         ((Number     <- "[+\\-]?[0-9]+(\\.[0-9]*)?")
                         `($(car Number)
                           ($(str->num (caadr Number)))))
         ((Symbol     <- (! Number) "[^\\​(\\)\"';\\s]+")
                         `($(car Symbol)
                           ($(str->symbol (caadr Symbol)))))
         ((Spacing    <- (* (/ Comment "[\\s]+"))))
         ((Comment    <- (: ";" "[^\n]*\n"))))

(Expression "(define (compose f g)
               (lambda (arg) (f (g arg))))")

Few things to notice about this simple Lisp grammar definition:

  • It's very succinct and it mostly resembles a formal grammar definition of Lisp.
  • It uses quasiquotation and other hosting language features to modify the parse tree. Seemingly clumsy, code transformations will be further improved once pattern matching is implemented in the language.
  • It uses regexps for convenience.
  • It doesn't look anything like PEG's - dropping operator precedence simplifies the generator greatly. It's quite ironic, since this reader facility will most likely be used to introduce operator precedence and complex syntax rules to ASM DSLs. The correspondence of ASM reader syntax to PEG syntax is represented on the following snippet:

sequence:        (A B C ...)   => A B C ...
ordered choice:  (/ A B C ...) => A / B / C / ...
optional branch: (? ...)     => (...)?
zero or more:    (* ...)     => (...)*
one or more:     (+ ...)     => (...)+
not:             (! ...)     => !(...)
and (lookahead): (& ...)     => &(...)

drop node:       (: ...)     -  Drops a node from the parse tree.
concat captures: (~ ...)     -  Concatenates captures.

Grammar macros

As stated above, the reader is extensible and defining new grammar rules is very straightforward:

(syntax (Lambda < List (: "->") Expression)
  `($(car Lambda)
    ((lambda $(caadr Lambda) $(cadadr Lambda)))))

(syntax (Expression < (/ Lambda String List Atom)))

(Expression "(map (x) -> (* x x)
                  (list 1 2 3 4 5))")

It also explicitly states the interaction of a new macro with the rest of the grammar making it safe to use and easy to debug.

Here's another example, that extends ASM grammar to support direct manipulation of JSON objects:

(grammar ((JValue  < (/ String Number JObject
                        JArray JTrue JFalse JNull)))
         ((JObject < (: "\\{") (? JPair (* (:",") JPair)) (:"\\}"))
           `($(car JObject)
             ($(cons 'scope (cadr JObject)))))
         ((JPair   < String (: ":") JValue)
           `($(car JPair)
             ((var $(str->symbol (caadr JPair)) $(cadadr JPair)))))
         ((JArray  < (:"\\​[") (? JValue (* (:",") JValue)) (:"\\]"))
           `($(car JArray)
             ($(vectorof (cadr JArray)))))
         ((JTrue   <- (: "true"))
           `($(car JTrue)
         ((JFalse  <- (: "false")))
         ((JNull   <- (: "null"))))

(syntax (Expression < (/ Quote JObject String List Atom)))

And after running the experimental REPL by invoking (repl) you can use it like this:

(var json { "number": 1234.567,
            "string": "abcd",
            "object": { "member0": "value",
                        "member1": "value" },
            "boolean": true,
            "null": null })

(set! (json number) 9876.543)

Neat, isn't it?

Reader, writer... Compiler?

There's much space for improvement in the design and at the moment I'm trying to come up with a simple and natural way to sew ASM writer (facility responsible for printing ASM code) together with the reader. Ideally defining one grammar will be enough to generate both the reader and a complimentary writer, however custom writers will be possible facilitating language-to-language translation.For example, ASM' bytecode compiler might be implemented in terms of such a writer. Ahhh, the delicious symmetry.

Most likely pattern matching and some form of sealer/unsealer pattern will be used to achieve this.

I've recently started using Org2Blog to manage this blog and it turned out to be such an awesome tool that I feel obliged to mention it here. So there's that.

2016-02-16: Not true anymore, also adjusted some links & tags.