YACCL: Yet Another Compiler Compiler Language

Introduction

YACCL is a self-bootstrapped parser and parser-generator. It's sole aim in life is to allow software developers to quickly create efficient parsers for arbitrarily complex grammars.

It can be used either as a class library from hand-written code; as a parser for grammar specifications with hand-written action routines; or as a code generator for the same, creating stand-alone (but easy to read and debug) parsers.

Enough with this equivocation, on with some examples.

Getting Started.

Download and unzip the YACCL ZIP-file distribution into some convenient directory by clicking here:

The ZIP file paths are all prefixed with the yaccl-1.0 prefix (or whatever the latest release is), so no need to worry about naming collisions.

Before you proceed with this tutorial, you need to configure the software. You need to build shell-scripts to launch the components in the correct manner.

This should require only execution of two commands:

        configure
        bin\setup (or source bin/setup on Unix)

As with most installations, there is plenty that can go wrong. If you see a message such as the following:

	C:\yaccl>configure
	WARNING: Unable to location 'tools.jar' on your path or classpath.
	WARNING: the 'yacclc' tool will probably fail to compile code.
	WARNING: java.home=C:\PROGRAM FILES\JAVA\J2RE1.4.0
	WARNING: java.class.path=.;d:\Blazix\blazix.jar;classes

you will find that the examples which demonstrate automatic code compilation of grammar derived source will fail to build. In this instance the problem is that the 'java.exe' JVM runtime is from a JRE, not a JDK.

There are two ways to solve this problem:

Add the 'bin' directory for a JDK java.exe to your path
Add the tools.jar file from your JDK to the CLASSPATH variable

YACCL is a developer tool, and you will need to have a JDK on your path for things to work properly. It was developed and tested with JDK 1.3.1-b24, the JDK which comes with JBuilder6, and should work properly with any later production-release JDK (avoid the beta's).

Before embarking on the tutorial it's a good idea to compile all the examples as follows:

        cd src
        javac examples/*.java

At this point there are two tracks you can take in this introduction document:

Building a parser by writing code
Building a parser using the YACCL grammar

Example 1: A simple calculator.

Well, what sort of parser tutorial would this be if we didn't implement a four-function calculator first. Here's how you do the parser, using the RDP class in the com.simontuffs.yaccl package:

        // A simple four function calculator (+,-,*,/).
        RDP parser = new RDP();
        Terminal sum = parser.new NonTerminal("sum");
        Terminal mul = parser.new NonTerminal("mul");
        Terminal integer = parser.new Chars('0','9').plus();

        sum.match(mul.star(parser.terminal("+").or("-").and(mul)));
        mul.match(integer.star(parser.terminal("*").or("/").and(integer)));
        parser.expression("1+2*3-4/5");
        System.out.println(parser.parse(sum));

Now if you've used parser generators before, you will know that writing eight lines of Java to do this job isn't bad. First instantiate the RDP (Recursive Descent Parser) object called parser, then use it to create two NonTerminal objects. NonTerminal is an inner class of RDP, as are most of the useful classes you will come across in this software.

The non-terminals sum and mul represent addition/multiplication operations as we'll see in a minute. The other non-terminal is integer, which uses the Chars character matching class to match one-or-more integer characters. Note: there is no need to worry about tokenizers in YACCL, tokenizing is performed as part of the parsing operation, which is a simple but very powerful approach.

If you compile and run this program (in src/example1.java) you should see "true" showing that the parser accepted the expression.

        $ cd src
        $ javac examples/example1.java
        $ java examples.example1
        true

Not the most illuminating result, granted, but you can get a much more interesting view into the operation of the parser by dumping the parse-tree that it built, by adding the code shown below after the call to parser.parse() has been made:


        $ java examples.example2

        parser.root.dump();

        dump:    <root#0>
        dump:      <sum#1>
        dump:        <mul#2>
        dump:          1
        dump:        +
        dump:        <mul#5>
        dump:          2
        dump:          *
        dump:          3
        dump:        -
        dump:        <mul#10>
        dump:          4
        dump:          /
        dump:          5

Reading this parse-tree, it shows a root node #0, followed by a single sum node, under which are three mul nodes that contain the higher precedence operations, and two terminals + and - which separate them, as expected from the input expression. It is clear that this tree could be evaluated as a stack-based expression:

        -(+(1,2*3),4/5)

as would be expected from the infix operations in the mathematical expression. (In case you're interested, with integer arithmetic the result is 7, as we will find out shortly).

Again, unlike some other parser-generators, YACCL automatically builds a parse-tree for an expression while it is parsing it. This parse-tree can be used to complete evaluation of the expression by traversing it after parsing is complete, but there is another (and often better) way.

Example 2: A simple calculator with action routines.

During parse of the expression, YACCL can call action routines when productions are matched. The easiest way to attach action routines to the productions/non-terminals is shown below:

Note: YACCL can also call action routines during intermediate matching, although you need to be careful if your grammar requires backtracking for its evaluation since the result of an action routine might need to be negated if the rule it was fired from was negated by backtracking -- more on this later in the backtracking section.

First, a couple of action routines must be created to handle nodes during the creation of the grammar. These extend the abstract RDP.Action class, and implement the fire(Node) method. We call them onSum and onMul to reflect the production/non-terminal names they are attached to in the constructors below:

        Action onSum = parser.new Action() {
            public void fire(Node node) {
                System.out.println("onSum(" + node.flatten() + ")");
            }
        };

        Action onMul = parser.new Action() {
            public void fire(Node node) {
                System.out.println("onMul(" + node.flatten() + ")");
            }
        };

        // A simple four function calculator (+,-,*,/).
        Terminal sum = parser.new NonTerminal("sum", onSum);
        Terminal mul = parser.new NonTerminal("mul", onMul);

The output from this parser is now slightly more interesting:

        $ java examples.example2 
        onMul(1)
        onMul(2*3)
        onMul(4/5)
        onSum(1+2*3-4/5)
        true
        ...

As each of the action routines is fired, it prints out a flattened version of the node it is passed. At this point a way to actually evaluate the mathematical expression should suggest itself.

Example 3: Evaluating the expression.

During onMul() we can peform the multiplication, and replace the node with the result. Actually, all that has to happen is to replace the node's token with the result, since that is what the Node.token is intended for. The following change (example3.java) shows this.

        Action onMul = parser.new Action() {
            public void fire(Node node) {
                // Evaluate the expression below here.
                int result = new Integer(node.tokenAt(0)).intValue();
                for (int i=2; i<node.children.size(); i+=2) {
                    int val = new Integer(node.tokenAt(i)).intValue();
                    if (node.tokenAt(i-1).equals("*")) result *= val;
                    else result /= val;
                }
                node.token = "" + result;
            }
        };

Note the subtleties here: there are potentially multiple children opf node, for example 1*2*3 will result in five children, '1', '*', '2', '*' and '3'. The resulting tree dump is almost complete now, showing that the mul non-terminals have been evaluated:

        dump:      <sum#1>
        dump:        1
        dump:          1
        dump:        +
        dump:        6
        dump:          2
        dump:          *
        dump:          3
        dump:        -
        dump:        0
        dump:          4
        dump:          /
        dump:          5

Example 4: All evaluation complete.

All that remains is to perform similar calculations in onSum, and the final result will be the token of that node, the value "7" (example4.java):

        Action onSum = parser.new Action() {
            public void fire(Node node) {
                // Evaluate the expression below here.
                int result = new Integer(node.tokenAt(0)).intValue();
                for (int i=2; i<node.children.size(); i+=2) {
                    int val = new Integer(node.tokenAt(i)).intValue();
                    if (node.tokenAt(i-1).equals("+")) result += val;
                    else result -= val;
                }
                node.token = "" + result;
            }
        };

        $java examples.example4
        dump:    <root#0>
        dump:      7
        dump:        1
        dump:          1
        dump:        +
        dump:        6
        dump:          2
        dump:          *
        dump:          3
        dump:        -
        dump:        0
        dump:          4
        dump:          /
        dump:          5

Using a grammar to build a parser

So far we have built a parser by writing Java code to invoke the RDP class directly. However, it more convenient (and often less confusing) to use a textual grammar to build the parser. YACCL supports this through a class called (not too surprisingly) YACCL, which takes a textual grammar and builds an in-memory parser for that grammar.

The simplest way to use YACCL is to feed it a textual grammar using the yaccl script, and let it build an in-memory parser which can then be used to parse an expression or file.

        $ yaccl examples/example5.g

Note: if you get an exception regarding class not found for examples.example5b, you need to compile the examples code as per the introduction to this document:

        $ javac examples/*.java

The grammar file in this case is a BNF-like representation of the Java code we wrote in example1.

    // This is the grammar file for example5.java.

    sum:        mul (('+'|'-') mul)*;
    mul:        integer (('*'|'/') integer)*;
    integer:    ['0'-'9']+;

    // Specify the action class.
    action-class = "examples.example5";
    expression = "1+2*3";

The structure of this file should be familiar to anyone who has worked with tools like YACC or JavaCC. This example shows two kinds entities, productions, and controls. Productions are of the form "x: a (b|c);", meaning that production/non-terminal 'x' is matched if the input stream contains 'a' followed by either 'b' or 'c'. Repeat operations include '*', '?' and '+' with the usual meanings of match zero-or-more, zero-or-one, one-or more. Character patterns are indicated by use of square brackets [].

Controls are directives to the parser controlling things such as (in this case) the expression to be parsed, and in this case which action-class to use.

An action-class is a class that contains public fields or methods which accept a single Node argument and which follow a simple naming convention. The RDP parser will automatically invoke an action routine when a production is matched, if it has a name starting with 'on', and the rest of the name matches the name of the production. So in our example we need methods called onSum(Node) and onMul(Node).

The code in example5.java is just a reworking of the action routines in example4.java


    RDP parser = new RDP();
    public Action onSum = parser.new Action() {
        public void fire(Node node) {
            // Evaluate the expression below here.
            int result = new Integer(node.tokenAt(0)).intValue();
            for (int i=2; i<node.children.size(); i+=2) {
                int val = new Integer(node.tokenAt(i)).intValue();
                if (node.tokenAt(i-1).equals("+")) result += val;
                else result -= val;
            }
            node.token = "" + result;
            node.dump();
        }
    };

    public Action onMul = parser.new Action() {
        public void fire(Node node) {
            // Evaluate the expression below here.
            int result = new Integer(node.tokenAt(0)).intValue();
            for (int i=2; i<node.children.size(); i+=2) {
                int val = new Integer(node.tokenAt(i)).intValue();
                if (node.tokenAt(i-1).equals("*")) result *= val;
                else result /= val;
            }
            node.token = "" + result;
        }
    };

Pay particular attention to the somewhat strange-looking use of an instance of the RDP class to create inner-class objects which resulted from the cut-and-paste of hand-written code. It is actually possible to simplify the action routines so that they only need be public methods accepting a single Node argument, as shown below (this code is in example5b.java), and to hook it up you need to change the action-class specifiation in example5.g to action-class="examples.example5b"

public class example5b {

    public void onSum(Node node) {
        // Evaluate the expression below here.
        int result = new Integer(node.tokenAt(0)).intValue();
        for (int i=2; i<node.children.size(); i+=2) {
            int val = new Integer(node.tokenAt(i)).intValue();
            if (node.tokenAt(i-1).equals("+")) result += val;
            else result -= val;
        }
        node.token = "" + result;
        node.dump();
    }

    public void onMul(Node node) {
        // Evaluate the expression below here.
        int result = new Integer(node.tokenAt(0)).intValue();
        for (int i=2; i<node.children.size(); i+=2) {
            int val = new Integer(node.tokenAt(i)).intValue();
            if (node.tokenAt(i-1).equals("*")) result *= val;
            else result /= val;
        }
        node.token = "" + result;
    }
}

Automatic Compilation

The final step in this tutorial is to incorporate the action code in with the grammar, and have it compiled automatically as part of generating a compilable grammar. This is easily demonstrated using the yacclc tool on the sum.g grammar in the examples directory:

        $ cd ..
        $ yacclc src/examples/sum.g

The grammar is parsed and is (or should be) converted into Java code in the file src/examples/sum/Sum.java. The Sum class contains a method called parser which returns an instance of an RDP. The action routines are also compiled into the Sum class, and the result is compiled into a stand-alone parser which can be used directly from the command line as follows:

Failures during compilation at this point mean that tools.jar is not on your classpath: go back and follow the configuration instructions at the start of this document.

        $ java -classpath ./gen:./jars/yaccl.jar examples.sum.Sum src/examples/sum.txt

Note that the choice of ':' or ';' as a classpath separator depends on your platform (Windows or Unix). Cygwin counts as Windows, so use '\;' to escape ';' from the shell.

A Full Bootstrap

One of the fun things about YACCL is that it is fully bootstrapped. By which I mean that it can parse and generate itself :)

The entire YACCL grammar is given below

    bootstrap: (production | control)+;
    production: prod action? ':' any ';';
    action: '{' literal '}' ;
    any: or ( '%' or )*;
    or: and ( '|' and )*;
    and: ( (item action? |  '!'? '(' or ')' {onRp} action? ) repeat)+;
    item: ( ( '!'? (string | literal | char | range) ) repeat | caction);
    repeat: ["*+?"] | limit | empty;
    literal: ['a'-'z' 'A'-'Z' '0'-'9' "_.-$"]+;
    prod: ['a'-'z' 'A'-'Z' '0'-'9' "_.-$"]+;
    ctl: ['a'-'z' 'A'-'Z' '0'-'9' "_.-$"]+;
    string: '"' [' '-'~' ~'"']* '"';
    char: ''' ('\\')? [' '-'~'] ''';
    int: ( "0x" ['0'-'9' 'a'-'f' 'A'-'F']+ | ['0'-'9']+);
    crange: ('~')? ( (char | int) '-' (char | int) | string | char);
    range: ('~')? '[' (crange | literal)* (stop)* ']' (["*+?"])?;
    stop: ('!')? (string | literal);
    limit: '<' int ('-' (int | "inf"))? '>';
    control: ctl '=' (bool | string | literal) ';';
    bool: ("true" | "false");
    caction: '<' control '>';
    linecomment: "//" [' '-'~' ~"\n"]* '\n';
    blockcomment: "/*" [' '-'~' " \t\n\r\f" !"*/"]* "*/";

A mere 23 productions, three of which are duplicates to simplify the implementation of action routines (literal, prod and ctl).

This code is taken from the file src/grammars/yaccl.g, which forms the core of the yaccl tool (take a look at bin/yaccl). By including the yaccl.g grammar you can easily build new grammars, which is how the yacclc compiler works: it uses the file src/grammars/actions.g to extend yaccl.g with a new section that allows in-line Java action routines to be specified as a class.

The extra productions needed to parse the Java class are shown below:

        action-class: "action" '{' class '}' ;
        class: package? import* class-spec '{' block '}' ;
        package: nl sp "package" ws name ';';
        class-spec: "public" ws "class" ws lit ws ;
        import: nl sp "import" ws name ".*"? ';' ws ;
        name: lit ('.' lit)* ;
        oc: '{' ;
        cc: '}' ;
        ws: [" \n"]+ ;
        sp: [' ']* ;
        nl: ['\n']? ;
        block: ( (ws | '.' | name | string | char | oper | oc block cc ) )* ;
        lit: ['a'-'z' 'A'-'Z' '0'-'9' "_-$"]+;
        oper: ["+-/*&|%();[]:?!=<>,"]+;

In the above production-set, a pass-through parser for the Java language is specified in 13 productions! This allows Java action routines to be placed inside the grammar, and parsed through to the output code with processing of package and class-names.

Backtracking

When it came to dealing with lookahead, I examined the arguments for lookahead versus backtracking, and found those in favor of lookahead to be somewhat uncompelling.

So I did (IMHO) the right thing, and put full LL(inf) backtracking into YACCL. What does this mean to you as a grammar writer? Simple. Don't worry about it. YACCL will do the right thing, sometimes faster than others :) You can turn on backtracking messages if you're worried about performance, or turn it off if you can get away with an LL(1) parser:

        backtrack = false;

but really, just forget it and concentrate on writing your grammar. At some future point I may install a LL(k) capability, which is really just a limitation on the decision history maintained by the parser.

Design Details

Maybe sometime. You can always read the code, but it's not heavily commented. This is, after-all, Free Software, and you get what you pay for. Anyone willing to sponsor my writing this, send me an email simon@simontuffs.com

What Next?

Here are some of the things I want to do with YACCL, but maybe you'd like to get involved and help out too.

Speed it up a lot
Build a parser for JavaCC grammars to aid migration
Build a full Java grammar (maybe leveraging the result above :)
Create C++, C, and C# parsers from YACCL grammars
Write a book on the construction and use of YACCL

Why bother?

While experimenting with a simple operator-based parser, it became clear to me that the approach I was taking was a dead end. If you're interested in what I was doing, please send me an email and I'll describe it. Anyway, I decided to take a look at another compiler-compiler technology similar to those I'd used in the past -- JavaCC -- and was surprised to find that it still used the same basic embed-all-your-code-in-the-grammar-and-generate-procedural-code technology that was used thirty years ago when YACC (which is, by the way, still a superb tool) was created.

Unable to resist the challenge, I decided to implement a recursive-descent parser in Java, and the end result is YACCL. It was both easier and harder than I'd imagined: easier at the beginning, then a lot harder at the end. Typical, enough early success to suck you in, then enough pain to remind you why you don't do this stuff very often.

Why call it YACCL?

All the other good names are taken (JAG, JAGG, YAPP, JAVAG, etc). So YACCL will have to do. This will make it impossible to find in the internet search engines (try it), which actually suits me just fine. YACCL stands for Yet Another Compiler Compiler Language in homage to the seminal YACC tool which has had such an impact on software development over the years. It could equally be called Yet Another Crazy Compiler LL Tool for those who are really into parser terminology. LL, of course, stands for Left-Recursive Lookahead, which is how RDP works its magic. Apologies to the YACC writers, since YACC is LALR and table-driven, not LL and recursive-descent. Hmm...

So who am I?

Dr P. Simon Tuffs is a software engineer with over twenty years experience in a variety of fields ranging from embedded systems to enterprise software. You can check out my full resume at http://www.simontuffs.com

YACCL was done for the best of all reasons: for the sheer challenge.

Oh, and by the way, thanks for reading this far.

License

YACCL is released under the terms of a BSD-style license, as found in the header of source-code and grammar files. See bsd-license. In addition, if you use YACCL to develop a product, or embed it in one (especially if you are building a commercial product), I require you to contact me (simon@simontuffs.com) to let me know that you are doing so, and help satisfy my curiosity as to whether this effort was worthwhile. Enjoy!

Other Stuff

The CVS repository at sourceforge contains experimental grammars for XML and EBNF (the XML grammar is written in EBNF, so I had to build that first). These are complex beasts, but if you're interested in working with them have a look, and please feel free to contact me (simon@simontuffs.com) if you can use my consulting services to build your own grammars using YACCL.