PCT: Parrot Compiler Tools

So far we've talked a lot about low-level Parrot programming with PIR and PASM. However, the true power of Parrot is its ability to host programs written in high level languages such as Perl 6, Python, Ruby, Tcl, and PHP. In order to write code in these languages developers need there to be compilers that convert from the language into PIR or PASM (or even directly convert to Parrot Bytecode). People who have worked on compilers before may be anticipating us to use terms like "Lex and Yacc" here, but we promise that we wont.

Instead of traditional lexical analyzers and parser-generators that have been the mainstay of compiler designers for decades, Parrot uses an advanced set of parsing tools called the Parrot Compiler Tools (PCT). PCT uses a subset of the Perl 6 programming language called Not Quite Perl (NQP) and an implementation of the Perl 6 Grammar Engine (PGE) to build compilers for Parrot. Instead of using traditional low-level languages to write compilers, we can use a modern dynamic language like Perl 6 to write it instead. On a more interesting note, this means that the Perl 6 compiler is itself being written in Perl 6, a mind-boggling process known as bootstrapping.

PCT Overview

PCT is a collection of classes which handle the creation of a compiler and driver program for a high-level language. The PCT::HLLCompiler class handles building the compiler front end while the PCT::Grammar and PCT::Grammar::Actions classes handle building the parser and lexical analyzer. Creating a new HLL compiler is as easy as subclassing these three entities with methods specific to that high-level language.

While it was originally designed as the backend for Perl 6, Parrot has since grown to become a language agnostic virtual machine. However, PCT is still very tied to Perl 6 development, and was actually developed by Perl 6 developers for Perl 6 development. The fact that it's very powerful, flexible, and useful for other languages is a nice bonus, And a mark of good design! but the needs of other languages is not the driving force behind PCT. Perl 6 is such a large and varied language that a tool capable of implementing it is also capable of implementing many other languages as well, a fact that has been exploited by many language designers and implementers on Parrot.

Grammars and Action Files

Creating a compiler using PCT requires three basic files, plus any additional files needed to implement the languages logic and library:

make_language_shell.pl

The Parrot repository contains a number of helpful utilities for doing some common development and building tasks with Parrot. These utilities are all currently written in Perl 5, although in future releases the majority will likely be built in Perl 6 and run on Parrot directly. This is a bootstrapping problem that the developers have not tackled yet.

One of the tools of use to new compiler designers and language implementers is make_language_shell.pl. make_language_shell.pl is a tool for automatically creating all the necessary stub files for creating a new compiler for Parrot. It generates the driver file, parser grammar and actions files, builtin functions stub file, makefile, and test harness. All of these are demonstrative stubs and will obviously need to be edited furiously or even completely overwritten, but they give a good idea of what is needed to start on development of the compiler.

make_language_shell.pl is designed to be run from within the Parrot repository file structure. It creates a subfolder in /languages/ with the name of your new language implementation. Typically a new implementation of an existing language is not simply named after the language, but is given some other descriptive name to let users know it is only one implementation available. Consider the way Perl 5 distributions are named things like "Active Perl" or "Strawberry Perl", or how Python distributions might be "IronPython" or "VPython". If, on the other hand, you are implementing an entirely new language, you don't need to give it a fancy distribution name.

Parsing Fundamentals

Compilers typically consist of three components: The lexical analyzer, the parser, and the code generator This is an oversimplification, compilers also may have semantic analyzers, symbol tables, optimizers, preprocessors, data flow analyzers, dependency analyzers, and resource allocators, among other components. All these things are internal to Parrot and aren't the concern of the compiler implementer. Plus, these are all well beyond the scope of this book. The lexical analyzer converts the HLL input file into individual tokens. A token may consist of an individual punctuation mark("+"), an identifier ("myVar"), or a keyword ("while"), or any other artifact that cannot be sensibly broken down. The parser takes a stream of these input tokens, and attempts to match them against a given pattern, or grammar. The matching process orders the input tokens into an abstract syntax tree (AST), which is a form that the computer can easily work with. This AST is passed to the code generator which converts it into code of the target language. For something like the GCC C compiler, the target language is machine code. For PCT and Parrot, the target language is PIR and PBC.

Parsers come in two general varieties: Top-down and bottom-up. Top-down parsers start with a top-level rule, a rule which is supposed to represent the entire input. It attempts to match various combination of subrules until the entire input is matched. Bottom-down parsers, on the other hand, start with individual tokens from the lexical analyzer and attempt to combine them together into larger and larger patterns until they produce a top-level token.

PGE itself is a top-down parser, although it also contains a bottom-up operator precedence parser, for things like mathematical expressions where bottom-up methods are more efficient. We'll discuss both, and the methods for switching between the two, throughout this chapter.

Driver Programs

The driver program for the new compiler must create instances of the various necessary classes that run the parser. It must also include the standard function libraries, create global variables, and handle commandline options. Most commandline options are handled by PCT, but there are some behaviors that the driver program will want to override.

PCT programs can, by default, be run in two ways: Interactive mode, which is run one statement at a time in the console, and file mode which loads and runs an entire file. For interactive mode, it is necessary to specify information about the prompt that's used and the environment that's expected. Help and error messages need to be written for the user too.

HLLCompiler class

Parrot Grammar Engine

The Parrot Grammar Engine is an implementation of the Perl 6 grammar syntax in Parrot. The grammar engine in Perl 6 is much more advanced and flexible then the regular expression syntax of Perl 5. Since most other languages who implement regular expressions use a copy (or a subset) of Perl 5's regular expressions, PGE is much more powerful then those too.

PGE uses a recursive descent algorithm it also uses an operator precedence parser, when you ask it nicelywhich should be very familiar to users of the Perl 5 module Parse::RecDescent. In fact, both were originally designed by the same developer, Damian Conway. Recursive Descent, for those who get into the algorithmic details, is a top-down parsing algorithm, unlike the bottom-up LALR algorithm used in parser- generators like Yacc and Bison. Most programmers won't notice the difference between the two for most applications, although there are some specific places where the behavior will be different.

Rules and Actions

A recursive descent parser, like the one used in PGE, is a top-down parser. This means it attempts to start at the highest-level rule and work its way down to the individual input tokens in order to match the given input. Once the parser has matched the entire input a source code file, or a line of input at the terminal in interactive mode the parse is considered successful and the generated AST is delivered to the code generator for conversion into PIR.

The parser is formed by creating a grammar. Grammars consist of three primary components: rules, tokens, and protoregexes.

When a rule matches a sequence of input tokens, an associated method in NQP is called to convert that match into an AST node. This node is then inserted into the parse tree.

Basic Rules

Let's start off with a simple rule:

 rule persons_name {
    <first_name> <last_name>
 }

We also define the two name tokens as:

 token first_name { <alpha>+ }
 token last_name { <alpha>+ }

The special token <alpha> is a built-in construct that only accepts upper case and lower case letters. The "+" after the <alpha> tag is a short way of saying "one or more". Our rule persons_name would match something like Darth Vader Actually, it matches a lot of things that aren't people's namesbut wouldn't match something like C 3P0. Notice that the rule above would match Jar Jar Binks, but not the way you would expect: It would match the first "Jar" as <first_name> and the second "Jar" as <last_name> and wouldn't match "Binks" at all.

In PGE, the top-level rule which starts the match process and must successfully match in order for the compiler to continue is always called TOP. Without a TOP rule, your compiler won't do anything Actually, it will print out an error saying that you have no TOP rule, and then it will exit, but that isn't anything that we would want it to do.

Here's an example TOP rule that uses our definition for <persons_name> above and matches a file that contains a comma- separated list of names:

 rule TOP {
    [ <persons_name> ',' ]*
 }

this example shows another new construct, the square brackets. Square brackets are ways to group things together. The star at the end means that we take all the things inside the brackets zero or more times. This is similar to the plus, except the plus matches one or more times. Notice, however, that the above rule always matches a comma at the end, so we would need to have something like:

 Darth Vader, Luke Skywalker,

Instead of something more natural like:

 Darth Vader, Luke Skywalker

We can modify the rule a little bit so that it always ends with a name instead of a comma:

 rule TOP {
    [ <persons_name> ',' ]* <persons_name>
 }

Now we don't need a trailing comma, but at the same time we can't match an empty file because it always expects to have at least one name at the end. If we still want to match empty files successfully, we need to make the whole rule optional:

 rule TOP {
    [ [ <persons_name> ',' ]* <persons_name> ]?
 }

We've grouped the whole rule together in another set of brackets, and put a "?" question mark at the end. The question mark means zero or one of the prior item.

The symbols "*" (zero or more), "+" (one or more) and "?" are called quantifiers, and allow an item in the rule to match a variable number of times. These aren't the only quantifiers, but they are the most common. We will talk about other quantifiers later on.

Calling Actions

We haven't covered actions yet, but it's still important now to talk about how we will call them when we are ready. We call an action by inserting the {*} token into the rule. When the {*} rule is encountered, PGE calls the associated action method with the current match object as an argument. Let's take our persons_name rule from above, and sprinkle liberally with action calls:

 rule persons_name {
    {*} <first_name> {*} <last_name> {*}
 }

The first call to the action method contains an empty match object because the parser hasn't had a chance to match anything yet. The second call contains only the first name of the match. The third and final call contains both the matched first and last name. Notice that if the match fails halfway through, we still call the actions where we succeeded, but do not call the actions after the failure. So, if we try to match the string "Leia", the action is called before the name and after the first name. When the rule tries to match the last name, it fails because no last name is provided, and the third action method call is never made.

Alternations and Keys

In addition to sub-rules, groups, and quantifiers, we also are able to take alternations between options that are either-or. The vertical bar token "|" can be used to distinguish between options where only one may match, Here's an example:

 rule hero {
    ['Luke' | 'Leia'] 'Skywalker'
 }

This rule will match either "Luke Skywalker" or "Leia Skywalker" but won't match "Luke Leia Skywalker" or anything else, for that matter. With things like alternations, if we want to call an action method it's helpful to distinguish which combination we matched:

 rule hero {
    [
      'Luke' {*}    #= Luke
    | 'Leia' {*}    #= Leia
    ]
    'Skywalker'
 }

This is the same rule, except now it passes two arguments to its action method: the match object and the name of the person who got matched.

Warning: Left Recursion

Getting into all the nitty-gritty theory behind parsers is well beyond the scope of this book. However, there is one potential pitfall that developers should be made aware of that is not immediately obvious. Like functions in ordinary procedural or functional languages, the methods in the PGE parser grammar can call themselves recursively. Consider the following rules derived in part from the grammar for the C programming language:

 rule if_statement {
    'if' <condition> '{' <statement>* '}' <else_block>?
 }

 rule statement {
    <if_statement> | <expression>
 }

 rule else_block {
    'else' '{' <statements>* '}'
 }

Notice that an if_statement can contain a list of statements, and that each statement may itself be an if_statement? This is called recursion , and is part of where the "Recursive Descent" algorithm gets its name from.

Now, let's look at a more direct example of a comma-separated list of integer digits to form an array. We can define this recursively as follows:

 rule list {
     <list> ',' <digit> | <digit>
 }

The intention is that if there is only one digit, we match the second option in the alternation, and if there are more digits we can match them recursively in the first alternation. However, take a close look at the insidious result. The recursive descent parser enters the list rule. It's first option is to enter the list rule again, so it does. Recursive descent is a depth-first algorithm, and it will continue to descent down a particular path until it finds a successful match or a match failure. In this case, it matches list and then it matches list again, then it matches list again, and so on and so forth. What we have created is an infinite loop pattern called left recursion.

Left recursion is caused when the left-most item of the left-most alternation is a recursion. The rule above can be easily resolved by writing:

 rule list {
    <digit> | <list> ',' <digit>
 }

Or even

 rule list {
    <digit> ',' <list> | <digit>
 }

Both of these two options make sure the left-most item in our rule is not a recursion, therefore preventing left recursion.

Here is a more tricky example where the left recursion is hidden from view:

 rule term {
    <expression> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

This is a very limited subset of mathematical equations that we might like to write, and even in this small subset we have this same problem: To match a term, the parser first tries to match an expression, which in turn matches a term and then an expression ...

Left recursion is not the only problem you can run into with a recursive descent grammar, but it's one that's likely going to come up relatively often for new language designers, and one that is not always likely to generate useful error messages.

Operator Precedence Parser

Places where there are lots of little tokens in a statement, and where there are lots of possible options that a top-down parser will have to attempt can become relatively inefficient using PCT's recursive descent parser. Specifically, mathematical expressions are very open-ended and have forms that are difficult to anticipate. Consider the expression:

 a + b * c + d

The recursive descent parser is going to have to recognize through significant trial and error how this statement should be parsed. For tasks like this, recursive descent parsers are not ideal, although a type of bottom-up parser called an operator precedence parser is. Operator precedence parsers work similarly to more versatile bottom-up parsers such as Lex or Yacc, but are optimized for use with expressions and equations. The "things" in an equation are split into two subtypes: terms and operators. Operators themselves are split into a number of types including prefix (-a), postfix (i++), infix (x + y), circumfix ([z]), postcircumix (a[b]), and list (1, 2, 3). Each operator gets its own precedence number that specifies how closely it binds to the terms. In the example above, the expression is parsed

 a + (b * c) + d

This is because the * operator has a higher precedence and therefore binds more tightly then the + operator.

To switch from the top-down recursive descent parser to the bottom-up operator precedence parser, a rule must be defined that is an optable :

 rule expression is optable { ... }

The ... ellipses aren't an editorial shortcut, it's the Perl 6 operator that is used to define a function signature. The ... indicates that this is just a signature and that the actual guts of it will be filled in somewhere else. In this case, that "somewhere else" is in the definition of the optable role.

Protofunction Definitions

Protofunctions are used to define operators in the optable in the same way that rules and tokens are used throughout the rest of the grammar. A proto is a way of saying that the rule is overridable dynamically, and that it might be defined somewhere else. In this case, PCT takes information from the proto declaration and fills in the details for us. On another note, this also means that the HLL itself can modify its own grammar at run time, by overriding the proto definitions for its operator table. Some languages call this process "operator overloading".

A proto is defined like this, taking some of our grammar rules above:

 'proto' <proto_name> [ 'is' <property> ] '{' '...' '}'

The name of the operator, listed as <proto_name> above, contains both a location part and an identifier part. The location is one of the places where the operator can be located, such as infix, postfix, prefix, circumfix, and postcircumfix. The name of the operator is the symbol used for the operator in any of the quotes that Perl 6 understands:

 proto infix:<+>                  # a + b
 proto postfix:'--'               # i--
 proto circumfix:«<>»             # <x>

The is keyword defines a property of the rule. Some examples of this are:

 is precedence(1)     # Specifies an exact precedence
 is equiv('+')        # Has the same precedence as the "+" operator
 is assoc('right')    # Right associative. May also be "left" or "list"
 is pirop('add')      # Operands are passed to the PIR operator "and"
 is subname('mySub')  # Operands are passed to the function "mySub"
 is pasttype('if')    # Operands are passed as children to an "if" PAST node in
                      # the parse tree
 is parsed(&myRule)   # The token is parsed and identified using the rule
                      # "myRule" from the top-down parser

Protofunction definitions are function signatures which can be overridden via multimethod dispatch. This means functions can be written with the same name as the rule to implement the behavior of the operator:

 rule infix:"+" { ... }

And in a PIR file for built-in functions:

 .sub 'infix:+'
    .param pmc a
    .param pmc b
    .local pmc c
    c = a + b
    .return(c)
 .end

The question to ask then is "Why have an is subname() property, if all operators can be defined as subroutines?" The answer is that using the is subname() property allows PCT to call a subroutine of a different name then the operator. This is a good idea if there is already a built-in function in the language that duplicates the functionality of the operator. There is no sense duplicating functionality, is there?

The great thing about protos being overloadable is that you can specify different functions to call with different signatures:

 .sub 'infix:+' :multi('Integer', 'Integer')
    ...
 .end

 .sub 'infix:+' :multi('CLispRatio', 'Number')
    ...
 .end

 .sub 'infix:+' :multi('Perl6Double', 'PythonInteger')
    ...
 .end

This list can be a bit intimidating, and it's hard to imagine that it would be necessary to write up a new function to handle addition between every conceivable pair of operands. Fortunately for us all, this isn't the case because all these data types have those VTABLE interfaces that we can use. For most data types Parrot already has basic arithmetic operations built in, and it's only necessary to override for those data types with special needs. This example was only a demonstration of the flexibility of the method.

Actions and NQP

Protofunction signatures aren't the only way to apply functions to rules matched by the parser. In fact, they might be the most primative because they use PIR code to implement the operator logic. Another way has been made available, by programming function actions in a language that's almost, but Not Quite Perl (NQP).

NQP is a small language that's implemented as a subset of Perl 6 syntax and semantics. It's represents almost the smallest subset of the Perl 6 language necessary to implement the logic of a parser, although some developers have complained enough to get a few extra syntactic features added in above the bare minimum. NQP also happens to be a Perl 6 subset that's not entirely dissimilar from Perl 5, so Perl 5 programmers should not be too lost when writing NQP.

NQP Basics

Like Perl, NQP uses sigils to differentiate different types of variables. The $ sigil is used for scalars, @ is used for arrays, and % is used for hashes Perl 6 aficionados will know that this isn't entirely true, but an in-depth look at Perl 6's context awareness is another topic for another book. A "scalar" is really any single value, and can interchangably be given a string value, or an integer value, or an object. In NQP we can write things like this:

 $scalar := "This is a string"
 $x      := 123
 $pi     := 3.1415      # rounding

Wait a minute, what's that weird := symbol? Why don't we just use the plain old vanilla = sign? The problem is that NQP doesn't have it. Remember how we mentioned that NQP was a minimal subset or Perl 6? The := operator is the bind operator, that makes one value an alias C programmers and the like may call it a "reference" for another. In most cases you can ignore the distinction between the two, but be warned that it's not a regular variable assignment.

With hashes and arrays, it might be tempting to do a list assignment like we've all grown familiar with in Perl 5 and other dynamic languages:

 @small_integers := (1, 2, 3, 4);                        # WRONG!
 %leading_ladies := ("Leia" => "Starwars",
                    "Trillian" => "Hitchhikers Guide"); # WRONG!

Here's another little gotcha, NQP doesn't have list or hash context! If it's necessary to initialize a whole list at once, you can write:

 @small_integers[0] := 1;
 @small_integers[1] := 2;
 # ... And so on, and so forth ...

It's also possible to assign a list in scalar context as follows:

 $array_but_a_scalar := (1, 2, 3, 4)

Remember how we said NQP was a bare-bones subset of Perl 6? If NQP had too many features, people would use it instead of Perl 6!

Calling Actions From Rules

When talking about grammar rules, we discussed the funny little {*} symbol that calls an action. The action in question is an NQP method with the same name as the rule that calls it. NQP rules can be called with two different function signatures:

 method name ($/) { ... }

And with a key:

 method name($/, $key) { ... }

Here's an example that shows how the keys are used:

 rule cavepeople {
      'Fred'  {*}    #= Caveman
    | 'Wilma' {*}    #= Cavewoman
    | 'Dino'  {*}    #= Dinosaur
 }

And here is the rule that tells us the result:

 method cavepeople($/, $key) {
    if($key eq 'Caveman') {
        say "We've found a caveman!";
    } elsif($key eq 'Cavewoman') {
        say "We've found a cavewoman!";
    } elsif($key eq 'Dinosaur') {
        say "A dinosaur isn't a caveperson at all!";
    }
 }

The key is just a string that contains whatever text is on the line after the #= symbol. If we don't have a #= we don't use a $key in our method.

The Match Object $/

The match object $/ may have a funny-looking name, but it's a data structure that's all business. It's both a hash and an array. Plus, since it's a special variable it also gets a special shortcut syntax that can be used to save a few keystrokes:

 $/('Match_item')   is the same as  $<Match_item>
 $/[0]              is the same as  $[0]

In the match object, each item in the hash is named after one of the items that we matched in the rule. So, if we have a file with input "X + 5" and a rule:

 rule introductions {
    <variable> <operator> <number>
 }

Our match object is going to look like this: $/ = ("variable" = "x", "operator" => "+", "number" => "5")>

If we have multiple values with the same name, or items with quantifiers * or + on it, those members of the match object may be arrays. So, if we have the input "A A A B B", and the following rule:

 rule letters {
    <vowel>* <consonant>*
 }

The match object will look like this (in Perl 5 syntax):

 $/ = ("vowel" => ["A", "A", "A"], "consonant" => ["B", "B"])

We can get the number of matches in each group by casting it to a scalar using the $( ) operator:

 $($<vowel>) == 3

Inline PIR

Now that we know what the match object is, we can talk about the inline PIR functionality. In a PGE rule, we can use the {{ }} double curly brackets to go into inline-PIR mode. Inside these brackets arbitrary PIR code can be executed to affect the operation of the parser. We can access the variable $/ directly in the grammar without having to jump into NQP, and actually examine and affect the values in it.

PAST Nodes

The job of NQP is to make abstract syntax trees, and the PCT implementation of syntax trees is implemented in the PAST class. There are many different types of objects in the PAST class, each of which represents a particular program construct. These constructs are relatively common and simple, but there are powerful levels of configuration that allow complicated programming structures to be represented.

Making Trees

Every action has the ability to create a PAST node that represents that action and additional PAST nodes, that are children of that node. Calling the make command on that node adds it into the growing PAST tree that PCT maintains. Once the TOP rule matches successfully and returns, PCT takes that tree and starts the process of optimizing it and converting it into PIR and PBC code for execution.