Episode 2: Poking in Compiler Guts

Introduction

In the first episode, we introduced the Parrot Compiler Tools, and generated a very simple language using a shell script provided with the Parrot distribution. We also announced Squaak, a simple programming language developed specially for this tutorial. Squaak will be the case study to show how PCT can be used as a very effective set of tools to implement a language for Parrot. A list of features of Squaak was specified. If you felt lucky, you might even have tried to do the exercise at the end of the previous episode.

In this episode, we will take a closer look at the generated compiler. We shall check out the different stages of the compilation process, and show what's going on in PCT-based compilers.

Under the Hood

Remember how we invoked our compiler in the previous episode? We can pass a file, or invoke the compiler without a command line argument, in which case our compiler enters the interactive mode. Consider the first case, passing the file test.sq, just as we did before:

 $ ./installable_squaak test.sq

When invoking our compiler like this, the file test.sq is compiled and the generated code (bytecode) is executed immediately by Parrot. How does this work, you might wonder. The interpretation of a script is done through a series of transformations, starting at the script source and ending in a format that can be executed by Parrot. Compilers built with the PCT (based on the HLLCompiler class) can take a target option, to show one of the intermediate representations. This option can have the following values, corresponding to the four default compilation phases of an HLLCompiler object:

This is an example of using the target option set to "parse", which will print the parse tree of the input to stdout:

 $ ./installable_squaak --target=parse test.sq

In interactive mode, giving this input:

 say 42;

will print this parse tree (without the line numbers):

  1 "parse" => PMC 'Regex;Match' => "say 42;\n" @ 0 {
  2    <statementlist> => PMC 'Regex;Match' => "say 42;\n" @ 0 {
  3         <statement> => ResizablePMCArray (size:1) [
  4             PMC 'Regex;Match' => "say 42" @ 0 {
  5                 <statement_control> => PMC 'Regex;Match' => "say 42" @ 0 {
  6                     <sym> => PMC 'Regex;Match' => "say" @ 0
  7                     <EXPR> => ResizablePMCArray (size:1) [
  8                         PMC 'Regex;Match' => "42" @ 4 {
  9                             <integer> => PMC 'Regex;Match' => "42" @ 4 {
 10                                 <VALUE> => PMC 'Regex;Match' => "42" @ 4
 11                                 <decint> => \parse[0][0]
 12                             }
 13                         }
 14                     ]
 15                 }
 16             }
 17         ]
 18     }
 19 }

When changing the value of the target option, the output changes into a different representation of the input. Why don't you try that right now?

So, a HLLCompiler object has four compilation phases: parsing, construction of a Parrot Abstract Syntax Tree (PAST), construction of a Parrot Opcode Syntax Tree (POST) and generation of Parrot Intermediate Representation (PIR). After compilation, the generated PIR is executed immediately.

If your compiler needs additional stages, you can add them to your HLLCompiler object. For Squaak, we will not need this, but for details, check out compilers/pct/src/pct/HLLCompiler.pir.

We shall now discuss each compilation phase in more detail. The first two phases, parsing the input and construction of the PAST are executed simultaneously. Therefore, these are discussed together.

Parse phase: match objects and PAST construction During the parsing phase, the input is analyzed using Perl 6's extended regular expressions, known as Rules (see Synopsis 5 for details). When a rule matches some input string, a Match object is created. A Match object is a combined array and hashtable and can be indexed by integers as well as strings. As rules typically consist of other (sub) rules, it is easy to retrieve a certain part of the match. For instance, this rule:

 rule if_statement {
     'if' <expression> 'then' <statement> 'end'
 }

has two other subrules: expression and statement. The match object for the rule if_statement represents the whole string from if to end. You can retrieve a the Match for a subrule by indexing into the Match object using the name of that subrule. For instance, to get the match for <expression>, you would use $/<expression>. (In nqp, $foo<bar> indexes into $foo using the constant string bar as a hash key.)

During the parse phase, the PAST is constructed. There is a small set of PAST node types. For instance, PAST::Var to represent variables (identifiers, such as print) and PAST::Val to represent literal values (for instance, "hello" and 42). Later we'll go through the various PAST nodes in more detail.

Now, you might wonder, at which point exactly is this PAST construction happening? At the end of a successfully matching rule, the rule's parse action is performed. Such a parse action is just a method that has the same name as the rule which triggers it (in this case: if_statement). So, during the parsing phase, several parse actions are executed, each of which builds a piece of the total PAST representing the input string.

A Parrot Abstract Syntax Tree is just a compiler-friendly tree-based representation of your program. It is convenient both for analysis and optimization, and for further transformation into a lower-level representation such as POST.

PAST to POST

After the PAST is constructed, the HLLCompiler transforms this PAST into a Parrot Opcode Syntax Tree (POST). The POST representation is also a tree structure, but these nodes are on a lower abstraction level and correspond very closely to PIR ops. For instance, the PAST node type which represents a while statement (constructed as PAST::Op.new( :pasttype('while') ) ) decomposes into several POST nodes.

The template for a while statement typically consists of a number of labels and jump instructions. On the POST level, the same while statement is represented by a set of nodes, each representing a one instruction or a label. This makes it much easier to transforn POST into executable code.

Usually, as a user of the PCT, you don't need to know details of POST nodes, which is why this will not be discussed in further detail. Use --target=post to see what a POST looks like.

POST to PIR

In the fourth (and final) stage, the POST is transformed into Parrot Intermediate Representation (PIR). As mentioned, transforming a POST into something executable is rather straightforward, as POST nodes already represent individual instructions and labels. Again, normal usage of the PCT does not require you to know any details about this transformation.

And now for the good news...

We established the general data flow of PCT-based compilers which transform plain source code into PIR through four transformations:

Source -> Parse Tree -> PAST -> POST -> PIR

Now, as you're reading this tutorial, you're probably interested in using the PCT to implement Your Favorite Language on top of Parrot. We already saw that a language grammar is expressed in Perl 6 Rules. What about the other transformations? Well, earlier in this episode we mentioned parse actions and that these actions create PAST nodes. After you have written a parse action for each grammar rule, you're done!

Say what?

That's right. Once you have correctly constructed a PAST, your compiler can generate executable PIR, which means you just implemented your first language on top of Parrot. Of course, you'll still need to implement any language specific libraries, but that's beside the point.

PCT-based compilers already know how to transform PAST into POST and how to transform POST into PIR. These transformation stages are already provided by the PCT.

What's next?

In this episode we took a closer look at the internals of a PCT-based compiler. We discussed the four compilation stages which transform an input string (a program or script, depending on your definition) into PAST, POST and finally executable PIR.

The next episodes is where the Fun Stuff is: we will be implementing Squaak for Parrot. Piece by piece, we will implement the parser and the parse actions. Finally, we'll demonstrate John Conway's "Game of Life" running on Parrot, implemented in Squaak.

Exercises

Last episode's exercise was to add a command line banner and prompt for the interactive mode of our compiler. Given the hints that were provided, it was probably not too hard to find the solution, which is shown below. This INIT block can be found in the file src/Squaak/Compiler.pm. The relevant lines are marked with a comment

  INIT {
      Squaak::Compiler.language('Squaak');
      Squaak::Compiler.parsegrammar(Squaak::Grammar);
      Squaak::Compiler.parseactions(Squaak::Actions);

      Squaak::Compiler.commandline_banner("Squaak for Parrot VM.\n"); # set banner
      Squaak::Compiler.commandline_prompt('> '); # set prompt
  }

Starting in the next episode, the exercises will be more interesting. For now, it would be useful to browse around through the source files of the compiler, and see if you understand the relation between the grammar rules in src/Squaak/Grammar.pm and the methods in src/Squaak/Actions.pm. It's also useful to experiment with the --target option described in this episode. If you don't know PIR, now is the time to do some preparation for that. There's sufficient information to be found on PIR, see the References section for details. In the mean time, if you have any suggestions, questions and whatnot, don't hesitate to leave a comment.

References

1. PIR language specification: docs/pdds/pdd19_pir.pod