Parrot Compiler Tools

So far we've talked a lot about low-level Parrot programming with PIR. However, the true power of Parrot is its ability to host programs written in high-level languages such as Perl 6, Python, Ruby, Tcl, and PHP. To run code written in these languages, developers need compilers that convert it into PIR or Parrot bytecode for Parrot to execute. This process is analogous to how traditional compilers convert high-level languages into assembly language or machine code for later assembly or direct execution. However, instead of compiling to the machine code for a particular hardware platform, Parrot's language compilers output platform-independent Parrot code that runs on the virtual machine. Parrot's suite of compiler tools performs all the steps necessary to make this conversion possible: lexical analysis, parsing, optimization, resource allocation, and code generation. When we say things like "lexical analysis" and "parsing", people who have worked on compilers before may anticipate having to write these tools using lex or yacc. The Parrot team is proud to say that this is not the case: Parrot's solutions to these problems are much nicer than that.

Instead of the traditional lexical analyzers and parser generators that have been the mainstay of compiler designers for decades, Parrot uses an advanced set of parsing tools called the Parrot Compiler Tools (PCT). PCT uses a subset of the Perl 6 programming language called Not Quite Perl (NQP) and the Parrot Grammar Engine (PGE), an implementation of Perl 6 grammars, to build compilers for Parrot. We will talk about these in depth in Chapter 5 (PGE) and Chapter 6 (NQP). Rather than writing compilers in traditional low-level languages, we can write them in a modern dynamic language like Perl 6. As a note of interest, this means that the Perl 6 compiler on Parrot is itself being written in Perl 6. This is a mind-boggling process known as bootstrapping.

The language neutrality of the interpreter was a conscious design decision. In the early days of Parrot development the Parrot and Perl 6 projects were closely intertwined, and it would have been easy for the two to overlap and intermingle throughout. However, by keeping the two projects separate and encapsulated, the codebase became cleaner and more manageable, and the door was opened to support a whole host of other dynamic languages equally well. This modular design benefits future language designers, not just designers of current languages. Instead of targeting tools like lex and yacc and having to reimplement low-level features such as garbage collection and dynamic data types, language designers and compiler implementers can leave those details to Parrot and focus on the high-level features of their language: syntax, libraries, and capabilities. Parrot implements all the necessary infrastructure and exposes a rich interface that all programming languages can make use of. In fact, since Parrot aims to support a wide variety of these languages, it provides more features than any one of them would need.

For the benefit of its high-level languages, Parrot supports a number of important features: a robust exceptions system, compilation into platform-independent bytecode, a clean extension and embedding interface, just-in-time compilation to machine code, native library interface mechanisms, garbage collection, support for objects and classes, and a robust concurrency model. Compiler designers can use all of these immediately without having to develop their own versions from the ground up. Designing a new language, or implementing a new compiler for an old language, is an easier and faster project than anybody would expect it to be.

Language interoperability is a core goal for Parrot. Different languages are suited to different tasks, and picking which language to use in a large software project is a common planning problem. There's rarely a perfect fit, at least not for all individual parts of a large, complex project. Developers often find themselves settling for one particular language because it has the fewest disadvantages among the alternatives. Instead of forcing people to use a single language for every part of a project, Parrot provides the ability to easily and seamlessly combine multiple languages within it. This opens up the potential to use well-tested libraries from one language and take advantage of clean problem-domain expression in a second, while binding these parts together in a third that elegantly captures the overall architecture. It's about using languages according to their inherent strengths, and mitigating the costs of their weaknesses.

PCT Overview

The Parrot Compiler Tools (PCT) are a collection of tools and classes which handle the creation of a compiler and driver program for a high-level language on Parrot. Many of these tools were originally created by the Perl 6 development team to help with the development of their compiler project. However, PCT is now used to great effect by compiler projects for many different languages. Most developers would agree that writing a compiler using Perl 6 syntax and dynamic language tools is much nicer than having to write one in C, lex, and yacc. More than 40 years after these venerable tools were first created, we think we finally have a superior way to generate compilers. Read on, and we think you will agree.

PCT is composed of several classes that are used to implement various parts of a compiler. Your compiler subclasses these classes to fill in the language-specific details that your language requires. The PCT::HLLCompiler class specifies the interface for the compiler and implements the compiler object that is used at runtime to parse and execute code. The PCT::Grammar and PCT::Grammar::Actions classes are used to create the parser and syntax tree generator, respectively. Creating a new HLL compiler is as easy as subclassing these three classes with methods specific to your language.

Grammars and Action Files

Creating a compiler using PCT requires three basic files: the main entry point file, the grammar specification file, and the grammar actions file. In addition, compilers and the languages they implement often rely on large libraries of built-in routines to help support compile-time and runtime semantics.

PCT allows a customizable workflow, but the basic elements are simple. The source code of the high-level language is passed into the grammar engine, which parses it and returns a special Match object that represents a pattern in the code. This Match object is passed to the action methods, which convert the match into a PAST tree. PCT then takes the PAST tree and uses it to generate PIR code, which can be saved to a file, converted to bytecode, or executed directly.
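The shape of this workflow can be sketched outside of Parrot. The following Python fragment is only an illustration of the stages feeding one another; the function names, the toy "NUMBER + NUMBER" language, and the emitted PIR-like text are all assumptions made for the example and are not part of PCT.

```python
import re

# Toy stand-ins for the PCT stages; none of these names are PCT APIs.

def parse(source):
    """Grammar stage: match 'NUMBER + NUMBER' and return a match object."""
    match = re.fullmatch(r"\s*(\d+)\s*\+\s*(\d+)\s*", source)
    if match is None:
        raise SyntaxError(f"cannot parse {source!r}")
    return match

def build_ast(match):
    """Actions stage: turn the match into a small tree (a nested tuple)."""
    return ("add", ("int", int(match.group(1))), ("int", int(match.group(2))))

def generate(ast):
    """Code generation stage: emit PIR-like text for the tree."""
    _op, left, right = ast
    return f"$I0 = {left[1]} + {right[1]}"

def compile_source(source):
    # Each stage consumes the output of the stage before it.
    return generate(build_ast(parse(source)))

print(compile_source("2 + 40"))
```

In a real PCT compiler the first stage is written as a PGE grammar and the second as NQP action methods; PCT supplies the last stage itself.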

mk_language_shell.pl

The only way creating a new language compiler could be easier is if these files created themselves. Luckily for us PCT includes a tool for automatically generating a new compiler project: mk_language_shell.pl. This program automatically creates a new directory in languages/ for your new language, it creates the three files we mentioned above, it creates starter files for libraries, it creates a makefile to automate the build process, and it creates a basic test harness for performing TAP-based unit testing. All of these are demonstrative stubs and will obviously need to be edited furiously or even completely overwritten, but they give a good idea of what is needed to start on development of the compiler. With a single command though, you can create a working compiler, albeit one for a very limited example language. From there, it's up to you to fill in all the details.

mk_language_shell.pl is designed to be run from within the Parrot repository file structure. You pass it the name of the new project on the command line. There are no real rules about naming, but we do have some guidelines to keep things flowing smoothly. Typically a new implementation of an existing language is given a special project name, not the name of the language itself. Consider the way Perl 5 distributions have names like "ActivePerl" or "Strawberry Perl", or how Python distributions might be "IronPython" or "VPython". So a Ruby-on-Parrot compiler wouldn't be called "ruby"; we would use an implementation name like cardinal. The Tcl compiler on Parrot is likewise called partcl, not just "tcl". Some projects follow the convention of adding the prefix "par-" to their language name, and others try to come up with the name of a bird. These are just some fun possibilities, not limitations of any sort. If you are implementing an entirely new language, it might be a good idea to just name your project after the language you are implementing. Let other implementations come up with creative project names for their work.

From the Parrot directory, you invoke mk_language_shell.pl like this:

  cd languages/
  perl ../tools/build/mk_language_shell.pl <project name>

It will create all the files we described and then you can get to work on your new compiler.

Parsing Fundamentals

Compilers typically consist of at least three components that we've mentioned already: the lexical analyzer, the parser, and the code generator. The lexical analyzer converts the HLL input file into individual tokens. A token may consist of an individual punctuation mark ("+"), an identifier ("myVar"), a keyword ("while"), or any other artifact that cannot be sensibly broken down into smaller parts. The parser takes a stream of these input tokens and attempts to match them against a given pattern, or grammar. The matching process orders the input tokens into an abstract syntax tree (AST), which is a form that the computer can easily work with. The AST is passed to the code generator, which converts it into code in the target language. For something like the GCC C compiler, the target language is machine code. For PCT and Parrot, the target languages are PIR and PBC.
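To make the idea of a token concrete, here is a minimal lexical analyzer sketched in Python. The token classes and the keyword list are assumptions chosen for this example; PGE does not require a separate lexer like this one.

```python
import re

# Hypothetical token classes for a toy language. Order matters: keywords
# must be tried before the more general identifier pattern.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:while|if|else)\b"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("NUMBER",  r"\d+"),
    ("PUNCT",   r"[+\-*/(){}<>=;]"),
    ("SKIP",    r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Split text into (kind, value) pairs, discarding whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("while myVar + 1"))
```

Each pair names the kind of token and the exact text it covers, which is all a parser needs in order to start matching tokens against a grammar.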

Parsers come in two general varieties: top-down and bottom-up. Top-down parsers start with a top-level rule, a rule which is supposed to represent the entire input. They attempt to match various combinations of subrules until the entire input is matched. Bottom-up parsers, on the other hand, start with individual tokens from the lexical analyzer and attempt to combine them into larger and larger patterns until they produce a single top-level pattern.
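A top-down parser can be sketched as one function per rule, with the top-level rule calling its subrules. The Python example below uses a two-rule grammar invented for illustration (it is not a PGE grammar) and takes its input as a plain list of token strings.

```python
def parse_expr(tokens, pos=0):
    """expr := term ('+' term)*   (the top-level rule)"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        node = ("+", node, right)
    return node, pos

def parse_term(tokens, pos):
    """term := NUMBER | '(' expr ')'"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return node, pos + 1
    if token.isdigit():
        return int(token), pos + 1
    raise SyntaxError(f"unexpected token {token!r}")

def parse(tokens):
    """Match the top-level rule and insist that it covers all the input."""
    node, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return node

print(parse(["1", "+", "(", "2", "+", "3", ")"]))
```

Parsing begins at the rule for the whole input and works downward, so the nesting of function calls mirrors the shape of the resulting tree.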

PGE is primarily a top-down parser, although it also contains a bottom-up operator precedence parser for things like mathematical expressions, where bottom-up methods are more efficient. We'll discuss both algorithms and the ways PGE switches between the two in the next chapter on PGE. An in-depth discussion of the various parsing algorithms is well beyond the scope of this book, but we will try to give a coherent overview that will get new compiler writers started quickly.
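For contrast, the bottom-up operator precedence idea can be sketched with two stacks: tokens are shifted onto the stacks and reduced into larger subtrees as soon as precedence allows. This Python fragment is a generic illustration under an assumed precedence table, not a description of how PGE implements its operator precedence parser, and it assumes well-formed input of alternating numbers and binary operators.

```python
# Hypothetical precedence table: larger numbers bind more tightly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_operators(tokens):
    """Shift-reduce parse of a flat infix token list into a nested tuple."""
    operands, operators = [], []

    def reduce_top():
        # Combine the two newest operands under the newest operator.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for token in tokens:
        if token in PRECEDENCE:
            # Reduce while the stacked operator binds at least as tightly;
            # using >= makes every operator left-associative.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]:
                reduce_top()
            operators.append(token)      # shift the operator
        else:
            operands.append(int(token))  # shift an operand
    while operators:
        reduce_top()
    return operands[0]

print(parse_operators(["1", "+", "2", "*", "3"]))
```

Here the tree grows from the individual tokens upward, and no rule for "the whole expression" is consulted until the very last reduction.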

Driver Programs

The driver program for the new compiler must create instances of the various classes needed to run the parser. It must also include the standard function libraries, create global variables, and handle command-line options. Most command-line options are handled by PCT, but there are some behaviors that the driver program will want to override.

PCT programs can, by default, be run in two ways: interactive mode, which runs one statement at a time in the console, and file mode, which loads and runs an entire file at once. For interactive mode, it is necessary to specify information about the prompt that's used and the environment that's expected. Help and error messages need to be written for the user too.
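The difference between the two modes is easy to see in a sketch. The Python below is a hypothetical driver loop, not PCT's implementation: the evaluate, read_line, and write callables are stand-ins that the caller supplies, and the prompt and banner parameters simply mirror the kind of information an interactive mode needs.

```python
def run_file(evaluate, source):
    """File mode: hand the whole program to the compiler at once."""
    return evaluate(source)

def run_interactive(evaluate, read_line, write, prompt="> ", banner=""):
    """Interactive mode: show a banner once, then evaluate line by line."""
    if banner:
        write(banner + "\n")
    while True:
        write(prompt)
        line = read_line()
        if line is None:            # end of input ends the session
            break
        try:
            write(f"{evaluate(line)}\n")
        except Exception as error:  # report the error, keep the session alive
            write(f"error: {error}\n")

# Example session driven by canned input instead of a real console.
lines = iter(["41", "oops", None])
output = []
run_interactive(lambda text: int(text) + 1, lambda: next(lines),
                output.append, prompt="toy> ", banner="Toy 0.1")
print("".join(output))
```

Passing the input and output functions in as arguments keeps the loop independent of any particular console, which also makes it easy to exercise in tests.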

HLLCompiler class

The HLLCompiler class implements a compiler object. The compiler object contains references to the parser grammar and actions, lets you specify the steps involved in the compilation process, and implements some basic functionality that every compiler needs to provide. Let's take a look at a bare-bones main file, like the one that would be created by mk_language_shell.pl:

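  .sub 'onload' :anon :load :init
      # Load PCT and look up the HLLCompiler class
      load_bytecode 'PCT.pbc'
      $P0 = get_hll_global ['PCT'], 'HLLCompiler'
      # Create the compiler object and register it under a language name
      $P1 = $P0.'new'()
      $P1.'language'('MyCompiler')
      # Tell the compiler which grammar and actions classes to use
      $P1.'parsegrammar'('MyCompiler::Grammar')
      $P1.'parseactions'('MyCompiler::Grammar::Actions')
  .end

  .sub 'main' :main
      .param pmc args
      # Retrieve the registered compiler and let it process the command line
      $P0 = compreg 'MyCompiler'
      $P1 = $P0.'command_line'(args)
  .end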

This basic driver consists of two parts. The first is the 'onload' subroutine, flagged :load and :init so that it runs when the file is loaded or started, which creates the driver object as an instance of HLLCompiler, sets the necessary options, and registers the compiler with Parrot. The second is the :main subroutine, where parsing and execution begin. It calls the compreg opcode to retrieve the registered compiler object for the language "MyCompiler" and invokes that compiler object using the options received from the command line.

It's worth noting here that the compreg opcode can be used more than once in a program for different languages. You can create multiple instances of a compiler object for a single language (such as for runtime eval), or you can create compiler objects for multiple languages for easy interoperability. The Rakudo Perl 6 eval function, for instance, uses exactly this mechanism to allow runtime eval of code snippets in other languages:

  eval("...", :lang<Ruby>);
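The registry behind this mechanism can be pictured as a table from language names to compiler objects. The Python sketch below is only an analogy for what compreg provides; the function names and the two toy "compilers" are invented for the example.

```python
# A toy stand-in for a compiler registry; all names here are hypothetical.
_compilers = {}

def register_compiler(language, compiler):
    """Store a compiler under its language name (like 'language' above)."""
    _compilers[language] = compiler

def get_compiler(language):
    """Look a compiler up by name (analogous to the compreg opcode)."""
    try:
        return _compilers[language]
    except KeyError:
        raise LookupError(f"no compiler registered for {language!r}") from None

def eval_in(language, source):
    """Evaluate source with whichever compiler is registered for language."""
    return get_compiler(language)(source)

# Two trivial "compilers" that merely transform their input text.
register_compiler("Upper", str.upper)
register_compiler("Reverse", lambda text: text[::-1])

print(eval_in("Upper", "abc"))
print(eval_in("Reverse", "abc"))
```

Because lookup goes through a name, code written in one language can ask for the compiler of another at runtime without knowing anything else about it.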

HLLCompiler methods

We saw several methods of the HLLCompiler class in the example above: language, parsegrammar, and parseactions. These all need to be called for a new compiler, and should be treated as the bare minimum interface. The language method takes a string argument that is the name of the compiler. The HLLCompiler object uses this name to register the compiler object with Parrot so that it can be retrieved later. The parsegrammar method takes the name of the grammar class that you write with PGE. The parseactions method takes the name of the actions class, written in NQP, that generates the AST for the compiler. There are several other methods that can be used as well:

* commandline_prompt

The commandline_prompt method allows you to specify a custom prompt to be used in interactive mode.

* commandline_banner

The commandline_banner method allows you to specify a banner message that is displayed once when the compiler is executed in interactive mode.

HLLCompiler has other methods that are still being developed and tested, but these are the most important ones for now.