Parrot Compiler Tools

So far we've talked a lot about low-level Parrot programming with PIR and PASM. However, the true power of Parrot is its ability to host programs written in high-level languages such as Perl 6, Python, Ruby, Tcl, and PHP. To write code in these languages, developers need compilers that convert the language into PIR or PASM (or even directly into Parrot bytecode). People who have worked on compilers before may expect us to use terms like "Lex and Yacc" here, but we promise that we won't.

Instead of the traditional lexical analyzers and parser generators that have been the mainstay of compiler designers for decades, Parrot uses an advanced set of parsing tools called the Parrot Compiler Tools (PCT). PCT uses a subset of the Perl 6 programming language called Not Quite Perl (NQP) and an implementation of the Perl 6 Grammar Engine (PGE) to build compilers for Parrot. Rather than writing compilers in traditional low-level languages, we can write them in a modern dynamic language like Perl 6. On a more interesting note, this means that the Perl 6 compiler is itself being written in Perl 6, a mind-boggling process known as bootstrapping.

The language-neutrality of the interpreter is partially a design decision for modularity. Keeping the implementation independent of the syntax makes the codebase cleaner and easier to maintain. Modular design also benefits future language designers, not just designers of current languages. Instead of targeting lex/yacc and reimplementing low-level features such as garbage collection and dynamic data types, designers can leave the details to Parrot and focus on the high-level features of their language: syntax, libraries, capabilities. Parrot does all the necessary bookkeeping, exposing a rich interface with capabilities that few languages can make full use of.

A robust exception system, the ability to compile to platform-independent bytecode, and a clean extension and embedding mechanism are just some of the necessary and standard features.

Because Parrot supports the features of the major dynamic languages and isn't biased toward a particular syntax, it can run all these languages with little additional effort.

Language interoperability is another core goal. Different languages are suited to different tasks, and picking which language to use in a large software project is a common planning problem. There's never a perfect fit, at least not for all jobs. Developers find themselves settling for the language with the most advantages and the least noticeable disadvantages. The ability to easily combine multiple languages within a single project opens up the potential of using well-tested libraries from one language, taking advantage of clean problem-domain expression in a second, while binding it together in a third that elegantly captures the overall architecture. It's about using languages according to their inherent strengths, and mitigating the cost of their weaknesses.

PCT Overview

PCT is a collection of classes which handle the creation of a compiler and driver program for a high-level language. The PCT::HLLCompiler class handles building the compiler front end while the PCT::Grammar and PCT::Grammar::Actions classes handle building the parser and lexical analyzer. Creating a new HLL compiler is as easy as subclassing these three entities with methods specific to that high-level language.

Grammars and Action Files

Creating a compiler using PCT requires three basic files, plus any additional files needed to implement the language's logic and library:

make_language_shell.pl

The Parrot repository contains a number of helpful utilities for doing some common development and building tasks with Parrot. Many of these utilities are currently written in Perl 5, though some run on Parrot directly, and in future releases more will be migrated to Parrot.

One of the tools of use to new compiler designers and language implementers is make_language_shell.pl. make_language_shell.pl is a tool for automatically creating all the necessary stub files for creating a new compiler for Parrot. It generates the driver file, parser grammar and actions files, builtin functions stub file, makefile, and test harness. All of these are demonstrative stubs and will obviously need to be edited furiously or even completely overwritten, but they give a good idea of what is needed to start on development of the compiler.

make_language_shell.pl is designed to be run from within the Parrot repository file structure. It creates a subfolder in /languages/ with the name of your new language implementation. Typically a new implementation of an existing language is not simply named after the language, but is given some other descriptive name to let users know it is only one of the available implementations. Consider how Perl 5 distributions have names like "ActivePerl" or "Strawberry Perl", or how Python distributions might be "IronPython" or "VPython". If, on the other hand, you are implementing an entirely new language, you don't need to give it a fancy distribution name.

Parsing Fundamentals

Compilers typically consist of three components: the lexical analyzer, the parser, and the code generator. This is an oversimplification; compilers may also have semantic analyzers, symbol tables, optimizers, preprocessors, data flow analyzers, dependency analyzers, and resource allocators, among other components. All these things are internal to Parrot and aren't the concern of the compiler implementer. Plus, these are all well beyond the scope of this book.

The lexical analyzer converts the HLL input file into individual tokens. A token may consist of an individual punctuation mark ("+"), an identifier ("myVar"), a keyword ("while"), or any other artifact that cannot be sensibly broken down. The parser takes a stream of these input tokens and attempts to match them against a given pattern, or grammar. The matching process orders the input tokens into an abstract syntax tree (AST), which is a form that the computer can easily work with. This AST is passed to the code generator, which converts it into code in the target language. For something like the GCC C compiler, the target language is machine code. For PCT and Parrot, the target is PIR or PBC.
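To make the lexical analysis step concrete, here is a minimal sketch of a tokenizer written in Python rather than with PCT. The token categories, the keyword set, and the tokenize function are assumptions made up for this illustration; they are not part of PCT or PGE.

```python
import re

# Hypothetical keyword set for a toy language (not taken from PCT).
KEYWORDS = {"while", "if", "else"}

# One named group per token category; whitespace is matched so it can be skipped.
TOKEN_PATTERN = re.compile(r"""
    (?P<number>\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<punct>[-+*/=<>(){};])
  | (?P<space>\s+)
""", re.VERBOSE)


def tokenize(source):
    """Split source text into (kind, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "space":
            continue
        if kind == "word":
            # A word is a keyword if it is reserved, otherwise an identifier.
            kind = "keyword" if text in KEYWORDS else "identifier"
        tokens.append((kind, text))
    return tokens
```

Calling tokenize("myVar + 1") returns [("identifier", "myVar"), ("punct", "+"), ("number", "1")]. A parser would then consume this list instead of the raw characters.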

Parsers come in two general varieties: top-down and bottom-up. Top-down parsers start with a top-level rule, a rule which is supposed to represent the entire input. They attempt to match various combinations of subrules until the entire input is matched. Bottom-up parsers, on the other hand, start with individual tokens from the lexical analyzer and attempt to combine them into larger and larger patterns until they produce a top-level token.
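The top-down approach can be sketched as a small recursive descent parser in Python. This is an illustration only, not how PGE is implemented: the two-rule grammar, the TopDownParser class, and the tuple-based tree format are all assumptions invented for the example. Parsing begins at the top-level rule (expr), which calls its subrule (term) until the whole input is consumed.

```python
import re


def tokenize(text):
    # Numbers, or any other single non-space character.
    return re.findall(r"\d+|\S", text)


class TopDownParser:
    """Recursive descent for a toy grammar:

        expr := term (('+' | '-') term)*
        term := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse(self):
        # Start from the top-level rule and require it to cover all input.
        tree = self.expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return ("num", int(token))
        if token == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")
```

For example, TopDownParser(tokenize("1 + (2 - 3)")).parse() returns ("+", ("num", 1), ("-", ("num", 2), ("num", 3))).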

PGE itself is a top-down parser, although it also contains a bottom-up operator precedence parser, for things like mathematical expressions where bottom-up methods are more efficient. We'll discuss both, and the methods for switching between the two, throughout this chapter.
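For comparison, the sketch below shows the general idea behind a bottom-up operator precedence parser, again in Python. It is not PGE's implementation. The precedence table, the parse_expression function, and the tuple-based tree are assumptions for illustration, and the code assumes well-formed input with balanced parentheses. Tokens are shifted onto stacks and reduced into larger subtrees whenever the operator on top of the stack binds at least as tightly as the incoming one.

```python
import re

# Hypothetical precedence table: a higher number binds more tightly.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def parse_expression(text):
    """Build a tree for an arithmetic expression using two stacks."""
    # Characters outside this small alphabet are silently ignored.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    operands, operators = [], []

    def reduce_top():
        # Combine the top operator with its two operands into one subtree.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for token in tokens:
        if token.isdigit():
            operands.append(("num", int(token)))
        elif token == "(":
            operators.append(token)
        elif token == ")":
            while operators[-1] != "(":
                reduce_top()
            operators.pop()  # discard the matching "("
        else:
            # Reduce while the stacked operator binds at least as tightly,
            # which also makes equal-precedence operators left-associative.
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                reduce_top()
            operators.append(token)

    while operators:
        reduce_top()
    return operands[0]
```

Here parse_expression("1 + 2 * 3") returns ("+", ("num", 1), ("*", ("num", 2), ("num", 3))): the multiplication is reduced first because it has the higher precedence, without needing a separate grammar rule for each precedence level.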

Driver Programs

The driver program for the new compiler must create instances of the various classes needed to run the parser. It must also include the standard function libraries, create global variables, and handle command-line options. Most command-line options are handled by PCT, but there are some behaviors that the driver program will want to override.

PCT programs can, by default, be run in two ways: interactive mode, which runs one statement at a time in the console, and file mode, which loads and runs an entire file. For interactive mode, it is necessary to specify information about the prompt that's used and the environment that's expected. Help and error messages need to be written for the user too.
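The difference between the two modes can be sketched in Python as follows. This is not the PCT driver or its API: the prompt text, the evaluate stand-in, and the function names are all hypothetical, chosen only to show how a driver might dispatch between reading a whole file and reading one statement at a time.

```python
import sys

PROMPT = "toy> "  # hypothetical prompt text for interactive mode


def evaluate(source):
    """Hypothetical stand-in for the compile-and-run step."""
    return f"compiled {len(source.split())} token(s)"


def run_file(path, out):
    """File mode: load the entire file and run it in one go."""
    with open(path, encoding="utf-8") as handle:
        out.write(evaluate(handle.read()) + "\n")


def run_interactive(inp, out):
    """Interactive mode: prompt, read one statement, run it, repeat."""
    while True:
        out.write(PROMPT)
        out.flush()
        line = inp.readline()
        if not line:  # end of input
            out.write("\n")
            break
        statement = line.strip()
        if statement:  # blank lines just produce a fresh prompt
            out.write(evaluate(statement) + "\n")


def main(argv):
    """Pick file mode when a path is given, interactive mode otherwise."""
    if len(argv) > 1:
        run_file(argv[1], sys.stdout)
    else:
        run_interactive(sys.stdin, sys.stdout)
    return 0
```

A script would call main(sys.argv) at startup. Passing the input and output streams as parameters keeps the interactive loop easy to exercise with in-memory streams instead of a real console.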

HLLCompiler class