parrotcode: Frequently Asked Questions | |
Contents | IMCC |
PIR and Parrot Programming for Compiler Developers - Frequently Asked Questions
Wrong FAQ, start with the Parrot FAQ first. Then come back here because this is where the fun is.
The Parrot FAQ : http://www.parrotcode.org/faq/
IMC stands for Intermediate Code.
IMCC stands for Intermediate Code Compiler.
Most of the time you will encounter the term PIR which is for Parrot Intermediate Representation and means the same thing as IMC.
PIR files use the extension .pir
.
PIR is an intermediate language that compiles either directly to Parrot Byte code. It is a possible target language for compilers targeting the Parrot Virtual Machine. PIR is halfway between a High Level Language (HLL) and Parrot Assembly (PASM).
IMCC is the current implementation of the PIR language. A PGE-based implementation can be found in languages/PIR. A completely handwritten, recursive-descent parser in C can be found in compilers/pirc. Both the PGE-based parser and pirc are a bit different, as it is very difficult to implement the exact language that IMCC implements. Note too, that they are merely parsers, and not finished compilers.
IMCC was a toy compiler written by Melvin Smith as a little 2-week experiment for another toy language, Cola. It was not originally a part of Parrot, and understandably wasn't designed for public consumption. Parrot's early alpha versions (0.0.6 and earlier) included only the raw Parrot assembler that compiled Parrot Assembly language. This was considered the reference assembler. The Cola compiler, on the other hand, targeted its own little back end compiler that included a register allocator, basic block tracking and medium level expression parsing. The backend compiler was eventually named IMCC and benefited from contributions from Angel Faus, Leo Toetsch, Steve Fink and Sean O'Rourke. The first version of Perl6 written by Sean used IMCC as its backend and that's how it currently exists.
Leopold Toetsch added, among many other things, the ability for IMCC to compile PASM by proxying any instructions that were not valid IMCC through to be assembled as PASM. This was a great improvement. As Parrot's calling convention changed to a continuation style (PCC), and generally became more complex, the PASM instructions required to call or declare subroutines became just as complex. IMCC abstracted some of the convention and eventually the core team stopped using the old reference assembler altogether. Leo integrated IMCC into Parrot and now IMCC is the front-end for the Parrot VM.
Static languages, such as Java, can run on VMs that are dedicated to execution of pre-compiled byte code with no problems. Languages such as Perl, Ruby and Python are not so static. They have support for runtime evaluation and compilation and their parsers are always available. These languages run on their own "dynamic" interpreters.
Since Parrot is specialized to be a dynamic VM, it must be able to compile code on the fly. For this reason, IMCC is written in C and integrated into the VM. IMCC is fast since it does very little type checking, and since most of Parrot's ops are polymorphic, IMCC punts most of the type checking and method dispatch to runtime. This allows extremely fast compile times, which is what scripters need.
PASM is an assembly language, raw and low-level. PASM does exactly what you say, and each PASM instruction represents a single VM opcode. Assembly language can be tough to debug, simply due to the amount of instructions that a high-level compiler generates for a given construct. Assembly language typically has no concept of basic blocks, namespaces, variable tracking, etc. You must track your register usage and take care of saving/restoring values in cases where you run out of registers. This is called spilling.
PIR is medium level and a bit more friendly to write or debug. IMCC also has a builtin register allocator and spiller. PIR has the concept of a "subroutine" unit, complete with local variables and high-level sub call syntax. PIR also allows unlimited symbolic registers. It will take care of assigning the appropriate register to your variables and will usually find the most efficient mapping so as to use as few registers as possible for a given piece of code. If you use more registers than are currently available, IMCC will generate instructions to save/restore (spill) the registers for you. This is a significant piece of every compiler.
While it is possible to write more efficient code by hand directly in PASM, it is rare. PIR is still very close to PASM as far as granularity. It is also common for IMCC to generate instructions that use less registers than handwritten PASM. This is good for cache performance.
Several reasons. PIR is so much easier to read, understand and debug. When passing snippets back and forth on the Parrot internals list, IMC is preferred since the code is much shorter than the equivalent PASM. In some cases it is necessary to debug the PASM code as bugs in IMCC are found.
Hand writing and debugging of code aside, most PIR code will be mostly compiler generated. In this respect, the most important technical reason to use PIR is the amount of abstraction it provides. PIR now completely hides the Parrot calling conventions. This allows Parrot to change somewhat without impacting existing compilers. The workload is balanced between the IMCC team and the compiler authors. The term "modular" springs to mind.
Since development on the old assembler has stopped, IMCC will be the best way to compile bytecode classes complete with metadata and externally linkable symbols. It will still be possible to construct classes on the fly with PASM, but PIR's higher level directives allow it to do compile time construction of certain things and pack them into the bytecode in a way that does not have an equivalent set of Parrot instructions. The PASM assembler may or may not ever catch up with these features.
Yes, preferably using the PCT, the Parrot Compiler Toolkit.
Not yet. IMCC is currently tightly integrated to the Parrot bytecode format. An old idea is to rework IMCC's modularity to make it easy to run separately, but this is not a top priority since IMCC currently only targets Parrot. Eventually IMCC will contain a config option to build without linking the Parrot VM, but IMCC must be able to do lookups of opcodes so it will require some sort of static opcode metadata.
The basic block of execution of an IMC program is the subroutine. Subs can be simple, with no arguments or returns. Line comments are allowed in IMC using #.
# Hello world .sub main :main print "Hello world.\n" .end
For more examples see the PIR tutorial in examples/tutorial.
Parrot uses the filename extension to detect whether the file is a PIR file (.pir), a Parrot Assembly file (.pasm) or a pre-compiled bytecode file (.pbc).
parrot hello.pir
Use the -o option for Parrot. You can provide an output filename, or the - character which indicates standard output. If the filename has a .pbc extension, IMCC will compile the module and assemble it to bytecode.
Beware that compiling to PASM is not well supported and might produce broken PASM.
Examples:
parrot -o hello.pasm hello.pir
parrot -o hello.pbc hello.pir
parrot -o - hello.pir
No, and it shouldn't. PIR is an intermediate language for compiling high level languages. Interpolation (print "$count items") is a high level concept and the specifics are unique to each language. Perl6 already does interpolation without special support from IMCC.
PIR has 2 classes of variables, symbolic registers and named variables. Both are mapped to real registers, but there are a few minor differences. Named variables must be declared. They may be global or local, and may be qualified by a namespace. Symbolic registers, on the other hand, do not need declaration, but their scope never extends outside of a subroutine unit. Symbolic registers basically give compiler front ends an easy way to generate code from their parse trees or abstract syntax tree (AST). To generate expressions compilers have to create temporaries.
Symbolic registers have a $ sign for the first character, have a single letter representing the register type [S(tring), N(umber), I(nteger) or P(MC)] for the second character, and one or more digits for the rest.
Example:
$S1 = "hiya" $S2 = $S1 . "mel" $I1 = 1 + 2 $I2 = $I1 * 3
This example uses symbolic STRING and INTVAL registers as temporaries. This is the typical sort of code that compilers generate from the syntax tree.
Named variables are either local or namespace qualified. Currently IMCC only supports locals transparently. However, globals are supported with explicit syntax. The way to declare locals in a subroutine is with the .local directive. The .local directive also requires a type (int, num, string or pmc).
Example:
.sub _main :main .local int i .local num n i = 7 n = 5.003 .end
You can't. You can explicitly create global variables at runtime, however, but it only works for PMC types, like so:
.sub _main :main .local pmc i .local pmc j i = new 'Integer' i = 123 # Create the global global "i" = i # Refer to the global j = global "i" .end
Happy Hacking.
|