TITLE

IMCC and Parrot Programming for Compiler Developers - Frequently Asked Questions

VERSION

Revision 0.1 - 03 December 2003: Initial creation as of Parrot version 0.0.13 by Melvin Smith

GENERAL QUESTIONS

What is Parrot?

Wrong FAQ, start with the Parrot FAQ first. Then come back here because this is where the fun is.

The Parrot FAQ : http://www.parrotcode.org/faq/

What is IMCC?

IMC stands for Intermediate Code; IMCC stands for Intermediate Code Compiler. You will also see the term PIR, which stands for Parrot Intermediate Representation and means the same as IMC; for some reason each Parrot developer has a favorite term. PIR was the original term, whereas IMC seems to be the vernacular. IMC is an intermediate language that either compiles directly to Parrot bytecode or translates to Parrot Assembly language (PASM). It is the preferred target language for compilers for the Parrot Virtual Machine, and it sits halfway between a High Level Language (HLL) and PASM.

What is the history of IMCC?

IMCC was a toy compiler written by Melvin Smith as a little 2-week experiment for another toy language, Cola. It was not originally a part of Parrot, and understandably wasn't designed for public consumption. Parrot's early alpha versions (0.0.6 and earlier) included only the raw Parrot assembler that compiled Parrot Assembly language. This was considered the reference assembler. The Cola compiler, on the other hand, targeted its own little back end compiler that included a register allocator, basic block tracking and medium level expression parsing. The backend compiler was eventually named IMCC and benefited from contributions from Angel Faus, Leo Toetsch, Steve Fink and Sean O'Rourke. The first version of Perl6 written by Sean used IMCC as its backend and that's how it currently exists.

Leopold Toetsch added, among many other things, the ability for IMCC to compile PASM by proxying any instructions that were not valid IMCC through to be assembled as PASM. This was a great improvement. As Parrot's calling convention changed to a continuation style (PCC), and generally became more complex, the PASM instructions required to call or declare subroutines became just as complex. IMCC abstracted some of the convention and eventually the core team stopped using the old reference assembler altogether. Leo integrated IMCC into Parrot and now IMCC is _the_ front-end for the Parrot VM.

Parrot is a VM, why does it need IMCC built in?

Static languages, such as Java, can run on VMs that are dedicated to execution of pre-compiled byte code with no problems. Languages such as Perl, Ruby and Python are not so static. They have support for runtime evaluation and compilation and their parsers are always available. These languages run on their own "dynamic" interpreters.

Since Parrot is specialized to be a dynamic VM, it must be able to compile code on the fly. For this reason, IMCC is written in C and integrated into the VM. IMCC is fast since it does very little type checking, and since most of Parrot's ops are polymorphic, IMCC punts most of the type checking and method dispatch to runtime. This allows extremely fast compile times, which is what scripters need.

How is IMCC different from Parrot Assembly language?

PASM is an assembly language, raw and low-level. PASM does exactly what you say, and each PASM instruction represents a single VM opcode. Assembly language can be tough to debug, simply due to the number of instructions that a high-level compiler generates for a given construct. Assembly language typically has no concept of basic blocks, namespaces, variable tracking, etc. You must track your register usage and take care of saving/restoring values in cases where you run out of registers. This is called spilling.

IMC is medium level and a bit more friendly to write or debug. IMCC also has a built-in register allocator and spiller. IMC has the concept of a "subroutine" unit, complete with local variables and high-level sub call syntax. IMCC also allows unlimited symbolic registers. It will take care of assigning the appropriate register to your variables and will usually find the most efficient mapping so as to use as few registers as possible for a given piece of code. If you use more registers than are currently available, IMCC will generate instructions to save/restore (spill) the registers for you. This is a significant piece of every compiler.

While it is possible to write more efficient code by hand directly in PASM, doing so is rare. IMC is still very close to PASM as far as granularity. It is also common for IMCC to generate instructions that use fewer registers than handwritten PASM. This is good for cache performance.
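
As a rough side-by-side sketch of the difference, here is a small IMC fragment followed by PASM that a person might write by hand for the same job. The variable name count and the specific register numbers (I0, I1) are assumptions for illustration; the registers IMCC actually picks may differ.

        # IMC: a named variable plus a symbolic temporary
        .sub _main
           .local int count
           count = 5
           $I1 = count * 2
           print $I1
           print "\n"
           end
        .end

        # Hand-written PASM: you choose and track the registers yourself
        set I0, 5
        mul I1, I0, 2
        print I1
        print "\n"
        end

In the IMC version your code emitter does not change if the allocator decides to map both values to the same real register.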

Why should I target IMC instead of PASM?

Several reasons. IMC is so much easier to read, understand and debug. When passing snippets back and forth on the Parrot internals list, IMC is preferred since the code is much shorter than the equivalent PASM. In some cases it is necessary to debug the PASM code as bugs in IMCC are found.

Hand writing and debugging of code aside, most IMC code will be compiler generated. In this respect, the most important technical reason to use IMC is the amount of abstraction it provides. IMC now hides most of the Parrot calling conventions and allows different call conventions to be selected via .pragma without changes to the high-level code emitter. This allows Parrot to change somewhat without impacting existing compilers. The workload is balanced between the IMCC team and the compiler authors. The term "modular" springs to mind.

Since development on the old assembler has stopped, IMCC will be the best way to compile bytecode classes complete with metadata and externally linkable symbols. It will still be possible to construct classes on the fly with PASM, but IMC's higher level directives allow it to do compile time construction of certain things and pack them into the bytecode in a way that does not have an equivalent set of Parrot instructions. The PASM assembler may or may not ever catch up with these features.

Can I use IMCC without Parrot?

Not yet. IMCC is currently tightly integrated with the Parrot bytecode format. One goal is to rework IMCC's modularity to make it easy to run separately, but this is not a top priority since IMCC currently only targets Parrot. Eventually IMCC will contain a config option to build without linking the Parrot VM, but IMCC must be able to do lookups of opcodes, so it will require some sort of static opcode metadata.

IMCC PROGRAMMING 101

Hello world?

The basic block of execution of an IMC program is the subroutine. Subs can be simple, with no arguments or returns. Line comments are allowed in IMC using #.

        # Hello world
        .sub _main
           print "Hello world.\n"
           end
        .end

How do I compile and run an IMC module?

Parrot uses the filename extension to detect whether the file is an IMC file (.imc), a Parrot Assembly file (.pasm) or a pre-compiled bytecode file (.pbc).

        parrot hello.imc

How do I see the assembly code that IMC generates?

Use the -o option for Parrot. You can provide an output filename, or the - character which indicates standard output. If the filename has a .pbc extension, IMCC will compile the module and assemble it to bytecode.

Examples:

Create the PASM source from IMC.
Compile to bytecode from IMC.
Dump PASM to screen (my favorite shortcut).
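
Going by the description of -o above, the three cases would look something like the following. The file names hello.pasm and hello.pbc are assumptions chosen to match the earlier hello.imc example.

        # Create the PASM source from IMC
        parrot -o hello.pasm hello.imc

        # Compile to bytecode from IMC
        parrot -o hello.pbc hello.imc

        # Dump PASM to screen
        parrot -o - hello.imc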

Does IMCC do variable interpolation in strings?

No, and it shouldn't. IMC is an intermediate language for compiling high level languages. Interpolation (print "$count items") is a high level concept and the specifics are unique to each language. Perl6 already does interpolation without special support from IMCC.

What are IMC variables?

IMC has two classes of variables: symbolic registers and named variables. Both are mapped to real registers, but there are a few minor differences. Named variables must be declared. They may be global or local, and may be qualified by a namespace. Symbolic registers, on the other hand, do not need declaration, but their scope never extends outside of a subroutine unit. Symbolic registers basically give compiler front ends an easy way to generate code from their parse trees or AST, since compilers have to create temporaries to generate expressions.

Symbolic Registers (or Temporaries)

Symbolic registers have a $ sign for the first character, a single letter (S, N, I or P) for the second character, and one or more digits for the rest. From the second character IMCC determines which set of Parrot registers the symbolic register belongs to.

Example:

        $S1 = "hiya"
        $S2 = $S1 . "mel"
        $I1 = 1 + 2
        $I2 = $I1 * 3

This uses symbolic STRING and INTVAL registers as temporaries. This is the typical sort of code that compilers generate from the syntax tree.

Named Variables

Named variables are either local, global or namespace qualified. Currently IMCC only supports locals transparently; globals are supported, but only with explicit syntax. The way to declare locals in a subroutine is with the .local directive. The .local directive also requires a type (int, num, string or a classname such as PerlArray).

Example:

        .sub _main
           .local int i
           .local num n
           i = 7
           n = 5.003
           end
        .end
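
Named variables and symbolic registers can appear together in the same sub. As a sketch of what a compiler front end might emit for a source expression such as x = (a + b) * (c - d), where the source expression and all of the variable names are assumptions for illustration:

        .sub _main
           .local int a
           .local int b
           .local int c
           .local int d
           .local int x
           a = 1
           b = 2
           c = 7
           d = 3
           $I1 = a + b           # temporary for the left operand
           $I2 = c - d           # temporary for the right operand
           x = $I1 * $I2         # x = (a + b) * (c - d)
           print x
           print "\n"
           end
        .end

The front end only has to hand out a fresh $I number for each subexpression; mapping those temporaries onto real registers is left to IMCC.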

How do I declare global or package variables in IMC?

You can't yet. IMCC still lacks a few features and this is one of them. You can, however, explicitly create global variables at runtime, but currently this only works for PMC types, like so:

        .sub _main
           .local Integer i
           .local Integer j
           i = new Integer
           j = new Integer
           i = 123
           # Create the global
           global "i" = i

           # Retrieve the global
           j = global "i"
           end
        .end

Two new directives are planned for IMC.

.global

.extern

The .global directive will be orthogonal to .local. IMCC will track globals and take care of spilling just like local variables.

Theoretically, when .global is added, the above code segment will look like:

        .global Integer i = 123

        .sub _main
           .local Integer j
           j = i
           end
        .end

The global i will be created and initialized during the bytecode load, and other modules will be able to refer to i if they include the .extern directive like so:

        .extern Integer i
        ...

Parrot will fix up the symbol references at runtime.

IMCC ADVANCED TOPICS

How can I make a library of IMC routines and include it in other Parrot/IMC programs?

This one is very simple. Use the .include directive to include other .imc source files. Do keep in mind that Parrot currently starts execution with the first sub it sees in the bytecode, so if your main program includes external .imc files you need to include them after your "main" start sub. If you .include them first (in typical C or Perl style), Parrot will execute the first sub in the first included source file. This is because .include is a preprocessed directive and simply creates one huge .imc source module.

        #############################
        # dynamic.imc
        #
        .sub _dynamic
           print "_dynamic include and compilation\n"
           .pcc_begin_return
           .pcc_end_return
           end
        .end



        #############################
        # main.imc
        #
        # dynamic compilation with .include
        #
        .sub _main
           print "_main\n"
           _dynamic()
           end
        .end

        .include "dynamic.imc"

The .include directive is not the long-term solution for working with modular bytecode, but Parrot still lacks some infrastructure for linking and running precompiled bytecodes transparently or via an import, so .include is the simplest method. The downside is that all code is compiled on the fly.

How do I precompile a bytecode library and use it in another module?

If .include just isn't good enough for you, and you want to go ahead and use precompiled bytecodes, you can, with some restrictions. You have to explicitly link the symbols at runtime. This isn't too tough: just look up the symbol name and use the PMC you get in return. Subroutine PMCs are globals and are autoloaded by the "load_bytecode" PASM instruction.

The main restriction with current Parrot/IMCC is that you can't use the high level shortcut for calling subs, i.e. _bar(a,b). Instead you have to set up the arguments and the return continuation yourself and call invoke on the Sub PMC. Soon, IMCC and Parrot will support a cleaner way of doing this.

        #######################################################
        # main.imc
        #
        # External subs example
        #
        # This way is the only way that currently works to call
        # an externally defined sub. Eventually we will support
        # "extern" symbol linkage in the bytecode loader but for
        # now you have to do it like so...
        #
        .sub _main
           .local ParrotSub fun

           # load the external bytecode lib
           load_bytecode "subs.pbc"

           # _baz()      <-- this style doesn't work yet, but it will soon

           # Instead, retrieve the global sub that was defined in subs.imc by name
           fun = global "_baz"

           # invokecc sets up the return continuation in P1 for the caller
           invokecc fun
           # Done!

           # Calling a forward declared sub in same module.
           # IMCC resolves _localsub at compile time so we can use the shortcut
           _localsub()
           end
        .end

        .sub _localsub
           print "this is localsub\n"
           end
        .end


        ##################################
        # subs.imc
        #
        # Sample extern sub library
        #
        # Compile this separately to subs.pbc
        #
        .sub _foo
           print "_foo is local to _baz\n"
           .pcc_begin_return
           .pcc_end_return
        .end

        .sub _baz
           print "this is external sub _baz\n"
           _foo()
           .pcc_begin_return
           .pcc_end_return
        .end

How can I compile classes and objects?

Parrot and IMCC do not YET support pre-compiled classes and objects. Until that support is added, you can still achieve the same effect; you just need to do it all dynamically.

Write support routines for new and construct and generate your constructors as simple subroutines. I suggest some sort of simple name mangling (C++/Java) when naming the subs (_TestClass__print_i). Your compiler will need to track field (or member) offsets in the array, or if you want to be even lazier and give up a tiny amount of speed, use a hash and you don't have to track them.
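
As a minimal sketch of the lazier hash-based layout, a field can be stored and fetched by name instead of by offset. This assumes the PerlHash PMC and keyed access with string keys are available in your Parrot build; neither is shown elsewhere in this FAQ, so treat both as assumptions.

        .sub _main
           .local pmc obj
           obj = new PerlHash
           obj["i"] = 1          # obj.i = 1, no offset to track
           $I1 = obj["i"]        # fetch the field by name
           print $I1
           print "\n"
           end
        .end

The trade-off is the one described above: the compiler no longer needs a table of member offsets, at the cost of a hash lookup on every field access.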

That sounds easy, but how would it look?

I'd be surprised if you are writing a compiler and can't figure this one out by yourself, but I'll be nice since this IS supposed to be documentation. It does help to actually see how IMCC does it.

Remember, you only have to use Parrot's builtin mechanisms if your language wishes to interact with other languages, otherwise you can implement high level constructs any way you wish. You'll have to for a while longer, anyway.

The exercise for a destructor is left up to you. You'll quickly notice that supporting destruction with Parrot's garbage collection requires a bit of support from the internals that currently just isn't there. But we plan to get there soon. By the time we DO get there, you won't need the syntax below.

An OOP Example

Hypothetical language Mython (apologies to Python, but it sure is nice and brief):

        class TestClass:
           int i
           method init:
                i = 0
           method print_self:
                printf "%d\n", i

        sub main:
           TestClass obj1, obj2
           obj1 = new TestClass
           obj2 = new TestClass
           obj1.i = 1
           obj2.i = 2
           obj1.print_self()
           obj2.print_self()

Here is the IMCC workaround until we have explicit .class syntax and bytecode freeze/thaw.

        .sub _main
           .local pmc obj1
           .local pmc obj2
           .local pmc meth
           _init_world()                 # _init_world() sets up all classes and globals
           obj1 = __new("TestClass")     # obj1 = new TestClass;
           __ctor(obj1)
           obj2 = __new("TestClass")     # obj2 = new TestClass;
           __ctor(obj2)
           obj1[4] = 1                   # obj1.i = 1;
           obj2[4] = 2                   # obj2.i = 2;
           P2 = obj1                     # obj1.print_self()
           meth = obj1[3]
           invokecc meth
           P2 = obj2                     # obj2.print_self()
           meth = obj2[3]
           invokecc meth
           end
        .end

        # Called only once upon program start
        .sub _init_world
           .local SArray c
           .local Sub meth
           .local PerlString classname
           classname = new PerlString               # class TestClass:
           c = new SArray
           c = 10    # Array size 10
           global "classes::TestClass" = c
           classname = "TestClass"
           c[0] = classname
           meth = global "_TestClass_init"          # method init:
           c[2] = meth
           meth = global "_TestClass_print_self"    # method print_self:
           c[3] = meth
           $P100 = new PerlInt                      # int i
           c[4] = $P100
           # TestClass is now defined, instantiate away
           # Setup any other globals for program
           .pcc_begin_return
           .pcc_end_return
        .end

        # This is generic infrastructure code
        .sub __new
           .param string classname
           $S1 = "classes::" . classname
           # Instantiate a TestClass
           $P11 = global $S1
           P2 = clone $P11                          # Clone does a copy, including members
           .pcc_begin_return                        # Return the new "object"
           .return P2
           .pcc_end_return
        .end

        # This is generic infrastructure code
        .sub __ctor
           .param pmc self
           # Call TestClass constructor
           $P100 = self[2]
           P2 = self       # For method calling convention P2 is the object
           invoke $P100    # Leave P1 with ret continuation (tail chain)
           .pcc_begin_return
           .pcc_end_return
        .end

        .sub _TestClass_init            # method TestClass.init
           .local pmc self
           .local pmc classname
           .local Integer i
           self = P2
           classname = self[0]
           print classname
           print "::constructor\n"
           self[4] = 0                  # self.i = 0;
           .pcc_begin_return
           .pcc_end_return
        .end

        .sub _TestClass_print_self      # method TestClass.print_self
           .local pmc self
           .local pmc i
           self = P2
           i = self[4]                  # printf ("self.i = %d\n", i)
           print "self.i = "
           print i
           print "\n"
           .pcc_begin_return
           .pcc_end_return
        .end

That's not all.

I have lots more to come. If you have suggestions for the FAQ or have an idea for a new feature for IMCC, please email me at melvin.smith@mindspring.com and/or hop on #parrot IRC (see the Parrot FAQ for IRC directions). I'm also on AOL Instant Messenger, handle: MrJoltCola

Happy Hacking.