The .NET CLI is a stack-based virtual machine, whereas Parrot is register-based. This document explains what that difference means and how the challenges it presents will be dealt with. On a register machine, adding two values from memory looks something like this:
load r0, SOME_ADDRESS
load r1, SOME_OTHER_ADDRESS
add r2, r0, r1
store r2, YET_ANOTHER_ADDRESS
Here two values are loaded from memory into registers r0 and r1. The add operation adds the contents of r0 and r1 and stores the result in r2. Finally, the contents of r2 are stored back to memory.
Just as some hardware CPUs have different types of registers (often one set for integers and another for floating point numbers), Parrot has 4 register types. Integer (I) registers hold native integers. Number (N) registers hold native floating point numbers, usually an IEEE 754 double on platforms where this is supported. String (S) registers hold references to strings. PMC (P) registers hold references to PMCs (basically, any more complex type).
In PASM (Parrot Assembly Language), registers are denoted I0, I1, I2, etc. The numbers index into the array of registers of the given type.
set S0, "Hello world!\n"
print S0
end
In PIR (Parrot Intermediate Representation), virtual and named registers are introduced. Virtual registers are denoted as $I0, $I1, $I2, etc.
.sub main :main
set $S0, "Hello world!\n"
print $S0
end
.end
Named registers are introduced using the ".local" syntax.
.sub main :main
.local string hello
set hello, "Hello world!\n"
print hello
end
.end
PIR is a thin layer on top of PASM. The virtual registers and named registers get mapped to real Parrot registers, at the time of writing using a graph coloring algorithm.
Unlike hardware CPUs, where a register is a physical area of storage and there is a fixed number of them, Parrot's registers are simply an array in memory. Initially Parrot mirrored this and provided a fixed number (32) of registers of each type. If PIR programs needed more registers than were available, P31 was reserved to hold a spilling array or stack. Later it was realized that there was no reason not to have variable sized register frames (a register frame being the set of registers corresponding to the currently executing unit of code, e.g. a subroutine, closure or coroutine). This eliminated the expense of spilling and simplified register allocation greatly. It also means that less memory is consumed by register frames for small functions.
This translator emits PIR. This means that it can freely use virtual registers as needed and rely on the PIR compiler to minimize the number of real registers that are needed. This still leaves a lot of work to do in converting stack code into good register code, but it simplifies the problem by taking register allocation out of the translator's hands. Consider first how a stack machine evaluates an expression such as A + B*C:
push A # Stack contains 1 item: A
push B # Stack contains 2 items: A and B
push C # Stack contains 3 items: A, B and C
mul # Stack contains 2 items: A and B*C
add # Stack contains 1 item: A + B*C
In practice the push instructions would actually be instructions to load a local variable or argument of the current method, a field of the current object/class or some other global data. Instructions exist to store data on the top of the stack back to these locations once computation has taken place.
The stack locations used in the above example are "untyped". A, B and C may have been integers, and this code could be followed by identical code using D, E and F, which could be floating point numbers. Stack positions that previously held integers are now used to hold floating point numbers. Data flow analysis can be used to find the types of the items currently on the stack as needed. In a sense, the locations are transiently typed.
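To make this concrete, here is a minimal Python sketch (an illustration only; neither the CLI nor Parrot works this way internally) of an untyped evaluation stack whose slots hold integers on one pass and floating point numbers on the next:

```python
# Hypothetical untyped evaluation stack: positions have no fixed type.
stack = []

def push(v):
    stack.append(v)

def binop(f):
    # Pop the two topmost operands, apply f, push the result.
    b, a = stack.pop(), stack.pop()
    push(f(a, b))

# First pass: integers A=2, B=3, C=4 evaluating A + B*C.
for v in (2, 3, 4):
    push(v)
binop(lambda a, b: a * b)        # stack now holds [2, 12]
binop(lambda a, b: a + b)        # stack now holds [14]
int_result = stack.pop()

# Second pass: the very same stack positions now hold floats.
for v in (0.5, 1.5, 2.0):
    push(v)
binop(lambda a, b: a * b)
binop(lambda a, b: a + b)
float_result = stack.pop()

print(int_result, float_result)  # 14 3.5
```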
The .NET CLI, while supporting a number of data types (integers of various sizes and signedness, floating point numbers and more complex reference types), permits only four numeric data types on the evaluation stack itself: 32-bit integers, 64-bit integers, native integers and a floating point type. Smaller integer types are widened and narrowed as required on load and store.
The .NET CLI places an additional constraint on the instruction stream. The specification, in Partition I Clause 12.3.2.1, states that "the type state of the stack (the stack depth and types of each element on the stack) at any given point in a program shall be identical for all possible control flow paths". Therefore, a program like the following one would be forbidden.
if A # if operation skips the next instruction if A is false
push X
...
if A
pop X
In the region of instructions marked "..." the stack depth is unknown; it depends on whether A is true or false. The two control flow paths that are possible upon reaching "..." do not result in identical stack type states. The property that makes code like this invalid means that the stack pointer can always be determined statically; that is, at translation time.
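The rule can be illustrated with a small abstract interpreter that tracks only stack depth along every control flow path. This is a hypothetical sketch with an invented instruction encoding, modelling only straight-line ops and conditional branches; a real CLI verifier also tracks the type of each stack element:

```python
# Walk all control flow paths, recording the stack depth on entry to
# each instruction; reject the program if two paths ever disagree.
def stack_depths_consistent(code):
    # code: list of (opcode, branch_target_or_None)
    effect = {"push": 1, "pop": -1, "brfalse": -1, "nop": 0}
    depths = {}                  # instruction index -> depth on entry
    work = [(0, 0)]              # (index, depth on entry)
    while work:
        i, d = work.pop()
        if i >= len(code):
            continue             # fell off the end of the program
        if i in depths:
            if depths[i] != d:
                return False     # two paths disagree: invalid program
            continue
        depths[i] = d
        op, target = code[i]
        d += effect[op]
        if op == "brfalse":      # conditional branch: both successors
            work.append((target, d))
        work.append((i + 1, d))
    return True

# The forbidden pattern from the text: X is pushed only when A is true,
# so the depth at the join point depends on the path taken.
bad = [("push", None),       # push A (the condition)
       ("brfalse", 3),       # consume A; skip the push when A is false
       ("push", None),       # push X (only on the A-is-true path)
       ("nop", None)]        # "...": depth is 0 or 1 here -> invalid
good = [("push", None), ("brfalse", 3), ("nop", None), ("nop", None)]

print(stack_depths_consistent(bad), stack_depths_consistent(good))
# False True
```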
.sub fact
# Get input parameter.
.param int n
# return (n > 1 ? n * fact(n - 1) : 1)
.local int result
if n > 1 goto recurse
result = 1
goto return
recurse:
$I0 = n - 1
result = fact($I0)
result *= n
return:
.return (result)
.end
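For comparison, here is the same subroutine written directly in Python, a hypothetical aid that mirrors the PIR's branch structure rather than using a conditional expression:

```python
# Python rendering of the PIR 'fact' sub above.
def fact(n):
    if n > 1:                 # "if n > 1 goto recurse"
        result = fact(n - 1)  # "result = fact($I0)"
        result *= n           # "result *= n"
    else:
        result = 1            # "result = 1"
    return result             # ".return (result)"

print(fact(5))  # 120
```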
Calling a function is written much as it would be in a high level language: the function name with arguments in parentheses. Parameters are retrieved with the ".param" directive, each one being stored in a named virtual register. .NET CLI code always retrieves arguments by number, the number being a constant and thus known at translation time. Therefore, a naming scheme that has the first parameter named "arg0", the second parameter named "arg1" and so forth will be sufficient.
Local variables are declared with the ".local" directive.
.local int name
.NET CLI code likewise retrieves local variables by number, the number being a constant and thus known at translation time. Therefore, a naming scheme that has the first local named "local0", the second local named "local1" and so forth will be sufficient.
Translation of loads from arguments or locals can simply be a push - a one to one instruction mapping.
push stack, arg1
push stack, local2
Translation of loads from fields involves two instructions: one to do the field access and store the result in a register, and another to push the contents of that register onto the stack. Store operations have the same costs as load operations, depending on the storage location. At worst, a load or store translates to two instructions, with one in the common case.
While this sounds reasonable so far, the real weakness of this approach is made clear when operations are being performed. For example, an addition instruction under this scheme becomes 4 register operations.
pop stack, $I0
pop stack, $I1
add $I0, $I1
push stack, $I0
Maintaining a stack has overhead too; under the hood it translates to doing v-table calls to the push and pop method of an array PMC. This will adversely affect performance beyond simply having more operations to dispatch (or JIT).
The whole sequence of pushing a local and an argument onto the stack, adding them together and storing the result in another local variable (4 instructions in stack code) can actually become a single register instruction.
add local3, arg1, local2
Where local3 is the destination for the result and arg1 and local2 are the operands for the add instruction.
One way to exploit this is a lazy stack: load instructions emit no code, but instead push the name of the source location onto a translation-time stack. When an operation is followed by a store, one instruction of lookahead allows a single register instruction to be emitted.

STACK CODE     LAZY STACK        REGISTER CODE
LOOP:                            LOOP:
ldlocal 1      local1
ldarg 1        local1, arg1
if_ge END                        if local1 >= arg1 goto END
ldlocal 2      local2
ldarg 2        local2, arg2
add
stlocal 2      --> LOOKAHEAD     add local2, local2, arg2
ldlocal 1      local1
iconst 1       local1, 1
add
stlocal 1      --> LOOKAHEAD     add local1, local1, 1
br LOOP                          goto LOOP
END:                             END:
ldlocal 2      local2
ret                              .return(local2)
In this example, the register code produced is very good - as good as one could produce by hand. However, it still requires a stack when a result is used by a non-store instruction.
STACK CODE     LAZY STACK          REGISTER CODE
ldarg 1        arg1
ldarg 2        arg1, arg2
ldarg 3        arg1, arg2, arg3
add            arg1                add $I0, arg2, arg3
                                   push stack, $I0
mul            arg1                pop stack, $I0
                                   mul local1, arg1, $I0
stlocal 1      --> LOOKAHEAD
In addition to this, without analysing the basic blocks[2] that make up the code to determine when it can be avoided, it is necessary to push all cached items onto the stack before a branch or a conditional branch takes place or a label is found. While this, plus a full set of optimizations to minimize stack use, is possible, this path will not be explored. Given that there are essentially unlimited registers available, it would be desirable to do away with the stack completely. Stack operations cost v-table calls, which harms performance and makes poor use of the Parrot execution model.
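For illustration, the lazy pushes/pops algorithm with one instruction of lookahead can be sketched as follows. The opcode names and label handling are simplified inventions, and a real translator would also need to flush the lazy stack at basic block boundaries, which the two examples above happen not to require:

```python
# Lazy-stack translation: loads push names, operations consult one
# instruction of lookahead to fuse with a following store.
def translate_lazy(code):
    out, lazy = [], []           # emitted code; names not yet loaded
    temps = 0

    def use():
        # Fetch an operand; spilled values are restored from the stack.
        v = lazy.pop()
        if isinstance(v, tuple):
            out.append(f"pop stack, {v[1]}")
            return v[1]
        return v

    i = 0
    while i < len(code):
        ins = code[i]
        if ins.endswith(":"):
            out.append(ins)                  # label: block boundary
        else:
            op, *args = ins.split()
            if op == "ldlocal":
                lazy.append("local" + args[0])
            elif op == "ldarg":
                lazy.append("arg" + args[0])
            elif op == "iconst":
                lazy.append(args[0])
            elif op in ("add", "mul"):
                b, a = use(), use()
                nxt = code[i + 1].split() if i + 1 < len(code) else []
                if nxt[:1] == ["stlocal"]:   # lookahead: fuse with store
                    out.append(f"{op} local{nxt[1]}, {a}, {b}")
                    i += 1                   # the stlocal is consumed too
                else:                        # result needed later: spill
                    tmp = f"$I{temps}"; temps += 1
                    out.append(f"{op} {tmp}, {a}, {b}")
                    out.append(f"push stack, {tmp}")
                    lazy.append(("spilled", tmp))
            elif op == "if_ge":
                b, a = use(), use()
                out.append(f"if {a} >= {b} goto {args[0]}")
            elif op == "br":
                out.append(f"goto {args[0]}")
            elif op == "ret":
                out.append(f".return({use()})")
        i += 1
    return out

# The second example above: local1 = arg1 * (arg2 + arg3).
print(translate_lazy(["ldarg 1", "ldarg 2", "ldarg 3",
                      "add", "mul", "stlocal 1"]))
```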
An alternative is to map each stack location directly to a virtual register. This means that each stack instruction becomes one register instruction, provided there is a single register instruction with equivalent semantics to the stack instruction (that is, some .NET instructions may map to multiple Parrot instructions independent of the stack to register translation method used).
The following example demonstrates the translation of some stack code to equivalent register code for a trivial example. Note that the stack pointer is internal to the translator and simply denotes which virtual registers hold the operands. Note also that the "set" instruction is how Parrot spells "move".
STACK CODE     REGISTER CODE              STACK POINTER (BEFORE/AFTER)
LOOP:          LOOP:                      0 / 0
ldlocal 1      set $I0, local1            0 / 1
ldarg 1        set $I1, arg1              1 / 2
if_ge END      if $I0 >= $I1 goto END     2 / 0
ldlocal 2      set $I0, local2            0 / 1
ldarg 2        set $I1, arg2              1 / 2
add            add $I0, $I0, $I1          2 / 1
stlocal 2      set local2, $I0            1 / 0
ldlocal 1      set $I0, local1            0 / 1
iconst 1       set $I1, 1                 1 / 2
add            add $I0, $I0, $I1          2 / 1
stlocal 1      set local1, $I0            1 / 0
br LOOP        goto LOOP                  0 / 0
END:           END:                       0 / 0
ldlocal 2      set $I0, local2            0 / 1
ret            .return($I0)               1 / 0
Note that this is a massive simplification; it is only possible to translate instructions in the order they appear if the stack is always empty whenever a branch takes place. In general this is not true, and it must be resolved by identifying basic blocks and propagating the stack pointer at each jump to a block along with the block itself, so that when a basic block is translated the translator knows what the stack pointer is on entry.
The code that has been produced here is far from optimal; however, it uses only registers, not a stack, which is a step in the right direction. It is also now possible for a register code optimizer to improve it.
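The direct mapping just described can be sketched as follows: stack position k always lives in virtual register $Ik, and the translation-time stack pointer is just a counter. Opcode names are again simplified inventions for illustration:

```python
# Direct stack-to-register mapping: each stack slot is a fixed
# virtual register, and sp is known statically at every instruction.
def translate_direct(code):
    out, sp = [], 0
    for ins in code:
        if ins.endswith(":"):
            out.append(ins)              # labels pass straight through
            continue
        op, *args = ins.split()
        if op in ("ldlocal", "ldarg", "iconst"):
            prefix = {"ldlocal": "local", "ldarg": "arg", "iconst": ""}[op]
            out.append(f"set $I{sp}, {prefix}{args[0]}")
            sp += 1
        elif op == "add":
            sp -= 1
            out.append(f"add $I{sp - 1}, $I{sp - 1}, $I{sp}")
        elif op == "stlocal":
            sp -= 1
            out.append(f"set local{args[0]}, $I{sp}")
        elif op == "if_ge":
            sp -= 2
            out.append(f"if $I{sp} >= $I{sp + 1} goto {args[0]}")
        elif op == "br":
            out.append(f"goto {args[0]}")
        elif op == "ret":
            sp -= 1
            out.append(f".return($I{sp})")
    return out

print(translate_direct(["ldlocal 1", "iconst 1", "add", "stlocal 1"]))
```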
PIR is textual in nature, and the less of it that needs to be parsed and compiled, the better. While copy propagation could be implemented within the translator as a post-translation optimization, it would likely require an intermediate form for the translated code to avoid a lot of expensive text manipulation. It would be preferable never to generate the moves that would be removed, just as instructions were never generated when this was clearly avoidable in the lazy pushes/pops algorithm presented earlier.
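As an illustration of what such an optimizer could do, here is a naive copy propagation pass over a single basic block in a list-of-strings intermediate form. The instruction handling and the liveness test are deliberately simplistic inventions; "set" copies into virtual registers are substituted into later uses and then dropped if the register is never read again:

```python
def parse(ins):
    # "op dest, src1, src2" -> ("op", ["dest", "src1", "src2"])
    op, _, rest = ins.partition(" ")
    args = [a.strip() for a in rest.split(",")] if rest else []
    return op, args

def reads(op, args):
    # For these ops the first operand is the destination, not a read.
    return args[1:] if op in ("set", "add", "mul") else args

def copy_propagate(block):
    copies, out = {}, []          # register -> known copy source
    for ins in block:
        op, args = parse(ins)
        if op in ("set", "add", "mul"):
            dest = args[0]
            srcs = [copies.get(s, s) for s in args[1:]]
            # Writing dest invalidates any copy to or from it.
            copies = {d: s for d, s in copies.items()
                      if d != dest and s != dest}
            if op == "set" and dest.startswith("$I"):
                copies[dest] = srcs[0]
            out.append(f"{op} " + ", ".join([dest] + srcs))
        else:
            out.append(ins)
            copies = {}           # be conservative at anything else
    return drop_dead_copies(out)

def drop_dead_copies(block):
    result = []
    for i, ins in enumerate(block):
        op, args = parse(ins)
        if op == "set" and args[0].startswith("$I"):
            dest, live = args[0], False
            for later in block[i + 1:]:
                lop, largs = parse(later)
                if dest in reads(lop, largs):
                    live = True
                    break
                if largs and largs[0] == dest:
                    break         # redefined before any read: dead
            if not live:
                continue          # drop the dead copy
        result.append(ins)
    return result

print(copy_propagate(["set $I0, local2", "set $I1, arg2",
                      "add $I0, $I0, $I1", "set local2, $I0"]))
```

On this fragment the pass reduces the four instructions to two, substituting the operands of the add and discarding both now-dead moves.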
XXX TO DO
[1] Java Bytecode to Native Code Translation: The Caffeine Prototype and Preliminary Results. http://portal.acm.org/ft_gateway.cfm?id=243864&type=pdf. Covers performance issues with imitating the stack, the suggestion of mapping stack locations to registers, and additional problems with that approach.
[2] Modern Compiler Design (book). Describes basic blocks.
[3] The Case For Virtual Register Machines.