[DRAFT] PDD 6: Parrot Assembly Language (PASM)

Abstract

The format of Parrot's bytecode assembly language.

Description

Parrot's bytecode can be thought of as a form of machine language for a virtual super CISC machine. It makes sense, then, to define an assembly language for it for those people who may need to generate bytecode directly, rather than indirectly through a high-level language.

{{ NOTE: out-of-date and incomplete. }}

Questions

Implementation

Parrot opcodes take the format of:

  code destination[dest_key], source1[source1_key], source2[source2_key]

The brackets do not denote optional arguments as such; they are literal brackets. A key, together with its brackets, may be left out entirely. If any argument has a key, the assembler substitutes the null key for the arguments that lack one.

Conditional branches take the format:

  code boolean[bool_key], true_dest

The key parameters are optional, and each may be either an integer or a string. If a key is passed, it is associated with the parameter to its left and is assumed to be either an array/list entry number or a hash key. Any time a source or destination can be a PMC register, there may be a key.
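
As a rough illustration of the operand form described above, the following sketch splits an operand into its register and optional key. The helper name parse_operand and the literal forms it accepts (a decimal integer or a double-quoted string without embedded quotes) are assumptions made for this example; they are not taken from the assembler.

```python
import re

# Hypothetical operand pattern: a register, optionally followed by a key in
# real brackets. Key literals are assumed to be a decimal integer or a
# double-quoted string without embedded quotes.
_OPERAND = re.compile(r'^([PSIN])(\d+)(?:\[("[^"]*"|\d+)\])?$')

def parse_operand(text):
    """Return (register, key); key is None when the operand has no key."""
    match = _OPERAND.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid operand: {text!r}")
    prefix, number, key = match.groups()
    register = prefix + number
    if key is None:
        return register, None
    if prefix != "P":
        # Keys are only described for PMC registers.
        raise ValueError(f"only PMC registers take a key: {text!r}")
    if key.startswith('"'):
        return register, key[1:-1]  # string key: treated as a hash key
    return register, int(key)       # integer key: array/list entry number
```

Under these assumptions, P0["name"] yields a string key, P1[3] yields an integer key, and I2 yields no key at all.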

Destinations for conditional branches are integer offsets from the current program counter (PC).

All registers have a type prefix of P, S, I, or N, for PMC, string, integer, and number respectively. While Parrot bytecode does not have a fixed limit on the number of registers, PASM has an implementation limit on the number of addressable registers of each type, currently set at 100 (0-99).
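
The register naming rule can be checked mechanically. The sketch below is one possible validator; the names REGISTER_TYPES, MAX_REGISTERS, and parse_register are hypothetical and exist only for this example.

```python
# Type prefixes and the per-type limit as described in the text.
REGISTER_TYPES = {"P": "PMC", "S": "string", "I": "integer", "N": "number"}
MAX_REGISTERS = 100  # valid register numbers are 0-99

def parse_register(name):
    """Return (type_name, number) for a register name such as 'N19'."""
    prefix, digits = name[:1], name[1:]
    if prefix not in REGISTER_TYPES:
        raise ValueError(f"unknown register type prefix in {name!r}")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"register number must be decimal digits: {name!r}")
    number = int(digits)
    if number >= MAX_REGISTERS:
        raise ValueError(f"register number out of range 0-99: {name!r}")
    return REGISTER_TYPES[prefix], number
```

With this sketch, N19 is accepted as number register 19, while P100 is rejected as out of range.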

Assembly Syntax

All assembly opcodes contain only ASCII lowercase letters, digits, and the underscore.

Assembler directives are prefixed with a dot. These directives are instructions for the assembler and may or may not translate to a PASM instruction.

Every label ends with a colon. A label may contain ASCII letters, digits, and underscores.
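
The three lexical rules above (opcodes, directives, labels) could be expressed as patterns along the following lines. This is only a sketch: the function name classify is hypothetical, and the character set allowed after the dot in a directive is an assumption, since the text specifies only the dot prefix.

```python
import re

OPCODE = re.compile(r"[a-z0-9_]+")        # ASCII lowercase, digits, underscore
LABEL = re.compile(r"[A-Za-z0-9_]+:")     # ASCII letters, digits, underscore, then ':'
DIRECTIVE = re.compile(r"\.[A-Za-z0-9_]+")  # assumed character set after the dot

def classify(token):
    """Classify a single token as 'directive', 'label', 'opcode', or 'unknown'."""
    if DIRECTIVE.fullmatch(token):
        return "directive"
    if LABEL.fullmatch(token):
        return "label"
    if OPCODE.fullmatch(token):
        return "opcode"
    return "unknown"
```

Note that a token containing an uppercase letter and no trailing colon is not a valid opcode under these rules, so the sketch reports it as unknown.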

Namespaces are noted with the .namespace directive. It takes a single parameter, the name of the namespace, in the form of a multi-dimensional key.

Constants can be declared with the .macro_const directive. It takes two parameters: the name of the constant and the value.

Subroutine names are noted with the .pcc_sub directive. It takes a single parameter, the name of the subroutine, which is added to the namespace's symbol table. Sub names may contain any valid Unicode alphanumeric characters and the underscore. The .pcc_sub directive may take a flag to indicate when the sub should be invoked. The following flags are available: :main indicates that execution should start at the specified subroutine; :immediate or :postcomp indicates that the sub should be run immediately after compilation; :load indicates that the sub should be executed when its bytecode segment is loaded; :init indicates that the sub should be run when the file is run directly.

Constants don't need to be named and put in a separate section of the assembly source. The assembler will take care of putting them in the appropriate part of the generated bytecode.

Below is an overview of the grammar of a PASM file.

 pasm_file:
   [ pasm_line '\n' ]*

 pasm_line:
     pasm_instruction
   | constant_directive
   | namespace_directive

 pasm_instruction:
   [ [ sub_directive ]? label ]? instruction

 sub_directive:
   ".pcc_sub" [ sub_flag ]?

 sub_flag:
   ":init" | ":main" | ":load" | ":postcomp" | ":immediate" | ":anon"

 label:
   identifier ":"

 constant_directive:
   ".macro_const" identifier literal

 namespace_directive:
   ".namespace" "[" multi_dimensional_key "]"

 multi_dimensional_key:
   quoted_string [ ";" quoted_string ]*
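
To make the namespace_directive and multi_dimensional_key rules concrete, here is a minimal sketch of a parser for that one production. The function name parse_namespace is hypothetical, and quoted_string is assumed to be a double-quoted string without embedded or escaped quotes, which the grammar above does not specify.

```python
import re

# ".namespace" "[" multi_dimensional_key "]"
_NAMESPACE = re.compile(r'^\.namespace\s*\[\s*(.*?)\s*\]$')
# quoted_string [ ";" quoted_string ]*  (assumed: no escaped quotes)
_KEY = re.compile(r'"[^"]*"(?:\s*;\s*"[^"]*")*')

def parse_namespace(line):
    """Return the list of key components of a .namespace directive."""
    match = _NAMESPACE.match(line.strip())
    if match is None or not _KEY.fullmatch(match.group(1)):
        raise ValueError(f"not a valid .namespace directive: {line!r}")
    # Each quoted string is one dimension of the key.
    return re.findall(r'"([^"]*)"', match.group(1))
```

A directive such as .namespace ["Foo"; "Bar"] would yield the components Foo and Bar, while an unquoted or empty key is rejected.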

Opcode List

There may be multiple (but unlisted) versions of an opcode. If an opcode takes a register that might be keyed, the keyed version of the opcode has a _k suffix. If an opcode might take multiple types of registers for a single parameter, the opcode function really has a _x suffix, where x is either P, S, I, or N, depending on whether a PMC, string, integer, or numeric register is involved. Writing the suffix is not an error, but it is unnecessary, as the assembler can infer the information from the code.

In those cases where an opcode can take several types of registers, and more than one of the sources or destinations are of variable type, then the register is passed in extended format. An extended format register number is of the form:

     register_number | register_type

where register_type is 0x100, 0x200, 0x400, or 0x800 for PMC, string, integer, or number respectively. So I19 would be 0x413, and N19 would be 0x813.
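
The extended format is a plain bitwise OR, which the sketch below illustrates. The names REGISTER_TYPE_BITS, encode_extended, and decode_extended are hypothetical; the decoder additionally assumes that the register number occupies the low byte and the type flag the next four bits, which follows from the listed values but is not stated explicitly.

```python
# Type flags as listed in the text: PMC, string, integer, number.
REGISTER_TYPE_BITS = {"P": 0x100, "S": 0x200, "I": 0x400, "N": 0x800}

def encode_extended(register):
    """Encode a register name such as 'I19' as register_number | register_type."""
    prefix, number = register[0], int(register[1:])
    return number | REGISTER_TYPE_BITS[prefix]

def decode_extended(value):
    """Inverse of encode_extended (assumes number in bits 0-7, type in bits 8-11)."""
    type_bits = value & 0xF00
    number = value & 0xFF
    for prefix, bits in REGISTER_TYPE_BITS.items():
        if bits == type_bits:
            return f"{prefix}{number}"
    raise ValueError(f"not a valid extended register: {value:#x}")

print(hex(encode_extended("I19")))  # 0x413
print(hex(encode_extended("P5")))   # 0x105
```

Because register numbers are limited to 0-99, they always fit below the type flags, so encoding and decoding round-trip cleanly.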

Note: Instructions tagged with a * will call a vtable function to handle the instruction if used on PMC registers.

In all cases, the letters x, y, and z refer to register numbers. The letter t refers to a generic register (P, S, I, or N). A lowercase p, s, i, or n means either a register or constant of the appropriate type (PMC, string, integer, or number).

References

None.