pirgrammar.pod - The Grammar of languages/PIR
This document provides a more readable grammar of languages/PIR. The actual specification for PIR is a bit more complex. This grammar for humans does not contain error handling and other issues unimportant for this PIR reference.
For bugs and issues, see the section KNOWN ISSUES AND BUGS.
The grammar includes some constructs that are in the IMCC parser, but are not implemented.
Please note that languages/PIR is not the official definition of the PIR language.
The reference implementation of PIR is IMCC, located in parrot/compilers/IMCC. However, languages/PIR tries to be as close to IMCC as possible.
IMCC's grammar could use some cleaning up; languages/PIR might be a basis to start with a clean reimplementation of PIR in C (using Lex/Yacc).
0.2.0
PIR has a number of directives. All directives start with a dot. Macro identifiers (when using a macro, on expansion) also start with a dot (see below). Therefore, it is important not to use any of the PIR directives as a macro identifier. The PIR directives are:
.arg .invocant .pcc_begin
.const .lex .pcc_call
.emit .line .pcc_end_return
.end .loadlib .pcc_end_yield
.endnamespace .local .pcc_end
.eom .meth_call .pragma
.get_results .namespace .return
.globalconst .nci_call .result
.HLL_map .param .sub
.HLL .pcc_begin_return .yield
.include .pcc_begin_yield
PIR has two types of registers: real registers and symbolic or temporary (or virtual if you like) registers. Real registers are actual registers in the Parrot VM. The symbolic, or temporary registers are mapped to those actual registers. Real registers are written like:
[S|N|I|P]n, where n is a non-negative integer.
whereas symbolic registers have a $ prefix, like this: $P10.
Symbolic registers can be thought of as local variable identifiers that don't need a declaration. This saves you from writing .local directives if you're in a hurry. Of course, the code would be more self-documenting if .local declarations were used.
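The two register notations can be told apart mechanically. The following is a minimal sketch in Python (not part of languages/PIR); the function name classify_register is hypothetical, and the sketch assumes register numbers are plain decimal digits, with 0 allowed because this document's own examples use $P0 and I0.

```python
import re

# Real registers: one of S, N, I, P followed by digits (e.g. P0, I10).
REAL_REGISTER = re.compile(r"[SNIP][0-9]+")
# Symbolic registers: the same shape with a leading "$" (e.g. $P10).
SYMBOLIC_REGISTER = re.compile(r"\$[SNIP][0-9]+")


def classify_register(token):
    """Return 'real', 'symbolic', or None if token is not a register."""
    if REAL_REGISTER.fullmatch(token):
        return "real"
    if SYMBOLIC_REGISTER.fullmatch(token):
        return "symbolic"
    return None
```

For instance, classify_register("$P10") gives "symbolic", while an ordinary identifier such as x1 gives None.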
An integer constant is a string of one or more digits. Examples: 0, 42.
A floating-point constant is a string of one or more digits, followed by a dot and one or more digits. Examples: 1.1, 42.567.
A string constant is a single or double quoted series of characters. Examples: 'hello world', "Parrot".
TODO: PMC constants.
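The constant rules above translate directly into regular expressions. This is a sketch in Python under stated assumptions: the name constant_kind is hypothetical, escape sequences inside strings are not handled, and a quoted string may not contain its own quote character.

```python
import re

INT_CONSTANT = re.compile(r"[0-9]+")
FLOAT_CONSTANT = re.compile(r"[0-9]+\.[0-9]+")
# Assumption: no escape sequences; the string ends at the first matching quote.
STRING_CONSTANT = re.compile(r'"[^"]*"' + "|" + r"'[^']*'")


def constant_kind(token):
    """Return 'int', 'float', 'string', or None for a single token."""
    if INT_CONSTANT.fullmatch(token):
        return "int"
    if FLOAT_CONSTANT.fullmatch(token):
        return "float"
    if STRING_CONSTANT.fullmatch(token):
        return "string"
    return None
```

Note that a token such as 1. is rejected, because the rule asks for at least one digit on both sides of the dot.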
An identifier starts with a character from [_a-zA-Z], followed by zero or more characters from [_a-zA-Z0-9].
Examples: x, x1, _foo
A label is an identifier with a colon attached to it.
Examples: LABEL:
A macro identifier is an identifier prefixed with a dot. A macro identifier is used when expanding the macro (on usage), not in the macro definition.
Examples: .myMacro
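Identifiers, labels and macro identifiers share one character pattern and differ only in the attached colon or dot. A minimal Python sketch of that distinction follows; token_kind is a hypothetical name, and the sketch deliberately does not exclude PIR directives, so .local would also be reported as a macro identifier.

```python
import re

# An identifier: [_a-zA-Z] followed by zero or more of [_a-zA-Z0-9].
IDENTIFIER = r"[_a-zA-Z][_a-zA-Z0-9]*"


def token_kind(token):
    """Classify a token as identifier, label, or macro identifier."""
    if re.fullmatch(IDENTIFIER, token):
        return "identifier"
    if re.fullmatch(IDENTIFIER + ":", token):
        return "label"
    if re.fullmatch(r"\." + IDENTIFIER, token):
        return "macro_identifier"
    return None
```

Because directives and macro identifiers look the same at this level, a real lexer would have to check the directive list first, which is why a macro should not be named after a directive.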
A PIR program consists of one or more compilation units. A compilation unit is a global, sub, constant or macro definition, or a pragma or emit block. PIR is a line-oriented language, which means that each statement ends in a newline (indicated as "nl"). Moreover, compilation units are always separated by a newline. Each of the different compilation units is discussed in this document.
program:
compilation_unit [ nl compilation_unit ]*
compilation_unit:
global_def
| sub_def
| const_def
| expansion
| pragma
| emit
sub_def:
".sub" sub_id sub_pragma* nl body
sub_id:
identifier | string_constant
sub_pragma:
":load"
| ":init"
| ":immediate"
| ":postcomp"
| ":main"
| ":anon"
| ":lex"
| vtable_pragma
| multi_pragma
| outer_pragma
vtable_pragma:
":vtable" parenthesized_string?
parenthesized_string:
"(" string_constant ")"
multi_pragma:
":multi" "(" multi_types? ")"
outer_pragma:
":outer" "(" sub_id ")"
multi_types:
multi_type [ "," multi_type ]*
multi_type:
type
| "_"
| keylist
| identifier
| string_constant
body:
param_decl*
labeled_pir_instr*
".end"
param_decl:
".param" [ [ type identifier ] | register ] [ get_flags | ":unique_reg" ]* nl
get_flags:
[ ":slurpy"
| ":optional"
| ":opt_flag"
| named_flag
]+
named_flag:
":named" parenthesized_string?
The simplest example for a subroutine definition looks like:
.sub foo
# PIR instructions go here
.end
The body of the subroutine can contain PIR instructions. The subroutine can be given one or more flags, indicating the sub should behave in a special way. Below is a list of these flags and their meaning. The flag :unique_reg is discussed in the section defining local declarations.
:load
Run this subroutine during the load_library opcode. :load is ignored if another subroutine in that file is marked with :main. If multiple subs have the :load pragma, the subs are run in source code order.
:init
Run the subroutine when the program is run directly (that is, not loaded as a module). This is different from :load, which runs a subroutine when a library is being loaded. To get both behaviours, use :init :load.
:postcomp
Same as :immediate, except that the sub is not executed when compilation was triggered by a load_bytecode instruction (in a different file).
:immediate
This subroutine is executed immediately after being compiled. (Analogous to BEGIN in Perl 5.)
:main
Indicates that the sub being defined is the entry point of the program. It can be compared to the main function in C.
:method
Indicates the sub being defined is an instance method. The method belongs to the class whose namespace is currently active. (So, to define a method for a class 'Foo', the 'Foo' namespace should be currently active.) In the method body, the object PMC can be referred to with self.
:vtable or :vtable('x')
Indicates the sub being defined replaces a vtable entry. This flag can only be used when defining a method.
:multi(type [, type]*)
Engage in multiple dispatch with the listed types.
:outer('bar')
Indicates the sub being defined is lexically nested within the subroutine 'bar'.
:anon
Do not install this subroutine in the namespace. Allows the subroutine name to be reused.
:lex
Indicates the sub being defined needs to store lexical variables. This flag is not necessary if the sub contains any lexical declarations (see below); in that case the PIR compiler will figure this out by itself. Otherwise, the :lex flag is needed to tell Parrot that the subroutine will store or find lexicals.
The sub flags are listed after the sub name. The subroutine name can also be a string instead of a bareword, as is shown in this example:
.sub 'foo' :load :init :anon
# PIR body
.end
Parameter definitions have the following syntax:
.sub main
.param int argc :optional
.param int has_argc :opt_flag
.param num nParam
.param pmc argv :slurpy
.param string sParam :named('foo')
.param $P0 :named('bar')
# body
.end
As shown, parameter definitions may take flags as well. These flags are listed here:
:slurpy
The parameter should be of type pmc and acts like a container that slurps up all remaining arguments. Details can be found in PDD03 - Parrot Calling Conventions.
:named('x')
The parameter is known in the called sub by name 'x'. The :named flag can also be used without an identifier, in combination with the :flat or :slurpy flag, i.e. on a container holding several values:
.param pmc args :slurpy :named
and
.arg args :flat :named
:optional
Indicates the parameter being defined is optional.
:opt_flag
This flag can be given to a parameter defined after an optional parameter. During runtime, the parameter is automatically given a value, and is not passed by the caller. The value of this parameter indicates whether the previous (optional) parameter was present.
The correct order of the parameters depends on the flag they have.
labeled_pir_instr:
label? instr nl
labeled_pasm_instr:
label? pasm_instr nl
instr:
pir_instr | pasm_instr
NOTE: the rule 'pasm_instr' is not included in this reference grammar. pasm_instr defines the syntax for pure PASM instructions.
pir_instr:
local_decl
| lexical_decl
| const_def
| globalconst_def
| conditional_stat
| assignment_stat
| open_namespace
| close_namespace
| return_stat
| sub_invocation
| macro_invocation
| jump_stat
| source_info
local_decl:
".local" type local_id_list
local_id_list:
local_id [ "," local_id ]*
local_id:
identifier ":unique_reg"?
Local temporary variables can be declared with the .local directive.
.local int i
.local num a, b, c
The optional :unique_reg modifier will force the register allocator to associate the identifier with a unique register for the duration of the compilation unit.
.local int j :unique_reg
lexical_decl:
".lex" string_constant "," target
The declaration
.lex 'i', $P0
indicates that the value in $P0 is stored as a lexical variable, named by 'i'. Once the above lexical declaration is written, and given the following statement:
$P1 = new 'Integer'
then the following two statements have an identical effect:
$P0 = $P1
store_lex "i", $P1
Likewise, these two statements also have an identical effect:
$P1 = $P0
$P1 = find_lex "i"
Instead of a register, one can also specify a local variable, like so:
.local pmc p
.lex 'i', p
The same is true when a parameter should be stored as a lexical:
.param pmc p
.lex 'i', p
So, now it is also clear why .lex 'i', p is not a declaration of p: it needs a separate declaration, because it may either be a .local or a .param. The .lex directive is merely a shortcut for saving and retrieving lexical variables.
const_def:
".const" type identifier "=" constant_expr
.const int answer = 42
defines an integer constant by name 'answer', giving it a value of 42. Note that the constant type and the value type should match, i.e. you cannot assign a floating point number to an integer constant. The PIR parser will check for this.
globalconst_def:
".globalconst" type identifier "=" constant_expr
This directive is similar to const_def, except that once a global constant has been defined, it is accessible from all subroutines.
.sub main :main
.globalconst int answer = 42
foo()
.end
.sub foo
print answer # prints 42
.end
conditional_stat:
[ "if" | "unless" ]
[ [ "null" target ]
| [ simple_expr [ relational_op simple_expr ]? ]
] "goto" identifier
The syntax for if and unless statements is the same, except for the keyword itself. Therefore the examples will use either.
if null $P0 goto L1
Checks whether $P0 is null; if it is, flow of control jumps to label L1.
unless $P0 goto L2
unless x goto L2
unless 1.1 goto L2
Unless $P0, x or 1.1 are 'true', flow of control jumps to L2. When the argument is a PMC (like the first example), true-ness depends on the PMC itself. For instance, in some languages, the number 0 is defined as 'true', in others it is considered 'false' (like C).
if x < y goto L1
if y != z goto L1
are examples that check for the logical expression after if. Any of the relational operators may be used here.
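The three shapes of a conditional statement (null test, truth test, relational comparison) can be recognised with one pattern. The Python sketch below is an illustration only: parse_conditional is a hypothetical name, operands are assumed to be whitespace-separated tokens, and the operand tokens themselves are not validated.

```python
import re

CONDITIONAL = re.compile(
    r"(?P<kw>if|unless)\s+"
    r"(?:null\s+(?P<null_target>\S+)"
    r"|(?P<left>\S+)(?:\s+(?P<op>==|!=|<=|>=|<|>)\s+(?P<right>\S+))?)"
    r"\s+goto\s+(?P<label>[_a-zA-Z][_a-zA-Z0-9]*)"
)


def parse_conditional(line):
    """Return (keyword, test, label), or None if the line is no conditional."""
    match = CONDITIONAL.fullmatch(line.strip())
    if match is None:
        return None
    if match.group("null_target") is not None:
        test = ("null", match.group("null_target"))
    elif match.group("op") is not None:
        test = (match.group("op"), match.group("left"), match.group("right"))
    else:
        # A bare operand is tested for truth.
        test = ("true", match.group("left"))
    return (match.group("kw"), test, match.group("label"))
```

For example, "if x < y goto L1" is split into the keyword if, the test ("<", "x", "y") and the target label L1.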
jump_stat:
"goto" identifier
goto MyLabel
The program will continue running at label 'MyLabel:'.
relational_op:
"==" | "!=" | "<=" | "<" | ">=" | ">"
binary_op:
"+" | "-" | "/" | "**"
| "*" | "%" | "<<" | ">>>"
| ">>" | "&&" | "||" | "~~"
| "|" | "&" | "~" | "."
assign_op:
"+=" | "-=" | "/=" | "%=" | "*=" | ".="
| "&=" | "|=" | "~=" | "<<=" | ">>=" | ">>>="
unary_op:
"!" | "-" | "~"
expression:
simple_expr
| simple_expr binary_op simple_expr
| unary_op simple_expr
simple_expr:
float_constant
| int_constant
| string_constant
| target
42
42 + x
1.1 / 0.1
"hello" . "world"
str1 . str2
-100
~obj
!isSomething
Arithmetic operators are only allowed on floating-point numbers and integer values (or variables of that type). Likewise, string concatenation (".") is only allowed on strings. These checks are not done by the PIR parser.
assignment_stat:
target "=" short_sub_call
| target "=" target keylist
| target "=" expression
| target "=" "new" string_constant
| target "=" "new" keylist
| target "=" "find_type" [ string_constant | string_reg | id ]
| target "=" heredoc
| target assign_op simple_expr
| target keylist "=" simple_expr
| result_var_list "=" short_sub_call
NOTE: the definition of assignment statements is not complete yet. As languages/PIR evolves, this will be completed.
keylist:
"[" keys "]"
keys:
key [ sep key ]*
sep:
"," | ";"
key:
simple_expr
| simple_expr ".."
| ".." simple_expr
| simple_expr ".." simple_expr
result_var_list:
"(" result_vars? ")"
result_vars:
result_var [ "," result_var ]*
result_var:
target get_flags?
$I1 = 1 + 2
$I1 += 1
$P0 = foo()
$I0 = $P0[1]
$I0 = $P0[12.34]
$I0 = $P0["Hello"]
$P0 = new 42 # but this is really not very clear, better use identifiers
$S0 = <<'HELLO'
...
HELLO
.local int a, b, c
(a, b, c) = foo()
NOTE: the heredoc rules are not complete or tested. Some work is required here.
heredoc:
"<<" string_constant nl
heredoc_string
heredoc_label
heredoc_label:
^^ identifier
heredoc_string:
[ \N | \n ]*
.local string str
str = <<'ENDOFSTRING'
this text
is stored in the
variable
named 'str'. Whitespace and newlines
are stored as well.
ENDOFSTRING
Note that the heredoc identifier should be at the beginning of the line; no whitespace in front of it is allowed. Printing str would print:
this text
is stored in the
variable
named 'str'. Whitespace and newlines
are stored as well.
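The rule that the terminating identifier must start in the first column can be sketched as follows. This is Python, not the languages/PIR implementation; collect_heredoc is a hypothetical name, and two details are assumptions: the terminator line must consist of the identifier only, and the newline before the terminator is not stored.

```python
def collect_heredoc(following_lines, label):
    """Collect heredoc text from the lines after the opening statement.

    Returns (text, lines_consumed). Raises ValueError if no line consists
    of the label alone, starting in the first column.
    """
    body = []
    for index, line in enumerate(following_lines):
        # An indented label does not terminate; it is part of the text.
        if line == label:
            return "\n".join(body), index + 1
        body.append(line)
    raise ValueError("unterminated heredoc: " + label)
```

An indented copy of the identifier is therefore kept as ordinary text, and only the unindented line closes the heredoc.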
In IMCC, a heredoc identifier can be specified as an argument, like this:
foo(42, "hello", <<'EOS')
This is a heredoc text argument.
EOS
In IMCC, only one such argument can be specified. The languages/PIR implementation aims to allow for any number of heredoc arguments, like this:
foo(<<'STR1', <<'STR2')
argument 1
STR1
argument 2
STR2
Currently, this is not working.
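One way the intended behaviour could work is to read the heredoc bodies in the order in which their markers appear on the statement line. The Python sketch below only illustrates that idea and is not how languages/PIR or IMCC is implemented; resolve_heredocs is a hypothetical name, only single-quoted markers are recognised, and each terminator line is assumed to consist of the identifier alone.

```python
import re

HEREDOC_MARKER = re.compile(r"<<'([_a-zA-Z][_a-zA-Z0-9]*)'")


def resolve_heredocs(lines):
    """lines[0] is the statement; the bodies follow in marker order.

    Returns ([(label, text), ...], lines_consumed).
    """
    labels = HEREDOC_MARKER.findall(lines[0])
    position = 1
    results = []
    for label in labels:
        body = []
        # Gather text until the terminator for this particular marker.
        while position < len(lines) and lines[position] != label:
            body.append(lines[position])
            position += 1
        if position == len(lines):
            raise ValueError("unterminated heredoc: " + label)
        position += 1  # skip the terminator line itself
        results.append((label, "\n".join(body)))
    return results, position
```

Applied to the five lines of the example above, this pairs STR1 with "argument 1" and STR2 with "argument 2".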
sub_invocation:
long_sub_call | short_sub_call
long_sub_call:
".pcc_begin" nl
arguments
[ method_call | non_method_call] nl
[ local_decl nl ]*
result_values
".pcc_end"
non_method_call:
[ ".pcc_call" | ".nci_call" ] target
method_call:
".invocant" target nl
".meth_call" [ target | string_constant ]
parenthesized_args:
"(" args? ")"
args:
arg [ "," arg ]*
arg:
[ float_constant
| int_constant
| string_constant [ "=>" target ]?
| target
]
set_flags?
arguments:
[ ".arg" simple_expr set_flags? nl ]*
result_values:
[ ".result" target get_flags? nl ]*
set_flags:
[ ":flat"
| named_flag
]+
The long subroutine call syntax is very suitable to be generated by a language compiler targeting Parrot. Its syntax is rather verbose, but easy to read. The minimal invocation looks like this:
.pcc_begin
.pcc_call $P0
.pcc_end
Invoking instance methods is a simple variation:
.pcc_begin
.invocant $P0
.meth_call $P1
.pcc_end
Passing arguments and retrieving return values is done like this:
.pcc_begin
.arg 42
.pcc_call $P0
.local int res
.result res
.pcc_end
Arguments can take flags as well. The following argument flags are defined:
:flat
Flatten the (aggregate) argument. This argument can only be of type pmc.
:named('x')
Pass the denoted argument into the named parameter that is denoted by 'x', like so:
.param int myX :named('x') # the type 'int' is just an example
As was mentioned in the parameter declaration section, the :named flag can be used on an aggregate value in combination with the :flat flag.
.arg myArgs :flat :named
.local pmc arr
arr = new .Array
arr = 2
arr[0] = 42
arr[1] = 43
.pcc_begin
.arg arr :flat
.arg $I0 :named('intArg')
.pcc_call foo
.pcc_end
The Native Calling Interface (NCI) allows for calling C routines, in order to talk to the world outside of Parrot. Its syntax is a slight variation; it uses .nci_call instead of .pcc_call.
.pcc_begin
.nci_call $P0
.pcc_end
short_sub_call:
invocant? [ target | string_constant ] parenthesized_args
invocant:
target"."
The short subroutine call syntax is useful when manually writing PIR code. Its simplest form is:
foo()
Or a method call:
obj.'toString'() # call the method 'toString'
obj.x() # call the method whose name is stored in 'x'.
Note that no spaces are allowed between the invocant and the dot; "obj . 'toString'" is not valid as a method call, because it will be interpreted as a concatenation.
And of course, using the short version, passing arguments can be done as well, including all flags that were defined for the long version. The same example from the 'long subroutine invocation' is now shown in its short version:
.local pmc arr
arr = new .Array
arr = 2
arr[0] = 42
arr[1] = 43
foo(arr :flat, $I0 :named('intArg'))
In order to do a Native Call Interface invocation, the subroutine to be invoked needs to be referenced from a PMC register, as its name is not visible from Parrot. An NCI call looks like this:
.local pmc nci_sub, nci_lib
.local string c_function, signature
nci_lib = loadlib "myLib"
# name of the C function to be called
c_function = "sayHello"
# set signature to "void" (no arguments)
signature = "v"
# get a PMC representing the C function
nci_sub = dlfunc nci_lib, c_function, signature
# and invoke
nci_sub()
return_stat:
long_return_stat
| short_return_stat
| long_yield_stat
| short_yield_stat
| tail_call
long_return_stat:
".pcc_begin_return" nl
return_directive*
".pcc_end_return"
return_directive:
".return" simple_expr set_flags? nl
Returning values from a subroutine is in fact similar to passing arguments to a subroutine. Therefore, the same flags can be used:
.pcc_begin_return
.return 42 :named('answer')
.return $P0 :flat
.pcc_end_return
In this example, the value 42 is passed as the named return value 'answer'. The aggregate value in $P0 is flattened, and each of its values is passed as a return value.
short_return_stat:
".return" parenthesized_args
.return(myVar, "hello", 2.76, 3.14)
Just as the return values in the long return statement could take flags, the short return statement may as well:
.return(42 :named('answer'), $P0 :flat)
long_yield_stat:
".pcc_begin_yield" nl
return_directive*
".pcc_end_yield"
A yield statement works the same as a normal return statement, except that the point where the subroutine was left is stored, so that the subroutine can be resumed from that point as soon as it is invoked again. Returning values is identical to normal return statements.
.sub foo
.pcc_begin_yield
.return 42
.pcc_end_yield
# and later in the sub, one could return another value:
.pcc_begin_yield
.return 43
.pcc_end_yield
.end
# when invoking twice:
foo() # returns 42
foo() # returns 43
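As a rough analogy only, a Python generator shows the same resume-where-you-left-off behaviour. Python is not PIR, and Parrot coroutines differ in detail (in PIR the sub is simply invoked again, whereas Python resumes through next); the names foo and take_two are just illustrative.

```python
def foo():
    # First activation: hand back 42 and remember this point.
    yield 42
    # Resumed here on the next activation: hand back 43.
    yield 43


def take_two():
    """Activate foo twice and collect both yielded values."""
    activation = foo()
    return next(activation), next(activation)
```

Calling take_two() collects the two yielded values in the order in which the yield statements are reached.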
short_yield_stat:
".yield" parenthesized_args
Again, the short version is identical to the short version of the return statement.
.yield("hello", 42)
tail_call:
".return" short_sub_call
.return foo()
Returns the return values from foo. This is implemented by a tail call, which is more efficient than:
.local pmc results
results = foo()
.return(results)
The call to foo can be considered a normal function call with respect to parameters: it can take the exact same format using argument flags. The tail call can also be a method call, like so:
.return obj.'foo'()
open_namespace:
".namespace" identifier
close_namespace:
".endnamespace" identifier
.sub main
.local int x
x = 42
say x
.namespace NESTED
.local int x
x = 43
say x
.endnamespace NESTED
say x
.end
Will print:
42
43
42
Please note that it is not necessary to pair these statements; it is acceptable to open a .namespace without closing it. The scope of the .namespace is limited to the subroutine.
emit:
".emit" nl
labeled_pasm_instr*
".eom"
An emit block only allows PASM instructions, not PIR instructions.
.emit
set I0, 10
new P0, .Integer
ret
_foo:
print "This is PASM subroutine 'foo'"
ret
.eom
expansion:
macro_def
| include
| pasm_constant
include:
".include" string_constant
pasm_constant:
".constant" identifier [ constant_value | register ]
macro_def:
".macro" identifier macro_parameters? nl
macro_body
macro_parameters:
"(" id_list? ")"
macro_body:
<labeled_pir_instr>*
".endm" nl
macro_invocation:
macro_id parenthesized_args?
Note that before a macro body is parsed, some grammar rules are changed. In a macro body, local variable declarations are done using the .macro_local directive. TODO: decide on keyword for this.
The .label directive is available for declaring unique labels.
macro_label:
".label" "$"identifier":"
When the following macro is defined:
.macro add2(n)
inc .n
inc .n
.endm
then one can write in a subroutine:
.sub foo
.local int myNum
myNum = 42
.add2(myNum)
print myNum # prints 44
.end
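Macro expansion as used in this example amounts to textual substitution of dot-prefixed parameter names. The following Python sketch illustrates that idea under stated assumptions; expand_macro is a hypothetical name, arguments are taken as already-split strings, and features such as .label and .macro_local are not modelled.

```python
import re

# A dot followed by an identifier, e.g. ".n" in a macro body.
DOTTED_NAME = re.compile(r"\.([_a-zA-Z][_a-zA-Z0-9]*)")


def expand_macro(params, body_lines, args):
    """Replace each '.param' in the body with the corresponding argument."""
    if len(params) != len(args):
        raise ValueError(
            "macro expects %d argument(s), got %d" % (len(params), len(args))
        )
    bindings = dict(zip(params, args))

    def substitute(match):
        # Dotted names that are not parameters (directives) stay untouched.
        return bindings.get(match.group(1), match.group(0))

    return [DOTTED_NAME.sub(substitute, line) for line in body_lines]
```

Expanding add2 with the argument myNum turns both body lines into "inc myNum", which is why the sub prints 44.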
pragma:
new_operators
| loadlib
| namespace
| hll_mapping
| hll_specifier
| source_info
new_operators:
".pragma" "n_operators" int_constant
loadlib:
".loadlib" string_constant
namespace:
".namespace" [ "[" namespace_id "]" ]?
hll_specifier:
".HLL" string_constant "," string_constant
hll_mapping:
".HLL_map" string_constant "," string_constant
namespace_id:
string_constant [ ";" string_constant ]*
source_info:
".line" int_constant [ "," string_constant ]?
id_list:
identifier [ "," identifier ]*
.include "myLib.pir"
includes the source from the file "myLib.pir" at the point of this directive.
.pragma n_operators 1
makes Parrot automatically create new PMCs when using arithmetic operators, like:
$P1 = new 'Integer'
$P2 = new 'Integer'
$P1 = 42
$P2 = 43
$P0 = $P1 * $P2
# now, $P0 is automatically assigned a newly created PMC.
.line 100
.line 100, "myfile.pir"
NOTE: currently, the line directive is implemented in IMCC as #line. See the PROPOSALS document for more information on this.
.namespace ['Foo'] # namespace Foo
.namespace ['Object';'Foo'] # nested namespace
.namespace # no [ id ] means the root namespace is activated
The first line opens the namespace 'Foo'. When doing Object Oriented programming, this would indicate that sub or method definitions belong to the class 'Foo'. Of course, you can also define namespaces without doing OO-programming.
Please note that this .namespace directive is different from the .namespace directive that is used within subroutines.
.HLL "Lua", "lua_group"
is an example of specifying the High Level Language (HLL) for which the PIR is being generated. It is a shortcut for setting the namespace to 'Lua', and for loading the PMCs in the lua_group library.
.HLL_map "Integer", "LuaNumber"
is a way of telling Parrot that whenever an Integer is created somewhere in the system (C code), a LuaNumber object is created instead.
.loadlib "myLib"
is a shortcut for telling Parrot that the library "myLib" should be loaded when running the program. In fact, it is a shortcut for:
.sub _load :load :anon
loadlib "myLib"
.end
TODO: check flags and syntax for this.
string_constant:
[ encoding_specifier? charset_specifier ]? quoted_string
encoding_specifier:
"utf8:"
charset_specifier:
"ascii:"
| "binary:"
| "unicode:"
| "iso-8859-1:"
type:
"int"
| "num"
| "pmc"
| "string"
target:
identifier | register
A string constant can be written like:
"Hello world"
but if desirable, the character set can be specified:
unicode:"Hello world"
When using the "unicode" character set, one can also specify an encoding specifier; currently only utf8 is allowed:
utf8:unicode:"hello world"
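The string_constant rule nests the optional encoding inside the optional character set, so an encoding can never appear on its own. A Python sketch of that structure follows; parse_string_constant is a hypothetical name, escape sequences are not handled, and the sketch follows the grammar rule literally, so it accepts utf8 in front of any character set rather than only unicode.

```python
import re

# Assumption: no escape sequences; the string ends at the first matching quote.
QUOTED = r'"[^"]*"' + "|" + r"'[^']*'"
STRING_CONSTANT = re.compile(
    r"(?:(?:(?P<encoding>utf8):)?"
    r"(?P<charset>ascii|binary|unicode|iso-8859-1):)?"
    r"(?P<quoted>" + QUOTED + ")"
)


def parse_string_constant(token):
    """Return (encoding, charset, text), or None if token does not match."""
    match = STRING_CONSTANT.fullmatch(token)
    if match is None:
        return None
    text = match.group("quoted")[1:-1]  # strip the surrounding quotes
    return (match.group("encoding"), match.group("charset"), text)
```

A token with an encoding but no character set, such as utf8:"hello", is rejected, as the grammar rule requires.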
IMCC currently allows identifiers to be used as types. During the parse, the identifier is checked to see whether it names a defined class. The built-in types int, num, pmc and string are always available.
A target is something that can be assigned to; it is an L-value (but of course may be read just like an R-value). It is either an identifier or a register.
Klaas-Jan Stol [parrotcode@gmail.com]
Some work should be done on:
Bugs or improvements may be sent to the author, and are of course greatly appreciated. Moreover, if you find any missing constructs that are in IMCC, indications of these would be appreciated as well.
Please see the PROPOSALS document for some proposals of the author to clean up the official grammar of PIR (as defined by the IMCC compiler).
0.2.0
:wrap flag, remove .global directive, remove .sym directive, add .label directive for macros, remove .pcc_sub; remove some comments that are not true any more. In all, it's getting much cleaner!
0.1.4
expansion rule, moved include and macro_def rules to that rule. Added pasm_constant definition.
0.1.3
.globalconst.
0.1.2
.immediate, it is :immediate, and thus not a PIR directive, but a flag. This was a mistake.
.globalconst
:unique_reg to allowed flags for incoming parameters.
0.1.1
0.1