NAME ^

pirgrammar.pod - The Grammar of languages/PIR

DESCRIPTION ^

This document provides a more readable grammar of languages/PIR. The actual specification for PIR is a bit more complex. This grammar for humans does not contain error handling and other issues unimportant for this PIR reference.

STATUS ^

For a bugs and issues, see the section KNOWN ISSUES AND BUGS.

The grammar includes some constructs that are in the IMCC parser, but are not implemented.

Please note that languages/PIR is not the official definition of the PIR language. The reference implementation of PIR is IMCC, located in parrot/compilers/IMCC. However, languages/PIR tries to be as close to IMCC as possible. IMCC's grammar could use some cleaning up; languages/PIR might be a basis to start with a clean reimplementation of PIR in C (using Lex/Yacc).

VERSION ^

0.2.0

LEXICAL CONVENTIONS ^

PIR Directives ^

PIR has a number of directives. All directives start with a dot. Macro identifiers (when using a macro, on expansion) also start with a dot (see below). Therefore, it is important not to use any of the PIR directives as a macro identifier. The PIR directives are:

  .arg            .invocant          .pcc_begin
  .const          .lex               .pcc_call
  .emit           .line              .pcc_end_return
  .end            .loadlib           .pcc_end_yield
  .endnamespace   .local             .pcc_end
  .eom            .meth_call         .pragma
  .get_results    .namespace         .return
  .globalconst    .nci_call          .result
  .HLL_map        .param             .sub
  .HLL            .pcc_begin_return  .yield
  .include        .pcc_begin_yield

Registers ^

PIR has two types of registers: real registers and symbolic or temporary (or virtual if you like) registers. Real registers are actual registers in the Parrot VM. The symbolic, or temporary registers are mapped to those actual registers. Real registers are written like:

  [S|N|I|P]n, where n is a positive integer.

whereas symbolic registers have a $ prefix, like this: $P10.

Symbolic registers can be thought of local variable identifiers that don't need a declaration. This prevents you from writing .local directives if you're in a hurry. Of course, it would make the code more self-documenting if .locals would be used.

Constants ^

An integer constant is a string of one or more digits. Examples: 0, 42.

A floatin-point constant is a string of one or more digits, followed by a dot and one or more digits. Examples: 1.1, 42.567

A string constant is a single or double quoted series of characters. Examples: 'hello world', "Parrot".

TODO: PMC constants.

Identifiers ^

An identifier starts with a character from [_a-zA-Z], followed by zero or more characters from [_a-zA-Z0-9].

Examples: x, x1, _foo

Labels ^

A label is an identifier with a colon attached to it.

Examples: LABEL:

Macro identifiers ^

A macro identifier is an identifier prefixed with an dot. A macro identifier is used when expanding the macro (on usage), not in the macro definition.

Examples: .myMacro

GRAMMAR RULES ^

Compilation Units ^

A PIR program consists of one or more compilation units. A compilation unit is a global, sub, constant or macro definition, or a pragma or emit block. PIR is a line oriented language, which means that each statement ends in a newline (indicated as "nl"). Moreover, compilation units are always separated by a newline. Each of the different compilation units are discussed in this document.

  program:
    compilation_unit [ nl compilation_unit ]*

  compilation_unit:
      global_def
    | sub_def
    | const_def
    | expansion
    | pragma
    | emit

Subroutine definitions ^

  sub_def:
    ".sub" sub_id sub_pragmas* nl body

  sub_id:
    identifier | string_constant

  sub_pragma:
      ":load"
    | ":init"
    | ":immediate"
    | ":postcomp"
    | ":main"
    | ":anon"
    | ":lex"
    | vtable_pragma
    | multi_pragma
    | outer_pragma

  vtable_pragma:
    ":vtable" parenthesized_string?

  parenthesized_string:
    "(" string_constant ")"

  multi_pragma:
    ":multi" "(" multi_types? ")"

  outer_pragma:
    ":outer" "(" sub_id ")"

  multi_tyes:
    multi_type [ "," multi_type ]*

  multi_type:
      type
    | "_"
    | keylist
    | identifier
    | string_constant

  body:
    param_decl*
    labeled_pir_instr*
    ".end"

  param_decl:
    ".param"  [ [ type identifier ] | register ] [ get_flags | ":unique_reg" ]* nl

  get_flags:
    [ ":slurpy"
    | ":optional"
    | ":opt_flag"
    | named_flag
    ]+

  named_flag:
    ":named" parenthesized_string?

Examples subroutine

The simplest example for a subroutine definition looks like:

  .sub foo
  # PIR instructions go here
  .end

The body of the subroutine can contain PIR instructions. The subroutine can be given one or more flags, indicating the sub should behave in a special way. Below is a list of these flags and their meaning. The flag :unique_reg is discussed in the section defining local declarations.

The sub flags are listed after the sub name. The subroutine name can also be a string instead of a bareword, as is shown in this example:

  .sub 'foo' :load :init :anon
  # PIR body
  .end

Parameter definitions have the following syntax:

  .sub main
    .param int argc :optional
    .param int has_argc :optional
    .param num nParam
    .param pmc argv :slurpy
    .param string sParam :named('foo')
    .param $P0 :named('bar')
    # body
  .end

As shown, parameter definitions may take flags as well. These flags are listed here:

The correct order of the parameters depends on the flag they have.

PIR instructions ^

  labeled_pir_instr:
    label? instr nl

  labeled_pasm_instr:
    label? pasm_instr nl

  instr:
    pir_instr | pasm_instr

NOTE: the rule 'pasm_instr' is not included in this reference grammar. pasm_instr defines the syntax for pure PASM instructions.

  pir_instr:
      local_decl
    | lexical_decl
    | const_def
    | globalconst_def
    | conditional_stat
    | assignment_stat
    | open_namespace
    | close_namespace
    | return_stat
    | sub_invocation
    | macro_invocation
    | jump_stat
    | source_info

Local declarations ^

  local_decl:
    ".local" type local_id_list

  local_id_list:
    local_id [ "," local_id ]*

  local_id:
    identifier ":unique_reg"?

Examples local declarations

Local temporary variables can be declared by the directives .local.

  .local int i
  .local num a, b, c

The optional :unique_reg modifier will force the register allocator to associate the identifier with a unique register for the duration of the compilation unit.

  .local int j :unique_reg

Lexical declarations ^

  lexical_decl:
    ".lex" string_constant "," target

Example lexical declarations

The declaration

  .lex 'i', $P0

indicates that the value in $P0 is stored as a lexical variable, named by 'i'. Once the above lexical declaration is written, and given the following statement:

  $P1 = new 'Integer'

then the following two statements have an identical effect:

Likewise, these two statements also have an identical effect:

Instead of a register, one can also specify a local variable, like so:

  .local pmc p
  .lex 'i', p

The same is true when a parameter should be stored as a lexical:

  .param pmc p
  .lex 'i', p

So, now it is also clear why .lex 'i', p is not a declaration of p: it needs a separate declaration, because it may either be a .local or a .param. The .lex directive merely is a shortcut for saving and retrieving lexical variables.

Constant definitions ^

  const_def:
    ".const" type identifier "=" constant_expr

Example constant definitions

  .const int answer = 42

defines an integer constant by name 'answer', giving it a value of 42. Note that the constant type and the value type should match, i.e. you cannot assign a floating point number to an integer constant. The PIR parser will check for this.

Global constant definitions ^

  globalconst_def:
    ".globalconst" type identifier "=" constant_expr

Example global constant definitions

This directive is similar to const_def, except that once a global constant has been defined, it is accessible from all subroutines.

  .sub main :main
    .global const int answer = 42
    foo()
  .end

  .sub foo
    print answer # prints 42
  .end

Conditional statements ^

  conditional_stat:
      [ "if" | "unless" ]
    [ [ "null" target "goto" identifier ]
    | [ simple_expr [ relational_op simple_expr ]? ]
    ] "goto" identifier

Examples conditional statements

The syntax for if and unless statements is the same, except for the keyword itself. Therefore the examples will use either.

  if null $P0 goto L1

Checks whether $P0 is null, if it is, flow of control jumps to label L1

  unless $P0 goto L2
  unless x   goto L2
  unless 1.1 goto L2

Unless $P0, x or 1.1 are 'true', flow of control jumps to L2. When the argument is a PMC (like the first example), true-ness depends on the PMC itself. For instance, in some languages, the number 0 is defined as 'true', in others it is considered 'false' (like C).

  if x < y goto L1
  if y != z  goto L1

are examples that check for the logical expression after if. Any of the relational operators may be used here.

Branching statements ^

  jump_stat:
    "goto" identifier

Examples branching statements

  goto MyLabel

The program will continue running at label 'MyLabel:'.

Operators ^

  relational_op:
      "==" | "!=" | "<=" | "<" | <"=" | <""

  binary_op:
      "+"  | "-"   | "/"  | "**"
    | "*"  | "%"   | "<<" | <">>"
    | <">" | "&&"  | "||" | "~~"
    | "|"  | "&"   | "~"  | "."

  assign_op:
      "+=" | "-=" | "/=" | "%="  | "*="  | ".="
    | "&=" | "|=" | "~=" | "<<=" | <">=" | <">>="

  unary_op:
      "!" | "-" | "~"

Expressions ^

  expression:
      simple_expr
    | simple_expr binary_op simple_expr
    | unary_op simple_expr

  simple_expr:
      float_constant
    | int_constant
    | string_constant
    | target

Example expressions

  42
  42 + x
  1.1 / 0.1
  "hello" . "world"
  str1 . str2
  -100
  ~obj
  !isSomething

Arithmetic operators are only allowed on floating-point numbers and integer values (or variables of that type). Likewise, string concatenation (".") is only allowed on strings. These checks are not done by the PIR parser.

Assignments ^

  assignment_stat:
      target "=" short_sub_call
    | target "=" target keylist
    | target "=" expression
    | target "=" "new" string_constant
    | target "=" "new" keylist
    | target "=" "find_type" [ string_constant | string_reg | id ]
    | target "=" heredoc
    | target assign_op simple_expr
    | target keylist "=" simple_expr
    | result_var_list "=" short_sub_call

NOTE: the definition of assignment statements is not complete yet. As languages/PIR evolves, this will be completed.

  keylist:
    "[" keys "]"

  keys:
    key [ sep key ]*

  sep:
    "," | ";"

  key:
      simple_expr
    | simple_expr ".."
    | ".." simple_expr
    | simple_expr ".." simple_expr

  result_var_list:
    "(" result_vars? ")"

  result_vars:
    result_var [ "," result_var ]*

  result_var:
    target get_flags?

Examples assignment statements

  $I1 = 1 + 2
  $I1 += 1
  $P0 = foo()
  $I0 = $P0[1]
  $I0 = $P0[12.34]
  $I0 = $P0["Hello"]
  $P0 = new 42 # but this is really not very clear, better use identifiers

  $S0 = <<'HELLO'
  ...
  HELLO

  .local int a, b, c
  (a, b, c) = foo()

Heredoc ^

NOTE: the heredoc rules are not complete or tested. Some work is required here.

  heredoc:
    "<<" string_constant nl
    heredoc_string
    heredoc_label

  heredoc_label:
    ^^ identifier

  heredoc_string:
    [ \N | \n ]*

Example Heredoc

  .local string str
  str = <<'ENDOFSTRING'
    this text
         is stored in the
               variable
      named 'str'. Whitespace and newlines
    are                  stored as well.
  ENDOFSTRING

Note that the Heredoc identifier should be at the beginning of the line, no whitespace in front of it is allowed. Printing str would print:

    this text
         is stored in the
               variable
      named 'str'. Whitespace and newlines
    are                  stored as well.

In IMCC, a heredoc identifier can be specified as an argument, like this:

    foo(42, "hello", <<'EOS')

    This is a heredoc text argument.

  EOS

In IMCC, only one such argument can be specified. The languages/PIR implementation aims to allow for any number of heredoc arguments, like this:

    foo(<<'STR1', <<'STR2')

    argument 1
  STR1
    argument 2
  STR2

Currently, this is not working.

Invoking subroutines and methods ^

  sub_invocation:
    long_sub_call | short_sub_call

  long_sub_call:
    ".pcc_begin" nl
    arguments
    [ method_call | non_method_call] nl
    [ local_decl nl ]*
    result_values
    ".pcc_end"

  non_method_call:
    [ ".pcc_call" | ".nci_call" ] target

  method_call:
    ".invocant" target nl
    ".meth_call" [ target | string_constant ]

  parenthesized_args:
    "(" args ")"

  args:
    arg [ "," arg ]

  arg:
    [ float_constant
    | int_constant
    | string_constant [ "=>" target ]?
    | target
    ]
    set_flags?


  arguments:
    [ ".arg" simple_expr set_flags? nl ]*

  result_values:
    [ ".result" target get_flags? nl ]*

  set_flags:
    [ ":flat"
    | named_flag
    ]+

Example long subroutine call

The long subroutine call syntax is very suitable to be generated by a language compiler targeting Parrot. Its syntax is rather verbose, but easy to read. The minimal invocation looks like this:

  .pcc_begin
  .pcc_call $P0
  .pcc_end

Invoking instance methods is a simple variation:

  .pcc_begin
  .invocant $P0
  .meth_call $P1
  .pcc_end

Passing arguments and retrieving return values is done like this:

  .pcc_begin
  .arg 42
  .pcc_call $P0
  .local int res
  .result res
  .pcc_end

Arguments can take flags as well. The following argument flags are defined:

  .local pmc arr
  arr = new .Array
  arr = 2
  arr[0] = 42
  arr[1] = 43
  .pcc_begin
  .arg arr :flat
  .arg $I0 :named('intArg')
  .pcc_call foo
  .pcc_end

The Native Calling Interface (NCI) allows for calling C routines, in order to talk to the world outside of Parrot. Its syntax is a slight variation; it uses .nci_call instead of .pcc_call.

  .pcc_begin
  .nci_call $P0
  .pcc_end

Short subroutine invocation ^

  short_sub_call:
    invocant? [ target | string_constant ] parenthesized_args

  invocant:
    target"."

Example short subroutine call

The short subroutine call syntax is useful when manually writing PIR code. Its simplest form is:

  foo()

Or a method call:

  obj.'toString'() # call the method 'toString'
  obj.x() # call the method whose name is stored in 'x'.

Note that no spaces are allowed between the invocant and the dot; "obj . 'toString'" is not valid, this will be interpreted as a concatenation.

And of course, using the short version, passing arguments can be done as well, including all flags that were defined for the long version. The same example from the 'long subroutine invocation' is now shown in its short version:

  .local pmc arr
  arr = new .Array
  arr = 2
  arr[0] = 42
  arr[1] = 43
  foo(arr :flat, $I0 :named('intArg'))

In order to do a Native Call Interface invocation, the subroutine to be invoked needs to be in referenced from a PMC register, as its name is not visible from Parrot. A NCI call looks like this:

  .local pmc nci_sub, nci_lib
  .local string c_function, signature

  nci_lib = loadlib "myLib"

  # name of the C function to be called
  c_function = "sayHello"

  # set signature to "void" (no arguments)
  signature  = "v"

  # get a PMC representing the C function
  nci_sub = dlfunc nci_lib, c_function, signature

  # and invoke
  nci_sub()

Return values from subroutines ^

  return_stat:
      long_return_stat
    | short_return_stat
    | long_yield_stat
    | short_yield_stat
    | tail_call

  long_return_stat:
    ".pcc_begin_return" nl
    return_directive*
    ".pcc_end_return"

  return_directive:
    ".return" simple_expr set_flags? nl

Example long return statement

Returning values from a subroutine is in fact similar to passing arguments to a subroutine. Therefore, the same flags can be used:

  .pcc_begin_return
  .return 42 :named('answer')
  .return $P0 :flat
  .pcc_end_return

In this example, the value 42 is passed into the return value that takes the named return value known by 'answer'. The aggregate value in $P0 is flattened, and each of its values is passed as a return value.

Short return statement ^

  short_return_stat:
    ".return" parenthesized_args

Example short return statement

  .return(myVar, "hello", 2.76, 3.14);

Just as the return values in the long return statement could take flags, the short return statement may as well:

  .return(42 :named('answer'), $P0 :flat)

Long yield statements ^

  long_yield_stat:
    ".pcc_begin_yield" nl
    return_directive*
    ".pcc_end_yield"

Example long yield statement

A yield statement works the same as a normal return value, except that the point where the subroutine was left is stored somewhere, so that the subroutine can be resumed from that point as soon as the subroutine is invoked again. Returning values is identical to normal return statements.

  .sub foo
    .pcc_begin_yield
    .return 42
    .pcc_end_yield

    # and later in the sub, one could return another value:

    .pcc_begin_yield
    .return 43
    .pcc_end_yield
  .end

  # when invoking twice:
  foo() # returns 42
  foo() # returns 43

Short yield statements ^

  short_yield_stat:
    ".yield" parenthesized_args

Example short yield statement

Again, the short version is identical to the short version of the return statement as well.

  .yield("hello", 42)

Tail calls ^

  tail_call:
    ".return" short_sub_call

Example tail call

  .return foo()

Returns the return values from foo. This is implemented by a tail call, which is more efficient than:

  .local pmc results = foo()
  .return(results)

The call to foo can be considered a normal function call with respect to parameters: it can take the exact same format using argument flags. The tail call can also be a method call, like so:

  .return obj.'foo'()

Symbol namespaces ^

  open_namespace:
    ".namespace" identifier

  close_namespace:
    ".endnamespace" identifier

Example open/close namespaces

  .sub main
    .local int x
    x = 42
    say x
    .namespace NESTED
    .local int x
    x = 43
    say x
    .endnamespace NESTED
    say x
  .end

Will print:

  42
  43
  42

Please note that it is not necessary to pair these statements; it is acceptable to open a .namespace without closing it. The scope of the .namespace is limited to the subroutine.

Emit blocks ^

  emit:
    ".emit" nl
    labeled_pasm_instr*
    ".eom"

Example Emit block

An emit block only allows PASM instructions, not PIR instructions.

  .emit
     set I0, 10
     new P0, .Integer
     ret
   _foo:
     print "This is PASM subroutine "foo"
     ret
  .eom

Expansions ^

  expansion:
      macro_def
    | include
    | pasm_constant


  include:
    ".include" string_constant

  pasm_constant:
    ".constant" identifier [ constant_value | register ]

Macros ^

  macro_def:
    ".macro" identifier macro_parameters? nl
    macro_body

  macro_parameters:
    "(" id_list? ")"

  macro_body:
    <labeled_pir_instr>*
    ".endm" nl

  macro_invocation:
    macro_id parenthesized_args?

Note that before a macro body will be parsed, some grammar rules will be changed. In a macro body, local variable declarations are done using the .macro_local directive. TODO: decide on keyword for this.

The .label directive is available for declaring unique labels.

  macro_label:
    ".label" "$"identifier":"

Example Macros

When the following macro is defined:

  .macro add2(n)
    inc .n
    inc .n
  .endm

then one can write in a subroutine:

  .sub foo
    .local int myNum
    myNum = 42
    .add2(myNum)
    print myNum  # prints 44
  .end

PIR Pragmas ^

  pragma:
      new_operators
    | loadlib
    | namespace
    | hll_mapping
    | hll_specifier
    | source_info

  new_operators:
    ".pragma" "n_operators" int_constant

  loadlib:
    ".loadlib" string_constant

  namespace:
    ".namespace" [ "[" namespace_id "]" ]?

  hll_specifier:
    ".HLL" string_constant "," string_constant

  hll_mapping:
    ".HLL_map" string_constant "," string_constant

  namespace_id:
    string_constant [ ";" string_constant ]*

  source_info:
    ".line" int_constant [ "," string_constant ]?

  id_list:
    identifier [ "," identifier ]*

Examples pragmas

  .include "myLib.pir"

includes the source from the file "myLib.pir" at the point of this directive.

  .pragma n_operators 1

makes Parrot automatically create new PMCs when using arithmetic operators, like:

  $P1 = new 'Integer'
  $P2 = new 'Integer'
  $P1 = 42
  $P2 = 43
  $P0 = $P1 * $P2
  # now, $P0 is automatically assigned a newly created PMC.


  .line 100
  .line 100, "myfile.pir"

NOTE: currently, the line directive is implemented in IMCC as #line. See the PROPOSALS document for more information on this.

  .namespace ['Foo'] # namespace Foo

  .namespace ['Object';'Foo'] # nested namespace

  .namespace # no [ id ] means the root namespace is activated

The first line opens the namespace 'Foo'. When doing Object Oriented programming, this would indicate that sub or method definitions belong to the class 'Foo'. Of course, you can also define namespaces without doing OO-programming.

Please note that this .namespace directive is different from the .namespace directive that is used within subroutines.

  .HLL "Lua", "lua_group"

is an example of specifying the High Level Language (HLL) for which the PIR is being generated. It is a shortcut for setting the namespace to 'Lua', and for loading the PMCs in the lua_group library.

  .HLL_map "Integer", "LuaNumber"

is a way of telling Parrot, that whenever an Integer is created somewhere in the system (C code), instead a LuaNumber object is created.

  .loadlib "myLib"

is a shortcut for telling Parrot that the library "myLib" should be loaded when running the program. In fact, it is a shortcut for:

  .sub _load :load :anon
    loadlib "myLib"
  .end

TODO: check flags and syntax for this.

Tokens, types and targets ^

  string_constant:
    [ encoding_specifier? charset_specifier ]?  quoted_string

  encoding_specifier:
    "utf8:"

  charset_specifier:
      "ascii:"
    | "binary:"
    | "unicode:"
    | "iso-8859-1:"

  type:
      "int"
    | "num"
    | "pmc"
    | "string"

  target:
    identifier | register

Notes on Tokens, types and targets

A string constant can be written like:

  "Hello world"

but if desirable, the character set can be specified:

  unicode:"Hello world"

When using the "unicode" character set, one can also specify an encoding specifier; currently only utf8 is allowed:

  utf8:unicode:"hello world"

IMCC currently allows identifiers to be used as types. During the parse, the identifier is checked whether it is a defined class. The built-in types int, num, pmc and string are always available.

A target is something that can be assigned to, it is an L-value (but of course may be read just like an R-value). It is either an identifier or a register.

AUTHOR ^

Klaas-Jan Stol [parrotcode@gmail.com]

KNOWN ISSUES AND BUGS ^

Some work should be done on:

REFERENCES ^

CHANGES ^

0.2.0

0.1.4

0.1.3

0.1.2

0.1.1

0.1


parrot