DESCRIPTION

This is the fifth episode in a tutorial series on building a compiler with the Parrot Compiler Tools.

Episode 5: Variable Declaration and Scope

Episode 4 discussed the implementation of some statement types, such as the if-statement. In this episode we'll talk about variable declarations and scope handling. It's going to be a long story, so take your time to read this episode.

Globals, locals and default values

Squaak variables have one of two scopes: either they're global, or they're local. In order to create a global variable, you just assign some expression to an identifier (which hasn't been declared as a local). Local variables, on the other hand, must be declared using the "var" keyword. In other words, at any given point during the parsing phase, we have a list of variables that are known to be local variables. When an identifier is parsed, it is looked up and if found, its scope is set to local. If not, its scope is assumed to be global. When using an uninitialized variable, its value is set to an object called "Undef". Some examples are shown below.

 x = 42             # x was not declared, so it is global
 var k = 10         # k is local and initialized to 10
 a + b              # neither a nor b was declared;
                    # both default to the value "Undef"

Scoping and Symbol Tables

Earlier we mentioned the need to store declared local variables. In compiler jargon, such a data structure to store declarations is called a symbol table. For each individual scope, there's a separate symbol table.

Squaak has a so-called do-block statement, that is defined below.

 rule statement:sym<do> {
     <sym> <block> 'end'
 }

Each do-block defines a new scope; local variables declared between the do and end keywords are local to that block. An example to clarify this is shown below:

 do  var x = 1
     print(x)      # prints 1
     do
         var x = 2
         print(x)    # prints 2
     end
     print(x)      # prints 1
 end

So, each do/end pair defines a new scope, in which any declared variables hide variables with the same name in outer scopes. This behavior is common in many programming languages.

The PCT has built-in support for symbol tables; a PAST::Block object has a method symbol that can be used to enter new symbols and query the table for existing ones. In PCT, a PAST::Block object represents a scope. There are two blocktypes: immediate and declaration. An "immediate" block can be used to represent the blocks of statements in an do-block statement, for instance:

 do
     <block>
 end

When executing this statement, block is executed immediately. A "declaration" block, on the other hand, represents a block of statements that can be invoked at a later point, typically these are subroutines. So, in this example:

 sub foo(x)
     print(x)
 end

a PAST::Block object is created for the subroutine "foo". The blocktype is set to "declaration", as the subroutine is defined, not executed (immediately). For now you can forget about the blocktype, but now that I've told you, you'll recognize it when you see it. We'll come back to it in a later episode.

Implementing Scope

So, we know how to use global variables, declare local variables, and about PAST::Block objects representing scopes. How do we make our compiler to generate the right PIR instructions? After all, when handling a global variable, Parrot must handle this differently from handling a local variable. When creating PAST::Var nodes to represent the variables, we must know whether the variable is a local or a global variable. So, when handling variable declarations (of local variables; globals are not declared), we need to register the identifier as a local in the current block's symbol table. First, we'll take a look at the implementation of variable declarations.

Variable declaration

The following is the grammar rule for variable declarations. This is a type of statement, so I assume you know how to extend the statement rule to allow for variable declarations.

 rule statement:sym<var> {
     <sym> <identifier> ['=' <EXPR>]?
 }

A local variable is declared using the var keyword, and has an optional initialization expression. If the latter is missing, the variable's value defaults to the undefined value called "Undef". Let's see what the parse action looks like:

 method statement:sym<var>($/) {
     # get the PAST for the identifier
     my $past := $<identifier>.ast;

     # this is a local (it's being defined)
     $past.scope('lexical');

     # set a declaration flag
     $past.isdecl(1);

     # check for the initialization expression
     if $<EXPR> {
         # use the viviself clause to add a
         # an initialization expression
         $past.viviself($<EXPR>[0].ast);
     }
     else { # no initialization, default to "Undef"
         $past.viviself('Undef');
     }

     make $past;
 }

Well, that wasn't too hard, was it? Let's analyze what we just did. First we retrieved the PAST node for the identifier, which we then decorated by setting its scope to "lexical" (a local variable is said to be lexically scoped, hence "lexical"), and setting a flag indicating this node represents a declaration (isdecl). So, besides representing variables in other statements (for instance, assignments), a PAST::Var node is also used as a declaration statement.

Earlier in this episode we mentioned the need to register local variables in the current scope block when they are declared. So, when executing the parse action for variable-declaration, there should already be a PAST::Block node around, that can be used to register the symbol being declared. As we learned in Episode 4, PAST nodes are created in a depth-first fashion; the leafs are created first, and then the nodes "higher" in the parse tree. This implies that a PAST::Block node is created after the statement nodes (which variable_declaration is) that will be the children of the block. In the next section we'll see how to solve this problem.

Implementing a scope stack

In order to make sure that a PAST::Block node is created before any statements are parsed (and their parse actions are executed -- these might need to enter symbols in the block's symbol table), we add a few extra parse actions. Let's take a look at them.

Add this token to the grammar:

 token begin_TOP {
     <?>
 }

It uses something we haven't seen before, <?>. The null pattern <?> always returns true without consuming any text. Tokens consisting of only <?> are frequently used to invoke additional action methods.

Add this method to Actions.pm:

 method begin_TOP ($/) {
     our $?BLOCK := PAST::Block.new(:blocktype<declaration>, :node($/),
                                    :hll<squaak>);
     our @?BLOCK;
     @?BLOCK.unshift($?BLOCK);
 }

We create a new PAST::Block node and assign it to a strange-looking (if you don't know Perl, like me. Oh wait, this is Perl. Never mind..) variable called $?BLOCK. This variable is declared as "our", which means that it is a package variable. This means that the variable is shared by all methods in the same package (or class), and, equally important, the variable is still around after the parse action is done. Please refer to the Perl 6 specification for more semantics on "our". The variable $?BLOCK holds the current block.

After that, this block is unshifted onto another funny-looking variable, called @?BLOCK. This variable has a "@" sigil, meaning this is an array. The unshift method puts its argument on the front of the list. In a sense, you could think of the front of this list as the top of a stack. Later we'll see why this stack is necessary. This @?BLOCK variable is also declared with "our", meaning it's also package-scoped. Since it's an array variable, it is automatically initialized with an empty ResizablePMCArray.

Now we need to modify our TOP rule to call begin_TOP.

 rule TOP {
     <.begin_TOP>
     <statement_list>
     [ $ || <.panic: "Syntax error"> ]
 }

"<.begin_TOP>" is just like <begin_TOP>, calling the subrule begin_TOP, with one difference: The <.subrule> form does not capture. Normally, when match a subrule <foo>, $<foo> on the match object is bound to the subrule's match result. With <.foo>, $<foo> is not bound.

The parse action for begin_TOP is executed before any input is parsed, which is particularly suitable for any initialization actions you might need. The action for TOP is executed after the whole input string is parsed. Now we can create a PAST::Block node before any statements are parsed, so that when we need the current block, it's there (somewhere, later we'll see where exactly). Let's take a look at the parse action for TOP.

 method TOP($/, $key) {
     our @?BLOCK;
     my $past := @?BLOCK.shift();
     $past.push($<statement_list>.ast);
     make $past;
 }

Let's take a quick look at the updated parse action for TOP, which is executed after the whole input string is parsed. The PAST::Block node is retrieved from @?BLOCK, which makes sense, as it was created in the first part of the method and unshifted on @?BLOCK. Now this node can be used as the final result object of TOP. So, now we've seen how to use the scope stack, let's have a look at its implementation.

Storing Symbols

Now, we set up the necessary infrastructure to store the current scope block, and we created a datastructure that acts as a scope stack, which we will need later. We'll now go back to the parse action for statement:sym<var>, because we didn't enter the declared variable into the current block's symbol table yet. We'll see how to do that now. First, we need to make the current block accessible from the method statement:sym<var>. We've already seen how to do that, using the "our" keyword. It doesn't really matter where in the action method we enter the symbol's name into the symbol table, but let's do it at the end, after the initialization stuff. Naturally, we're only going to enter the symbol if it's not there already; duplicate variable declarations (in the same scope) should result in an error message (using the panic method of the match object). The code to be added to the method statement:sym<var> looks then like this:

 method statement:sym<var>($/) {
     our $?BLOCK;
     # get the PAST node for identifier
     # set the scope and declaration flag
     # do the initialization stuff
     # cache the name into a local variable
     my $name := $past.name();

     if $?BLOCK.symbol( $name ) {
         # symbol is already present
         $/.CURSOR.panic("Error: symbol " ~ $name ~ " was already defined.\n");
     }
     else {
         $?BLOCK.symbol( $name, :scope('lexical') );
     }
     make $past;
 }

What's Next?

With this code in place, variable declarations are handled correctly. However, we didn't update the parse action for identifier, which creates the PAST::Var node and sets its scope; currently all identifiers' scope is set to package (which means it's a global variable). As we already covered a lot of material in this episode, we'll leave this for the next episode. In the next episode, we'll also cover subroutines, which is another important aspect of any programming language. Hope to catch you later!

Exercises

Solution to the exercise

Keeping the Current block up to date

Sometimes we need to access the current block's symbol table. In order to be able to do so, we need a reference to the "current block". We do this by declaring a package variable called $?BLOCK, declared with "our" (as opposed with "my"). This variable will always point to the "current" block. As blocks can nest, we use a "stack", on which newly created blocks are stored. Whenever a new block is created, we assign this to $?BLOCK, and store it onto the stack, so that the next time a new block is created, the "old" current block isn't lost. Whenever a scope is closed, we pop off the current block from the stack, and restore the previous "current" block.

Why unshift/shift and not push/pop?

When we're talking about stacks, it would seem logical to talk about stack operations such as "push" and "pop". Instead, we use the operations "unshift" and "shift". If you're not a Perl programmer (such as myself), these names might not make sense. However, it's pretty easy. Instead of pushing a new object onto the "top" of the stack, you unshift objects onto this stack. Just see it as an old school bus, with only one entrance (at the front of the bus). Pushing a new person means taking the first free seat when entering, while unshifting a new person means everybody moves (shifts) one place to the back, so the new person can sit in the front seat. You might think this is not as efficient (more stuff is moved around), but that's not really true (actually: I guess (and certainly hope) the shift and unshift operations are implemented more effectively than the bus metaphor; I don't know how it is implemented).

So why unshift/shift, and not push/pop? When restoring the previous "current block", we need to know exactly where it is (what position). It would be nice to be able to always refer to the "first passenger on the bus", instead of the last person. We know how to reference the first passenger (it's on seat no. 0 (it was designed by an IT guy)); we don't really know what is the seat no. of the last person: s/he might sit in the middle, or at the back.

I hope it's clear what I mean here... otherwise, have a look at the code, and try to figure out what's happening:

    # In src/Squaak/Grammar.pm
    token begin_block {
        <?>
    }

    rule block {
        <.begin_block>
        <statement>*
    }

    # In src/Squaak/Actions.pm
    method begin_block {
        our $?BLOCK;
        our @?BLOCK;
        $?BLOCK := PAST::Block.new(:blocktype('immediate'),
                                   :node($/));
        @?BLOCK.unshift($?BLOCK);
    }

    method block($/, $key) {
        our $?BLOCK;
        our @?BLOCK;
        my $past := @?BLOCK.shift();
        $?BLOCK  := @?BLOCK[0];

        for $<statement> {
            $past.push($_.ast);
        }
        make $past;
    }