Episode 7: Operators and Precedence

Up till now, we've implemented a great deal of the Squaak language. We've seen assignments, control-flow statements, variable declarations and scope, subroutines and invocation. Our expressions have been limited so far to singular values, such as string literals and integer constants. In this episode, we'll enhance Squaak so it can handle operators, so you can construct more complex expressions.

Operators, precedence and parse trees

We will first briefly introduce the problem with recursive-descent parsers (which parsers generated with NQP are) when parsing expressions. Consider the following mini-grammar, which is a very basic calculator.

 rule TOP {
     <expression>*
 }

 rule expression {
     <term>
 }

 rule term {
     <factor> [ <addop> <factor> ]*
 }

 token addop { '+' | '-' }

 rule factor {
     <value> [ <mulop> <value> ]*
 }

 token mulop { '*' | '/' | '%' }

 rule value{
     | <number>
     | '(' <expression> ')'
 }

This basic expression grammar implements operator precedence by taking advantage of the nature of a recursive-descent parser (if you haven't seen the word, google it). However, the big disadvantage of parsing expressions this way, is that the parse trees can become quite large. Perhaps more importantly, the parsing process is not very efficient. Let's take a look at some sample input. We won't show the parse trees as shown in Episode 2, but we'll just show an outline.

 input: 42 results in this parse tree:

 TOP
   expression
     term
       factor
         value
           number
             42

As you can see, the input of this single number will invoke 6 grammar rules

before parsing the actual digits. Not that bad, you might think.

 input: "1 + 2" results in this parse tree (we ignore the operator for now):

 TOP
   expression
     term
       factor
       | value
       |   number
       |     1
       factor
         value
           number
             2

Only a few more grammar rules are invoked, not really a problem either.

 input: "(1 + 2) * 3" results in this parse tree:

 TOP
   expression
     term
       factor
         value
         | expression
         |   term
         |   | factor
         |   |   value
         |   |     number
         |   |       1
         |   term
         |     factor
         |       value
         |         number
         |           2
         value
           number
             3

Right; 16 grammar rules just to parse this simple input. I'd call this slightly inefficient. The point is, implementing operator precedence using a recursive-descent parser is somewhat problematic, and given the fact there are better methods to parse expressions like these, not the way to go. Check out this nice explanation or google it.

Bottom-up parsing and stacks: operator tables

I would like to explain to you how bottom-up parsing works for expressions (or bottom-up parsers in general; Yacc/Bison are parser generators that generate bottom-up parsers for your grammar specification), taking operator precedence into account. However, it's been about 6 years that I did this in a CS class, and I don't remember the particular details. If you really want to know, check out the links at the end of the previous section. It's actually worth checking out. For now, I'll just assume you know what the problem is, so that I'll introduce the solution for NQP-based compilers immediately. At some point when parsing your input, you might encounter an expression. At this point, we'd like the parser to switch from top-down to bottom-up parsing. NQP-rx supports this, and is used as follows:

 <EXPR>

Of course, the optable must be populated with some operators that we need to be able to parse and it might be told what precedence and associativity they have. The easiest way to do this is by setting up precedence levels in an INIT block:

    INIT {
        Squaak::Grammar.O(':prec<t>, :assoc<left>', '%additive');
        Squaak::Grammar.O(':prec<u>, :assoc<lefT>', '%multiplicative');
    }

In this INIT block, we use the O method of the compiler to set up two precedence levels: one for operators like addition (named %additive), and one for operators like multiplication (named %multiplicative). Each of themhas a ":prec" value and an ":assoc" value. ":prec" determines the precedence. Lexicographically greater values indicate higher precedence, so %additive operators, with a precedence value of "t", have lower precedence than %multiplicative operators with a precedence value of "u".":assoc" defines the associativity of the operators. If @ is a left associative operator, then 1 @ 2 @ 3 is equivalent to (1 @ 2) @ 3. However, if @ is right associative, then 1 @ 2 @ 3 is equivalent to 1 @ (2 @ 3). There are other options for the associativity, but we'll discuss them as we come to them.

    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }

This defines the operator * (the infix: is a prefix that tells the operator parser that this operator is an infix operator; there are other types, such as prefix, postfix and others). As you can see, it uses the O rule to specify that it is part of the %multiplicative group of operators. The ":pirop" value specifies that the operator should compile to the mul PIR opcode.

Of course, the expression parser does not just parse operators, it must also parse the operands. So, how do we declare the most basic entity that represents an operand? It can be anything, from a basic integer-constant, a function call, or even a function definition (but adding two function definition doesn't really make sense, does it?). The operands are parsed in a recursive-descent fashion, so somewhere the parser must switch back from bottom-up (expression parsing) to top-down. This "switch-back" point is the proto token term. This is the reason why integer constants are parsed by the rule term:sym<integer_constant>, for example, in our grammar.

The term proto token is invoked every time a new operand is needed

Squaak Operators

We have defined the entry and exit point of the expression (bottom-up) parser, now it's time to add the operators. Let's have a look at Squaak's operators and their precedence. The operators are listed with decreasing precedence (so that high-precedence operators are listed at the top). (I'm not sure if this precedence table is common compared to other languages; some operators may have a different precedence w.r.t. other operators than you're used to. At least the mathematical operators are organized according to standard math rules).

 unary "-"
 unary "not"
 * / %
 + - ..
 < <= >= > != ==
 and
 or

(".." is the string concatenation operator). Besides defining an entry and exit point for the expression parser, you need to define precedence levels for your operators. Find the INIT block in Grammar.pm below the "## Operators" comment, and replace it with this:

    INIT {
        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
    }

Now, we need to define the actual operators:

    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }
    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
    token infix:symÂ«<Â» { <sym> <O('%relational, :pirop<islt>')> }
    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }
    token infix:sym<*> { <sym> <O('%multiplicative, :pirop<mul>')> }
    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }
    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }

Note that some operators are missing. See the exercises section for this.

Short-circuiting logical operators

Squaak has two logical operators: and and or; and results true if and only if both operands evaluate to true, while or results true if at least one of its operands evaluates to true. Both operands are short-circuited, which means that they don't evaluate both operands if that's unnecessary. For instance, if the first operand of the and operator evaluates to false, then there's no need to evaluate the second operand, as the final result of the and-expression cannot become true anymore (remember: both operands must evaluate to true).Let's think about how to implement this. When evaluating an and-expression, we first evaluate the first operand, and if it's true, only then does it make sense to evaluate the second operand. This behavior looks very much the same as an if-statement, doesn't it? In an if-statement, the first child is always evaluated, and if true, the second child (the then block) is evaluated (remember, the third child -- the else clause -- is optional). It would be great to be able to implement the and operator using a PAST::Op( :pasttype('if') ) node. Well, you can, using the ":pasttype" option! Here's how:

    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }

So what about the or operator? When evaluating an or-expression, the first operand is evaluated. If it evaluates to true, then there's no need to evaluate the second operand, as the result of the or-expression is already true! Only if the first operand evaluates to false, is it necessary to evaluate the second child. Mmmmm.... what we're saying here is, unless the first operand evaluates to true, evaluate the second child. Guess what pasttype you'd need for that!

Operators PAST types and PIR instructions

In the previous section, we introduced the pasttype clause that you can specify. This means that for that operator (for instance, the and operator we discussed), a PAST::Op( :pasttype('if') ) node is created. What happens if you don't specify a pasttype? In that case, the corresponding action method is called. Obviously, some languages have very exotic semantics for the + operator, but many languages just want to use Parrot's built-in add instruction. How do we achieve that?

Instead of adding a pasttype clause, specify a pirop clause. The pirop, or PIR operator, clause tells the code generator what operator should be generated. Instead of generating a subroutine invocation with the operands as arguments, it will generate the specified instruction with the operator's operands as arguments. Neat huh? Let's look at an example:

    token infix:sym<+> { <sym> <O('%additive, :pirop<add>')> }

This specifies to use the add instruction, which tells Parrot to create a new result object instead of changing one of the operands. PCT just emits the following for this:

 add $P12, $P10, $P11

which means that the PMCs in registers $P10 and $P11 are added, and assigned to a newly created PMC which is stored in register $P12.

To circumfix or not to circumfix

Squaak supports parenthesized expressions. Parentheses can be used to change the order of evaluation in an expression, just as you probably have seen in other languages. Besides infix, prefix and postfix operators, you can define circumfix operators, which is specified with the left and right delimiter. This is an ideal way to implement parenthesized expressions:

    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }

    # with the action method:
    method circumfix:sym<( )> { make $<EXPR>.ast; }

This rule and action method were generated for us when we ran mk_language_shell.pl; you don't need to add them to the grammar and actions yourself. Circumfix operators are treated as terms by the operator-precedence parser, so it will parse as we want it to automatically.

Expression parser's action method

For all grammar rules we introduced, we also introduced an action method that is invoked after the grammar rule was done matching. What about the action method for EXPR? Our Squaak::Actions class inherits that from HLL::Actions. We don't have to write one.

What's Next?

This episode covered the implementation of operators, which allows us to write complex expressions. By now, most of our language is implemented, except for one thing: aggregate data structures. This will be the topic of Episode 8. We will introduce the two aggregate data types: array and hashtables, and see how we can implement these. We'll also discuss what happens when we pass such aggregates as subroutine arguments, and the difference with the basic data types.

Exercises

Currently, Squaak only has grammar rules for integer and string constants, not floating point constants. Implement this grammar rule. A floating-point number consists of zero or more digits, followed by a dot and at least one digit, or, at least one digit followed by a dot and any number of digits. Examples are:

 42.0, 1., .0001.

There may be no whitespace between the individual digits and the dot. Make sure you understand the difference between a "rule" and a "token".

Hint: a floating-point constant should produce a value of type 'Float'.

Note: in Perl 6 regexes, when matching an alternation as in a proto rule, the alternative which matches the most of the string is supposed to match. However, NQP-rx does not yet implement this. As a work-around, NQP-rx specifies that the version of a proto regex with the longest name will match. Since the part of a floating-point constant before the decimal place is the same as an integer constant, unless the token for floating-point constants has a longer name than the token for integer-constants, the latter will match and a syntax error will result.

Implement the missing operators: (binary) "-", "<=", ">=", "==", "!=", "/", "%", "or"

References

docs/pct/pct_optable_guide.pod

Solution to the exercises

A floating-point number consists of zero or more digits, followed by a dot and at least one digit, or, at least one digit followed by a dot and any number of digits. Examples are: 42.0, 1., .0001. There may be no whitespace between the individual digits and the dot. Make sure you understand the difference between a rule and a token.

Hint: a floating-point constant should produce a value of type 'Float'.

    token term:sym<float_constant_long> { # longer to work around lack of LTM
        [
        | \d+ '.' \d*
        | \d* '.' \d+
        ]
    }

    # with action method:
    method term:sym<float_constant_long>($/) { # name worksaround lack of LTM
        make PAST::Val.new(:value(+$/), :returns<Float>);
    }

For sake of completeness (and easy copy-paste for you), here's the list of operator declarations as I wrote them for Squaak:

    INIT {
        Squaak::Grammar.O(':prec<w>, :assoc<unary>', '%unary-negate');
        Squaak::Grammar.O(':prec<v>, :assoc<unary>', '%unary-not');
        Squaak::Grammar.O(':prec<u>, :assoc<left>',  '%multiplicative');
        Squaak::Grammar.O(':prec<t>, :assoc<left>',  '%additive');
        Squaak::Grammar.O(':prec<s>, :assoc<left>',  '%relational');
        Squaak::Grammar.O(':prec<r>, :assoc<left>',  '%conjunction');
        Squaak::Grammar.O(':prec<q>, :assoc<left>',  '%disjunction');
    }

    token circumfix:sym<( )> { '(' <.ws> <EXPR> ')' }

    token prefix:sym<-> { <sym> <O('%unary-negate, :pirop<neg>')> }
    token prefix:sym<not> { <sym> <O('%unary-not, :pirop<isfalse>')> }

    token infix:sym<*>  { <sym> <O('%multiplicative, :pirop<mul>')> }
    token infix:sym<%>  { <sym> <O('%multiplicative, :pirop<mod>')> }
    token infix:sym</>  { <sym> <O('%multiplicative, :pirop<div>')> }

    token infix:sym<+>  { <sym> <O('%additive, :pirop<add>')> }
    token infix:sym<->  { <sym> <O('%additive, :pirop<sub>')> }
    token infix:sym<..> { <sym> <O('%additive, :pirop<concat>')> }

    token infix:symÂ«<Â» { <sym> <O('%relational, :pirop<isle iPP>')> }
    token infix:symÂ«<=Â» { <sym> <O('%relational, :pirop<islt iPP>')> }
    token infix:symÂ«>Â» { <sym> <O('%relational, :pirop<isgt iPP>')> }
    token infix:symÂ«>=Â» { <sym> <O('%relational, :pirop<isge iPP>')> }
    token infix:symÂ«==Â» { <sym> <O('%relational, :pirop<iseq iPP>')> }
    token infix:symÂ«!=Â» { <sym> <O('%relational, :pirop<isne iPP>')> }

    token infix:sym<and> { <sym> <O('%conjunction, :pasttype<if>')> }
    token infix:sym<or> { <sym> <O('%disjunction, :pasttype<unless>')> }