Grammar Engine

The Parrot Grammar Engine (PGE) is a parser generator, one of the key components of the Parrot Compiler Toolkit. It reads grammar files written in the PGE rules format, and generates parser modules for the syntax specified by those rules. PGE rules provide the full power of recursive descent parsing and operator precedence parsing, but are comfortably useful even if you don't know anything about parsing theory. In the usual case, all you'll ever need to know is that rules are patterns for matching text.

Grammars

A grammar is a collection of rules, in much the same way that a class is a collection of methods. In fact, a grammar is just a special kind of class, and a rule is just a special kind of method. Each rule defines a pattern for matching one piece of text. The basic matching syntax for rules is likely to be somewhat familiar to anyone who has worked with regular expressions.

  rule id { \d+ }

Larger rules are composed from smaller rules:

  rule record { <id> <name> <phone> }

And a grammar holds a group of rules that work together:

  grammar Contacts;

  rule name { 'John' | 'Bob' | 'Fred' }

  rule id   { \d+ }

  rule record { <id> <name> }

  ...

Rules and Tokens

There are two different kinds of rules, each declared with a different keyword: rule or token. A rule does smart whitespace matching between the various pieces of the pattern. So, the record rule above would match "6355 John" or "6355    John" (with any amount of whitespace between the two pieces) but not "6355John".

A token only matches whitespace if you specifically request it. To get the same effect with a token, you'd have to add a \s pattern to the rule:

  token record { <id> \s+ <name> }

The Start Rule

A recursive descent parser is a top-down parser. This means it starts at the highest-level rule and works its way down through individual rules to match an entire string or file. In PGE, this top-level rule is called TOP by convention. (In "real" Perl 6, the top-level rule can be called anything. However, some of the PCT compiler tools expect it to be called TOP, so compilers written with PCT should just use the name TOP for all top-level rules.)

  rule TOP { <record> }

This rule matches a single record in a string or file. Once the parser has matched the entire string or file passed to the start rule, the parse is considered successful and a raw parse tree is returned.

Testing a Grammar

You might want to try out the examples in this chapter as you read along. To compile the following simple example, save it to a file called Contacts.pg:

  grammar Contacts is PGE::Grammar;
  rule  TOP    { <record> }
  rule  record { <id> <name> }
  token name   { 'John' | 'Bob' | 'Fred' }
  token id     { \d+ }

Then compile the grammar with the following command:

  $ parrot Perl6Grammar.pbc --output=Contacts.pir Contacts.pg

The path to parrot and to the Perl6Grammar.pbc file will vary on different systems. If you compiled Parrot from source, it will be:

  $ ./parrot runtime/parrot/library/PGE/Perl6Grammar.pbc \
        --output=Contacts.pir Contacts.pg

Next, create a small PIR script to run your grammar. You can save it as grammar_test.pir.

  .sub main :main
      load_bytecode 'PGE.pbc'        # load some required modules
      load_bytecode 'dumper.pbc'
      load_bytecode 'PGE/Dumper.pbc'

      load_bytecode 'Contacts.pir'   # load your grammar

      .local string source
      .local pmc top, grammar, match

      source = "3 John"
      top = get_hll_global ['Contacts'], 'TOP'
      grammar = get_class 'Contacts'
      match = top(source, 'grammar' => grammar)
      _dumper(match, "match")
  .end

Now, you can run the test script with this command:

  $ parrot grammar_test.pir

It will print out a text representation of the raw parse tree stored in the match variable:

  "match" => PMC 'Contacts' => "3 John" @ 0 {
      <record> => PMC 'Contacts' => "3 John" @ 0 {
          <id> => PMC 'Contacts' => "3" @ 0
          <name> => PMC 'Contacts' => "John" @ 2
      }
  }

Each rule in the grammar corresponds to a node in the tree. This output shows that the top-level match variable contains one child named "record", that "record" contains two children named "id" and "name", that "id" contains the number 3, and that "name" contains the string "John". This is exactly what we would expect from our simple grammar.

Rule Syntax

Every language has a set of basic components (words or parts of words) and syntax conventions for combining them. The "words" in rules are literal characters or symbols, some metacharacters (or metasymbols), and escape sequences, while the combining syntax includes other metacharacters, quantifiers, bracketing characters, and assertions.

Metacharacters

The . matches any single character, even a newline character. The ^ and $ metacharacters are zero-width matches on the beginning and end of a string. They each have doubled alternates ^^ and $$ that match at the beginning and end of every line within a string.
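
For example (a sketch using these anchors):

  rule all_digits { ^ \d+ $ }     # anchored to the start and end of the whole string
  rule full_line  { ^^ \w+ $$ }   # anchored to the start and end of a single line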

The |, &, \, #, and := metacharacters are all syntax structure elements. The | is an alternation between two options. The & matches two patterns simultaneously (the patterns must be the same length). The \ turns literal characters into metacharacters (the escape sequences). The # marks a comment to the end of the line. You can start a comment at any point on any line in a rule. The := binds a hypothetical variable to the result of a subrule or grouped pattern. Hypotheticals are covered in "Hypothetical Variables" later in this chapter.
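
A quick sketch of the first two:

  rule answer { 'yes' | 'no' }    # | matches one alternative or the other
  rule both   { \w+ & 'dent' }    # & requires the same text to match both patterns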

The metacharacters (), [], {} and <> are bracketing pairs. The pairs always have to be balanced within the rule, unless they are literal characters (escaped with a \). The () and [] pairs group patterns to match as a single atom. They're often used to capture a result, mark the boundaries of an alternation, or mark a group of patterns with a quantifier. Parentheses () are capturing and square brackets [] are non-capturing. The {} brackets define a section of code (a closure) within a rule. These closures are always a successful zero-width match. The <...> brackets mark assertions, which handle a variety of constructs including character classes and user-defined quantifiers. Assertions are covered in Assertions later in this chapter.
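
For example (a sketch combining the bracketing pairs):

  rule date   { (\d+) ['-' (\d+)]? }              # () captures; [] only groups
  rule traced { (\d+) { print "saw digits\n" } }  # {} closure: zero-width, always succeeds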

Table 7-2 summarizes the basic set of metacharacters.

Escape Sequences

The escape sequences are literal characters acting as metacharacters, marked with the \ escape. Some escape sequences represent single characters that are difficult to represent literally, like \t for tab, or \x[...] for a character specified by a hexadecimal number. Some represent limited character classes, like \d for digits or \w for word characters. Some represent zero-width positions in a match, like \b for a word boundary. With all the escape sequences that use brackets, (), {}, and <> work in place of [].
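
For example (a sketch; \x[263A] is a Unicode smiling face):

  token entry  { \w+ \t \d+ }    # word characters, a literal tab, then digits
  token smiley { \x[263A] }      # hex escape; \x(263A) and \x<263A> work as well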

Note that since an ordinary variable now interpolates as a literal string by default, the \Q escape sequence is rarely needed.

Table 7-3 shows the escape sequences for rules.

Quantifiers

Quantifiers specify the number of times an atom (a single character, metacharacter, escape sequence, grouped pattern, assertion, etc.) will match.

The numeric quantifiers use assertion syntax. A single number (<3>) requires exactly that many matches. A numeric range quantifier (<3..5>) succeeds if the number of matches is between the minimum and maximum numbers. A range with three trailing dots (<2...>) is shorthand for <n..Inf> and matches as many times as possible.

Each quantifier has a minimal alternate form, marked with a trailing ?, that matches the shortest possible sequence first.
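
A few sketches, assuming the quantifier assertion attaches directly after the atom it modifies:

  token area    { \d<3> }      # exactly three digits
  token part    { \w<3..5> }   # three to five word characters
  token rest    { \d<2...> }   # two or more digits, shorthand for <2..Inf>
  token minimal { \d+? }       # minimal form: as few digits as possible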

Table 7-4 shows the built-in quantifiers.

Assertions

In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Many different constructs with many different purposes use assertion syntax.

Assertions match named and anonymous rules, arrays or hashes containing anonymous rules, and subroutines or closures that return anonymous rules. You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules. A bare scalar in a pattern interpolates as a literal string, while a scalar variable in assertion brackets interpolates as an anonymous rule. A bare array in a pattern matches as a series of alternate literal strings, while an array in assertion brackets interpolates as a series of alternate anonymous rules. In the simplest case, a bare hash in a pattern matches a word (\w+) and tries to find that word as one of its keys (the effect is much as if it matched the keys as a series of alternates, except that you're guaranteed to match the longest possible key, instead of just the first one it hits in random order). A hash in assertion brackets does the same, but then also matches the associated value as an anonymous rule.

A bare closure in a pattern always matches (unless it calls fail), but a closure in assertion brackets <{...}> must return an anonymous rule, which is immediately matched.

An assertion with parentheses <(...)> is similar to a bare closure in a pattern in that it allows you to include straight Perl code within a rule. The difference is that <(...)> evaluates the return value of the closure in boolean context. The match succeeds if the return value is true and fails if the return value is false.
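
For example (a sketch; the closure body is ordinary Perl code evaluated in boolean context):

  rule even_id { (\d+) <( $1 % 2 == 0 )> }   # succeeds only when the digits form an even number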

Assertions match character classes, both named and enumerated. A named rule character class is often more accurate than an enumerated character class. For example, <[a-zA-Z]> is commonly used to match alphabetic characters, but generally what's really needed is the built-in rule <alpha> which matches the full set of Unicode alphabetic characters.
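
For example:

  token ident { <alpha> \w* }       # built-in rule: full Unicode alphabetics
  token hex   { <[0-9a-fA-F]>+ }    # enumerated character class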

Table 7-5 shows the syntax for assertions.

Modifiers

Modifiers alter the meaning of the pattern syntax. The standard position for modifiers is at the beginning of the rule, right after the m, s, or rx, or after the name in a named rule. Modifiers cannot attach to the outside of a bare /.../. For example:

  m:i/marvin/ # case insensitive
  rule names :i { marvin | ford | arthur }

The single-character modifiers can be grouped, but the others must be separated by a colon:

  m:wig/ zaphod /                        # OK
  m:words:ignorecase:globally / zaphod / # OK
  m:wordsignorecaseglobally / zaphod /   # Not OK

Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:

  m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive

The repetition modifiers (:Nx, :Nth, :once, :globally, and :exhaustive) and the continue modifier (:cont) can't be lexically scoped, because they alter the return value of the entire rule.

The :Nx modifier matches the rule a counted number of times. If the modifier expects more matches than the string has, the match fails. It has an alternate form :x(N) that can take a variable in place of the number.

The :once modifier on a rule only allows it to match once. The rule will not match again until you call the .reset method on the rule object.

The :globally modifier matches as many times as possible. The :exhaustive modifier also matches as many times as possible, but in as many different ways as possible.

The :Nth modifier preserves one result from a particular counted match. If the rule matches fewer times than the modifier expects, the match fails. It has several alternate forms. One form--:th(N)--can take a variable in place of the number. The other forms--:Nst, :Nnd, and :Nrd--are for cases where it's more natural to write :1st, :2nd, :3rd than it is to write :1th, :2th, :3th. Either way is valid, so pick the one that's most comfortable for you.

By default, rules ignore literal whitespace within the pattern. The :w modifier makes rules sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit \s+ when it separates two identifiers and \s* everywhere else.
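
For example, under :w the whitespace between the identifiers below acts like an explicit \s+:

  m:w/ just another hacker /   # same as m/ just \s+ another \s+ hacker /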

There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That's why the "beginning of string" and "end of string" metasymbols have "beginning of line" and "end of line" counterparts.

Table 7-6 shows the current list of modifiers.

Built-in Rules

A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but Table 7-7 shows a few you're likely to see.

The <null> rule matches a zero-width string (so it's always true) and <prior> matches whatever the most recent successful rule matched. These replace the two behaviors of the Perl 5 null pattern //, which is no longer valid syntax for rules.
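
For example (a sketch):

  rule number { ['+' | '-' | <null>] \d+ }   # <null> makes the sign effectively optional
  rule echoed { <name> <prior> }             # <prior> rematches whatever <name> just matched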

Backtracking Control

Backtracking is triggered whenever part of the pattern fails to match. You can also explicitly trigger backtracking by calling the fail function within a closure. Table 7-8 shows some metacharacters and built-in rules relevant to backtracking.

Calling Actions

Once the parser has matched the entire input (a source code file, or a line of input at the terminal in interactive mode), the parse is considered successful and the generated AST is delivered to the code generator for conversion into PIR.

We haven't covered actions yet, but it's still important now to talk about how we will call them when we are ready. We call an action by inserting the {*} token into a rule. When the parser encounters {*}, PGE calls the associated action method with the current match object as an argument. Let's take the persons_name rule from the Basic Rules section and sprinkle it liberally with action calls:

 rule persons_name {
    {*} <first_name> {*} <last_name> {*}
 }

The first call to the action method receives an empty match object because the parser hasn't had a chance to match anything yet. The second call contains only the first name of the match. The third and final call contains both the matched first and last name. Notice that if the match fails halfway through, we still call the actions for the parts that succeeded, but we do not call the actions after the failure. So, if we try to match the string "Leia", the first action is called before matching starts and the second is called after the first name matches. When the rule tries to match the last name, it fails because no last name is provided, and the third action method call is never made.

Alternations and Keys

In addition to sub-rules, groups, and quantifiers, we can also choose between either-or alternatives. The vertical bar token "|" distinguishes between options where only one may match. Here's an example:

 rule hero {
    ['Luke' | 'Leia'] 'Skywalker'
 }

This rule will match either "Luke Skywalker" or "Leia Skywalker", but won't match "Luke Leia Skywalker" (or anything else, for that matter). With constructs like alternations, if we want to call an action method it's helpful to distinguish which alternative matched:

 rule hero {
    [
      'Luke' {*}    #= Luke
    | 'Leia' {*}    #= Leia
    ]
    'Skywalker'
 }

This is the same rule, except now it passes two arguments to its action method: the match object and the name of the person who got matched.
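
A corresponding action method might then be sketched as follows (the NQP method body, and the use of PAST::Val here, are illustrative assumptions, not the canonical implementation):

 method hero($/, $key) {
     make PAST::Val.new( :value(~$/), :name($key) );
 }

The $key parameter receives the name given in the #= comment, so the method can decide what to do based on which alternative matched.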

Warning: Left Recursion

Getting into all the nitty-gritty theory behind parsers is well beyond the scope of this book. However, there is one potential pitfall that is not immediately obvious and that developers should be aware of. Like functions in ordinary procedural or functional languages, the rules in a PGE grammar can call themselves recursively. Consider the following rules, derived in part from the grammar for the C programming language:

 rule if_statement {
    'if' <condition> '{' <statement>* '}' <else_block>?
 }

 rule statement {
    <if_statement> | <expression>
 }

 rule else_block {
    'else' '{' <statement>* '}'
 }

Notice that an if_statement can contain a list of statements, and that each statement may itself be an if_statement. This is called recursion, and it's part of where the "recursive descent" algorithm gets its name.

Now, let's look at a more direct example of a comma-separated list of integer digits to form an array. We can define this recursively as follows:

 rule list {
     <list> ',' <digit> | <digit>
 }

The intention is that if there is only one digit, we match the second option in the alternation, and if there are more digits we can match them recursively in the first option. However, take a close look at the insidious result. The recursive descent parser enters the list rule. Its first option is to enter the list rule again, so it does. Recursive descent is a depth-first algorithm: it will continue to descend down a particular path until it finds a successful match or a match failure. In this case, it enters list, then enters list again, and again, and so on. What we have created is an infinite loop pattern called left recursion.

Left recursion occurs when the left-most item of the left-most alternative is a recursion into the same rule. The rule above can easily be fixed by writing:

 rule list {
    <digit> | <list> ',' <digit>
 }

Or even

 rule list {
    <digit> ',' <list> | <digit>
 }

Both of these options ensure that the left-most item in our rule is not a recursion, thereby preventing left recursion.

Here is a trickier example where the left recursion is hidden from view:

 rule term {
    <expression> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

This is a very limited subset of the mathematical expressions we might like to write, and even in this small subset we have the same problem: to match a term, the parser first tries to match an expression, which in turn matches a term and then an expression ...
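
One way to remove the hidden left recursion, sketched in the same style, is to require a <digit> before any recursion, so the parser always consumes input as it descends:

 rule term {
    <digit> ['*' <term>]?
 }

 rule expression {
    <term> ['+' <expression>]?
 }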

Left recursion is not the only problem you can run into with a recursive descent grammar, but it's one that's likely going to come up relatively often for new language designers, and one that is not always likely to generate useful error messages.

Operator Precedence Parser

Statements with lots of little tokens, and lots of possible options for a top-down parser to attempt, can be relatively inefficient to handle with PCT's recursive descent parser. Mathematical expressions in particular are very open-ended and have forms that are difficult to anticipate. Consider the expression:

 a + b * c + d

The recursive descent parser would have to recognize through significant trial and error how this statement should be parsed. For tasks like this, recursive descent parsers are not ideal; a type of bottom-up parser called an operator precedence parser is. Operator precedence parsers work similarly to more versatile bottom-up parsers such as those generated by yacc, but are optimized for use with expressions and equations. The "things" in an equation are split into two subtypes: terms and operators. Operators themselves are split into a number of types including prefix (-a), postfix (i++), infix (x + y), circumfix ([z]), postcircumfix (a[b]), and list (1, 2, 3). Each operator gets its own precedence number that specifies how closely it binds to its terms. In the example above, the expression is parsed as

 a + (b * c) + d

This is because the * operator has a higher precedence, and therefore binds more tightly, than the + operator.

To switch from the top-down recursive descent parser to the bottom-up operator precedence parser, a rule must be defined that is an optable:

 rule expression is optable { ... }

The ... ellipsis isn't an editorial shortcut here; it's the Perl 6 operator used to define a function signature. The ... indicates that this is just a signature, and that the actual guts will be filled in somewhere else. In this case, that "somewhere else" is the definition of the optable role.

Protofunction Definitions

Protofunctions are used to define operators in the optable in the same way that rules and tokens are used throughout the rest of the grammar. A proto is a way of saying that the rule is overridable dynamically, and that it might be defined somewhere else. In this case, PCT takes information from the proto declaration and fills in the details for us. On another note, this also means that the HLL itself can modify its own grammar at run time, by overriding the proto definitions for its operator table. Some languages call this process "operator overloading".

A proto is defined like this, taking some of our grammar rules above:

 'proto' <proto_name> [ 'is' <property> ] '{' '...' '}'

The name of the operator, listed as <proto_name> above, contains both a location part and an identifier part. The location is one of the places where the operator can be located, such as infix, postfix, prefix, circumfix, and postcircumfix. The name of the operator is the symbol used for the operator in any of the quotes that Perl 6 understands:

 proto infix:<+>                  # a + b
 proto postfix:'--'               # i--
 proto circumfix:«<>»             # <x>

The is keyword defines a property of the rule. Some examples of this are:

 is precedence(1)     # Specifies an exact precedence
 is equiv('+')        # Has the same precedence as the "+" operator
 is assoc('right')    # Right associative. May also be "left" or "list"
 is pirop('add')      # Operands are passed to the PIR operator "add"
 is subname('mySub')  # Operands are passed to the function "mySub"
 is pasttype('if')    # Operands are passed as children to an "if" PAST node in
                      # the parse tree
 is parsed(&myRule)   # The token is parsed and identified using the rule
                      # "myRule" from the top-down parser
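
Putting these properties together, a minimal two-operator table might be sketched like this (the precedence values and property choices here are illustrative, not canonical):

 rule expression is optable { ... }

 proto infix:<+> is precedence('1') is pirop('add') { ... }
 proto infix:<*> is precedence('2') is pirop('mul') { ... }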

Protofunction definitions are function signatures which can be overridden via multimethod dispatch. This means functions can be written with the same name as the rule to implement the behavior of the operator:

 rule infix:"+" { ... }

And in a PIR file for built-in functions:

 .sub 'infix:+'
    .param pmc a
    .param pmc b
    .local pmc c
    c = a + b
    .return(c)
 .end

The question to ask, then, is "Why have an is subname() property, if all operators can be defined as subroutines?" The answer is that the is subname() property allows PCT to call a subroutine with a different name than the operator. This is useful when the language already has a built-in function that duplicates the functionality of the operator; there's no sense duplicating that functionality.

The great thing about protos being overloadable is that you can specify different functions to call with different signatures:

 .sub 'infix:+' :multi('Integer', 'Integer')
    ...
 .end

 .sub 'infix:+' :multi('CLispRatio', 'Number')
    ...
 .end

 .sub 'infix:+' :multi('Perl6Double', 'PythonInteger')
    ...
 .end

This list can be a bit intimidating, and it's hard to imagine needing to write a new function to handle addition between every conceivable pair of operands. Fortunately, this isn't necessary, because these data types all support the standard VTABLE interfaces. Parrot already has basic arithmetic operations built in for most data types, so it's only necessary to override them for data types with special needs. This example is only a demonstration of the flexibility of the mechanism.

Hypothetical Variables

Hypothetical variables are a powerful way of building up data structures from within a match. Ordinary captures with () store the result of the captures in $1, $2, etc. The values stored in these variables will be kept if the match is successful, but thrown away if the match fails (hence the term "hypothetical"). The numbered capture variables are accessible outside the match, but only within the immediate surrounding lexical scope:

  "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;
  
  print $1; # prints Zaphod

You can also capture into any user-defined variable with the binding operator :=. These variables must already be defined in the lexical scope surrounding the rule:

  my $person;
  "Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
  print $person; # prints Zaphod

Repeated matches can be captured into an array:

  my @words;
  "feefifofum" ~~ / @words := (f<-[f]>+)* /;
  # @words contains ("fee", "fi", "fo", "fum")

Pairs of repeated matches can be captured into a hash:

  my %customers;
  $records ~~ m:w/ %customers := [ <id> = <name> \n]* /;

If you don't need the captured value outside the rule, use a $? variable instead. These are only directly accessible within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;

A match of a named rule stores the result in a $? variable with the same name as the rule. These variables are also accessible only within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ <name> \w+ $?name /;

When a rule matches a sequence of input tokens, an associated method in NQP is called to convert that match into an AST node. This node is then inserted into the parse tree.

Basic Rules

Let's start off with a simple rule:

 rule persons_name {
    <first_name> <last_name>
 }

We also define the two name tokens as:

 token first_name { <alpha>+ }
 token last_name { <alpha>+ }

The special token <alpha> is a built-in construct that matches only uppercase and lowercase letters. The "+" after the <alpha> tag is a short way of saying "one or more". Our persons_name rule would match something like "Darth Vader" (actually, it matches a lot of things that aren't people's names) but wouldn't match something like "C 3P0". Notice that the rule above would match "Jar Jar Binks", but not in the way you would expect: it would match the first "Jar" as <first_name> and the second "Jar" as <last_name>, and wouldn't match "Binks" at all.
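
Let's take this grammar a step further and match a whole list of names, each followed by a comma:

 rule TOP {
    [ <persons_name> ',' ]*
 }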

This example shows another new construct: square brackets. Square brackets group things together. The star at the end means that we match the things inside the brackets zero or more times; this is similar to the plus, except that the plus matches one or more times. Notice, however, that the rule above always matches a comma at the end, so we would need input like:

 Darth Vader, Luke Skywalker,

Instead of something more natural like:

 Darth Vader, Luke Skywalker

We can modify the rule a little bit so that it always ends with a name instead of a comma:

 rule TOP {
    [ <persons_name> ',' ]* <persons_name>
 }

Now we don't need a trailing comma, but at the same time we can't match an empty file because it always expects to have at least one name at the end. If we still want to match empty files successfully, we need to make the whole rule optional:

 rule TOP {
    [ [ <persons_name> ',' ]* <persons_name> ]?
 }

We've grouped the whole rule together in another set of brackets, and put a "?" question mark at the end. The question mark means zero or one of the prior item.

The symbols "*" (zero or more), "+" (one or more), and "?" (zero or one) are called quantifiers; they allow an item in the rule to match a variable number of times. These aren't the only quantifiers, but they are the most common. We will talk about other quantifiers later on.