Grammar Engine

The Parrot Grammar Engine (PGE) is a parser generator, one of the key components of the Parrot Compiler Toolkit. It reads grammar files written in the PGE rules format and generates parser modules written in PIR code. PGE rules provide the full power of recursive descent parsing and operator precedence parsing. Fortunately, you don't need to know what those terms mean in order to make good use of PGE. We'll introduce the necessary concepts as we talk about various features in this chapter.

Grammars

The ultimate goal of a parser is to match patterns in a source language and convert them to an internal data structure for later manipulations. As a programmer, you're probably already familiar with some of these types of patterns: function declarations, function calls, statements, and assignments. Each of these different concepts have a particular form called a syntax. In C for example, the syntax to define a function looks something like this:

  <return_type> <function_name> ( <arguments> ) { <function_body> }

Things that fit this pattern, so long as all the sub-patterns use the proper syntax also, are valid subroutines in C. Similarly, we can use a slightly different pattern to create a subroutine:

  sub <function_name> { <function_body> }

A grammar is a collection of rules like the ones above that specify all the acceptable patterns in a language. Grammars group together these rules in much the same way that a class groups together related data fields and methods In languages like Perl 6 for instance, a grammar is just a special kind of class and a rule is just a special kind of method.. Each rule defines a pattern for matching one unit of text, and can be made up of various other rules which are called recursively to make a complete match.

A rule can contain regular expressions to match patterns of characters:

  rule id { \d+ }

A rule can also contain patterns of references to other rules:

  rule record { <id> <name> <phone> }

A grammar contains a group of rules that work together to match the entire language:

  grammar Contacts;

  rule name { 'John' | 'Bob ' | 'Fred' }

  rule id   { \d+ }

  rule record { <id> <name> }

  ...

Rules and Tokens

There are two different kinds of rules: rule, which we saw above, and token. A rule performs smart whitespace matching between the various pieces of the pattern. The record rule given previously would match

    6355 John

or

    6355        John

but not

    6355John

A token matches whitespace only if you specifically request it. To get the same effect with a token, add the \s (match a space character) and + (match the preceding atom -- the space character, in this case -- one or more times) pattern to the rule:

  token record { <id> \s+ <name> }

The Start Rule

A recursive descent parser is what's called a top-down parser. It starts at the highest-level rule, called TOP, and works its way down through individual rules to match an entire string or file. Real Perl 6 allows any name for the top-level rule, but PCT expects a rule called TOP. If PCT was as fully-featured as Perl 6, people would use it instead! Here's an example of a TOP rule:

  rule TOP { <record> }

This rule matches a single record pattern in a string or file. Once the parser has succeeded in matching the entire string or file passed to the start rule, it returns a parse tree. If it cannot match the entire input with the rules provided, it can either return a partial match, or it can throw a parse error.

Testing a Grammar

Let's do a small example grammar. Save this example to a file called Contacts.pg:

  grammar Contacts is PGE::Grammar;

  rule  TOP    { <record> }
  rule  record { <id> <name> }
  token name   { 'John' | 'Bob ' | 'Fred' }
  token id     { \d+ }

Then compile the grammar:

  $ parrot Perl6Grammar.pbc --output=Contacts.pir Contacts.pg

Assume an installed Parrot for all examples? Anyone working from the source tree should be able to mangle paths appropriately.

The path to parrot and to the Perl6Grammar.pbc file will vary on different systems. If you compiled Parrot from source, it will be:

  $ ./parrot runtime/parrot/library/PGE/Perl6Grammar.pbc \
        --output=Contacts.pir Contacts.pg

Next, create a small PIR script to run your grammar. Save it as grammar_test.pir:

  .sub main :main
      load_bytecode 'PGE.pbc'        # load some required modules
      load_bytecode 'dumper.pbc'
      load_bytecode 'PGE/Dumper.pbc'

      load_bytecode 'Contacts.pir'   # load your grammar

      .local string source
      source  = "3 John"

      .local pmc top, grammar, match
      top     = get_hll_global ['Contacts'], 'TOP'
      grammar = get_class 'Contacts'
      match   = top(source, 'grammar' => grammar)

      _dumper(match, "match")
  .end

Run the test script:

  $ parrot grammar_test.pir

It will print out a text representation of the raw parse tree stored in the match variable:

  "match" => PMC 'Contacts' => "3 John" @ 0 {
      <record> => PMC 'Contacts' => "3 John" @ 0 {
          <id> => PMC 'Contacts' => "3" @ 0
          <name> => PMC 'Contacts' => "John" @ 2
      }
  }

Each node in the tree corresponds to a rule in the grammar. The top-level match variable contains one child named record, which contains two children named id and name. id contains the number 3, and name contains the string "John". This is exactly what the simple grammar should have matched.

Rule Syntax

Every language has a set of basic components (words or parts of words) and syntax conventions for combining them. The "words" in rules are literal characters or symbols, some metacharacters (or metasymbols), and escape sequences, while the combining syntax includes other metacharacters, quantifiers, bracketing characters, and assertions.

Metacharacters

The . metacharacter matches any single character, even a newline character. The ^ and $ metacharacters are zero-width matches which represent the beginning and end of a string. They each have doubled alternates ^^ and $$ that match at the beginning and end of every (newline-delimited) line within a string.

The |, &, \, #, and := metacharacters are all syntax structure elements. | alternates between two options. & matches two patterns simultaneously (the patterns must be the same length). \ turns literal characters into metacharacters (producing escape sequences). # starts a comment which proceeds until the end of the line. You can start a comment at any point on any line in a rule. := binds a hypothetical variable to the result of a subrule or grouped pattern (see "Hypothetical Variables").

The metacharacters (), [], {} and <> are bracketing pairs. Bracketing pairs must always be balanced within the rule; to use a literal character, escape it with a \. The () and [] pairs group patterns as a single atom. They often capture a result, mark the boundaries of an alternation, or mark a group of patterns with a quantifier. Parentheses () capture, but square brackets [] do not. The {} brackets define a section of code (a closure) within a rule. These closures are always a successful zero-width match. The <...> brackets mark assertions, which handle a variety of constructs including character classes and user-defined quantifiers (see Assertions).

Table 7-2 summarizes the basic metacharacters.

Escape Sequences

Escape sequences are literal characters acting as metacharacters. A preceding backslash (\) identifies them as escapes. Some escape sequences represent single characters that are difficult to represent literally, such as \t for tab, or \x[...] to specify a character by its hexadecimal number. Some represent limited character classes, such as \d for digits or \w for word characters. Some represent zero-width positions in a match, such as \b for a word boundary.

If you've used Perl 5 regexps, you may remember the \Q escape sequence which treats everything until the following \E sequence as literal text, containing no escape sequences. Because ordinary variables now interpolate as literal strings by default, the \Q escape sequence is rarely needed.

CHP-7-TABLE-3Table 7-3 shows the escape sequences for rules.

Quantifiers

Quantifiers specify the number of times an atom (a single character, metacharacter, escape sequence, grouped pattern, assertion, etc) will match.

The numeric quantifiers use assertion syntax. A single number (<3>) requires exactly that many matches. A numeric range quantifier (<3..5>) succeeds if the number of matches is between the minimum and maximum numbers, inclusive. A range with three trailing dots (<2...>) is shorthand for <n..Inf>; it matches as many times as possible.

Each quantifier has a minimal alternate form -- marked with a trailing ? -- which matches the shortest possible sequence first. That is, given the string aaaaaa, a<3..5> will match aaaaa and a<3..5>? will match aaa.

CHP-7-TABLE-4Table 7-4 shows the built-in quantifiers.

Assertions

An assertion states that some condition or state is true. The match fails when that assertion is false.

Assertions match named and anonymous rules, arrays or hashes containing anonymous rules, and subroutines or closures that return anonymous rules.

To interpolate a variable in assertion rules, enclose it in assertion delimiters. A bare scalar in a pattern interpolates as a literal string, while a scalar variable in assertion brackets interpolates as an anonymous rule. A bare array in a pattern matches as a series of alternate literal strings, while an array in assertion brackets interpolates as a series of alternate anonymous rules.

A bare hash in a pattern matches a word (\w+) if and only if that word is one of its keysThe effect is similar to matching the keys as a series of alternates, but it prefers to match the longest possible key, instead of the first potential match., while a hash in assertion brackets also matches the associated value as an anonymous rule.

A bare closure in a pattern always matches (unless it calls fail), but a closure in assertion brackets <{...}> must return an anonymous rule to match.

An assertion with parentheses <(...)> resembles a bare closure in a pattern in that it allows you to include Perl code within a rule. <(...)> evaluates the return value of the closure in boolean context. The match succeeds or fails based on that return value.

Assertions match character classes, both named and enumerated. A named rule character class is often more accurate than an enumerated character class. The common <[a-zA-Z]> idiom matches ASCII alphabetic characters, but the more comprehensive built-in rule <alpha> matches the full set of Unicode alphabetic characters.

CHP-7-TABLE-5Table 7-5 shows the syntax of assertions.

Modifiers

Modifiers alter the meaning of a pattern. The standard position for modifiers is at the beginning of the rule, right after the m, s, or rx, or after the name in a named rule. Modifiers cannot attach to the outside of a bare /.../. For example:

  m:i/marvin/ # case insensitive
  rule names :i { marvin | ford | arthur }

You may group single-character modifiers, but you must separate longer modifiers by colons:

  m:wig/ zaphod /                        # OK
  m:words:ignorecase:globally / zaphod / # OK
  m:wordsignorecaseglobally / zaphod /   # Not OK

Most modifiers can also appear inside the rule when attached to rule or grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so can alter subpatterns:

  m/:w I saw [:i zaphod] / # only 'zaphod' is case insensitive

The repetition modifiers (:Nx, :Nth, :once, :globally, and :exhaustive) and the continue modifier (:cont) alter the return value of the rule as a whole, so you cannot use them lexically inside a rule.

The :Nx modifier matches the rule a specific number of times. If the modifier expects more matches than the string has, the match fails. Its alternate form (:x(N)) can take a variable in place of the number.

The :once modifier on a rule only allows it to match once. The rule will not match again until the you call the .reset method on the rule object.

The :globally modifier matches as many times as possible. The :exhaustive modifier also matches as many times as possible, in as many different ways as possible.

The :Nth modifier preserves one result from a particular counted match. If the rule matches fewer times than the modifier expects, the match fails. It has several alternate forms. One form, :th(N), takes a variable in place of the number. The other forms -- :Nst, :Nnd, and :Nrd -- allow you to write more naturally :1st, :2nd, :3rd. The other way is valid as well; choose whichever is most comfortable.

By default, rules ignore literal whitespace within the pattern. The :w modifier makes rules sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit \s+ when it separates two identifiers and \s* everywhere else.

No modifiers exist to treat the matched string as a single line or multiple lines. Instead, use the "beginning of string" and "end of string" or "beginning of line" and "end of line" metacharacters.

CHP-7-TABLE-6Table 7-6 lists the available modifiers.

Built-in Rules

PGE provides several named rules, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but CHP-7-TABLE-7Table 7-7 shows a few you're likely to see.

The <null> rule matches a zero-width string (it always matches) and <prior> matches whatever the most recent successful rule matched. These replace the two behaviors of the Perl 5 null pattern //, which is no longer valid syntax for rules.

Backtracking Control

Whenever part of the pattern fails to match, PGE performs backtracking -- backing up to the previous point at which the match could succeed and trying again. You can explicitly trigger backtracking by calling the fail function within a closure. CHP-7-TABLE-8Table 7-8 displays metacharacters and built-in rules relevant to backtracking.

This could use an example.

Calling Actions

Once the parser has matched the entire input a source code file, or a line of input at the terminal in interactive mode the parse has succeeded. The generated AST is now available to the code generator for conversion into PIR.

Please review. The forward declaration is awkward here, but a little bit of explanation might ameliorate this.

This AST gets built up by actions -- code snippets attached to rules and tokens. To call an action, insert the {*} token into the rule. When PGE encounters {*}, it will call the associated action method with the current match object as an argument.

The best way to demonstrate this is by example. Sprinkle the persons_name rule liberally with action calls:

 rule persons_name {
    {*} <first_name> {*} <last_name> {*}
 }

The first call to the action method contains an empty match object because the parser hasn't matched anything yet. The second call contains only the first name of the match. The third and final call contains both the matched first and last name.

If the match fails halfway through, PGE will still call the actions that have succeeded; it will not call the actions after the failure. If you try to match the string "Leia", PGE will call the first two action methods. When the rule tries to match the last name, it fails, and PGE will not call the third action method.

Alternations and Keys

In addition to sub-rules, groups, and quantifiers, you can also express either-or alternations between options. The vertical bar token (|) distinguishes between options where only one may match:

 rule hero {
    ['Luke' | 'Leia'] 'Skywalker'
 }

This rule will match either "Luke Skywalker" or "Leia Skywalker" but won't match "Luke Leia Skywalker"nor anything else.. Given alternations and action methods, it's often important to distinguish which alternation matched:

 rule hero {
    [
      'Luke' {*}    #= Luke
    | 'Leia' {*}    #= Leia
    ]
    'Skywalker'
 }

This is the same rule, except now it passes two arguments to its action method: the match object and the name of the person who matched.

Warning: Left Recursion

If you've worked with parsers before, you may have seen this coming. If not, don't fear. Like functions in ordinary procedural or functional languages, the methods in the PGE parser grammar can call themselves recursively. Consider some rules derived in part from the grammar for the C programming language:

 rule if_statement {
    'if' <condition> '{' <statement>* '}' <else_block>?
 }

 rule statement {
    <if_statement> | <expression>
 }

 rule else_block {
    'else' '{' <statements>* '}'
 }

An if_statement can contain a list of statements, and that each statement may itself be an if_statement. This is recursion ; it's one of the reasons PGE is a "Recursive descent" parser.

Consider the more direct example of a comma-separated list of integer digits which form a list. A recursive definition might be:

 rule list {
     <list> ',' <digit> | <digit>
 }

If there is only one digit, the second option in the alternation matches. If there are multiple digits, recursion will match them through the first alternation.

That's the intention. The results are insidious.

The recursive descent parser enters the list rule. Its first option is to enter the list rule again, so it does. Recursive descent is a depth-first algorithm; PGE will continue to descend down a particular path until it finds a successful match or a match failure. In this case, it matches list, then it matches list again, then it matches list again, and so on. This rule forms an infinite loop -- a pattern called left recursion. The problem is that the left-most item of the left-most alternation is itself a recursion.

The rule above does not recurse infinitely when rewritten as:

 rule list {
    <digit> | <list> ',' <digit>
 }

... or even:

 rule list {
    <digit> ',' <list> | <digit>
 }

Both options ensure that the left-most item in the rule is recursive.

Left recursion may be trickier. It's not immediately obvious in this grammar:

 rule term {
    <expression> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

Even this common, limited subset of mathematical equations has the same problem. To match a term, the parser first tries to match an expression, which in turn matches a term and then an expression ....

Again, the solution is simple. Rewrite at least one of the rules so that the first condition it tries to match is not itself a recursive situation.

Operator Precedence Parser

Recursive descent parsing can be inefficient where statements have lots of little tokens and many possible options to match. For example, mathematical expressions are very open-ended, with many valid forms which are difficult to anticipate. Consider the expression:

 a + b * c + d

A recursive descent parser will undergo significant trial and error to parse this statement. Recursive descent parsing is not ideal for these situations. Instead, a type of bottom-up parser called an operator precedence parser is much better.

Is this a categorization of all opps or just PGE's opp?

Operator precedence parsers work similarly to more versatile bottom-up parsers such as Lex or Yacc, but are optimized for use with expressions and equations. Equations have two subtypes, terms and operators. Operators themselves have several subtypes, including prefix (-a), postfix (i++), infix (x + y), circumfix ([z]), postcircumfix (a[b]), and list (1, 2, 3). Each operator gets its own precedence number that specifies how closely it binds to the terms. The previous example should parse as:

 a + (b * c) + d

... because the * operator has a higher precedence -- binding more tightly to its terms -- than the + operator.

Within a grammar, switch from the top-down recursive descent parser to the bottom-up operator precedence parser with an optable rule:

 rule expression is optable { ... }

The ... ellipsis isn't an editorial shortcut, it's the Perl 6 operator to to define a function signature. The ... indicates that this is just a signature; the actual implementation is elsewhere. In this case, that location in the definition of the optable.

Protofunction Definitions

Protofunctions define operators in the optable in the same way that rules and tokens make up the grammar. A proto declares a rule, defined elsewhere, which other code may override dynamically. In this case, PCT takes information from the proto declaration and fills in the details. The "dynamic overriding" implies that a high-level language itself can modify its own grammar at run time, by overriding the proto definitions for its operator table. Some languages call this process operator overloading.

A proto definition resembles:

 'proto' <proto_name> [ 'is' <property> ] '{' '...' '}'

The name of the operator, noted as <proto_name>, contains both a location part and an identifier part. The location is the type of the operator, such as infix, postfix, prefix, circumfix, and postcircumfix. The name of the operator is the symbol used for the operator in any of the quotes that Perl 6 understands:

 proto infix:<+>                  # a + b
 proto postfix:'--'               # i--
 proto circumfix:«<>»             # <x>

The is keyword defines a property of the rule. Examples include:

 is precedence(1)     # Specifies an exact precedence
 is equiv('+')        # Has the same precedence as the "+" operator
 is assoc('right')    # Right associative. May also be "left" or "list"
 is pirop('add')      # Operands are passed to the PIR operator "and"
 is subname('mySub')  # Operands are passed to the function "mySub"
 is pasttype('if')    # Operands are passed as children to an "if" PAST node in
                      # the parse tree
 is parsed(&myRule)   # The token is parsed and identified using the rule
                      # "myRule" from the top-down parser

Please review.

Protofunction definitions are function signatures; you can override them with multimethod dispatch. This means that you can write functions with the same name as the rule to implement the behavior of the operator. Here's a proto:

 rule infix:"+" { ... }

... and its corresponding PIR rule:

 .sub 'infix:+'
    .param pmc a
    .param pmc b
    .local pmc c
    c = a + b
    .return(c)
 .end

You may ask "Why have an is subname() property, if you can define all operators as subroutines?" Using the is subname() property allows PCT to call a subroutine of a different name then the operator. This is a good idea if there is already a built-in function in the language that duplicates the functionality of the operator. There is no sense in duplicating behavior.

The great thing about protos being overloadable is that you can specify different functions to call with different signatures:

 .sub 'infix:+' :multi('Integer', 'Integer')
    #...
 .end

 .sub 'infix:+' :multi('CLispRatio', 'Number')
    #...
 .end

 .sub 'infix:+' :multi('Perl6Double', 'PythonInteger')
    #...
 .end

This list can be a bit intimidating, and it's hard to imagine that it would be necessary to write up a new function to handle addition between every conceivable pair of operands. Fortunately, this is rarely the case in Parrot, because all these data types support common the VTABLE interface. For most data types Parrot already has basic arithmetic operations built in, and it's only necessary to override for those data types with special needs.

Hypothetical Variables

Hypothetical variables are a powerful way of building up data structures from within a match. Ordinary captures with () store the result of the captures in $1, $2, etc. PGE stores values in these variables if the match is successful, but throws them away if the match fails. The numbered capture variables are accessible outside the match, but only within the immediate surrounding lexical scope:

  "Zaphod Beeblebrox" ~~ m:w/ (\w+) (\w+) /;

  print $1; # prints Zaphod

You can also capture into any user-defined variable with the binding operator := -- if you have declared these variables in a lexical scope enclosing the rule:

  my $person;
  "Zaphod's just this guy." ~~ / ^ $person := (\w+) /;
  print $person; # prints Zaphod

You may capture repeated matches into an array:

  my @words;
  "feefifofum" ~~ / @words := (f<-[f]>+)* /;
  # @words contains ("fee", "fi", "fo", "fum")

You may capture pairs of repeated matches into a hash:

  my %customers;
  $records ~~ m:w/ %customers := [ E<lt>idE<gt> = E<lt>nameE<gt> \n]* /;

If you don't need the captured value outside the rule, use a $? variable instead. These are only directly accessible within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ $?name := (\w+) \w+ $?name/;

A match of a named rule stores the result in a $? variable with the same name as the rule. These variables are also accessible only within the rule:

  "Zaphod saw Zaphod" ~~ m:w/ E<lt>nameE<gt> \w+ $?name /;

This next paragraph feels out of place; is there more?

When a rule matches a sequence of input tokens, PCT calls an associated method within NQP to convert that match into an AST node, which it inserts into the parse tree.

Basic Rules

Consider the simple example rule:

 rule persons_name {
    <first_name> <last_name>
 }

... and two example tokens:

 token first_name { <alpha>+ }
 token last_name  { <alpha>+ }

The special token <alpha> is a built-in construct that only accepts upper case and lower case letters. The + after the <alpha> tag is a short way of saying "one or more". The rule will match names like Darth VaderIt also matches many strings that aren't real names, but won't match something like C 3P0.

This rule will match Jar Jar Binks, but not as you might expect: way you would expect: It would match the first "Jar" as <first_name>, the second "Jar" as <last_name>, and ignore "Binks"You should ignore the whole thing..

The rest seems vestigial. An example like this should precede the rest of the chapter. There are forward references, but it's a decent overview for people who haven't used similar systems before -- if you avoid going out in the weeds.

this example shows another new construct, the square brackets. Square brackets are ways to group things together. The star at the end means that we take all the things inside the brackets zero or more times. This is similar to the plus, except the plus matches one or more times. Notice, however, that the above rule always matches a comma at the end, so we would need to have something like:

 Darth Vader, Luke Skywalker,

Instead of something more natural like:

 Darth Vader, Luke Skywalker

We can modify the rule a little bit so that it always ends with a name instead of a comma:

 rule TOP {
    [ <persons_name> ',' ]* <persons_name>
 }

Now we don't need a trailing comma, but at the same time we can't match an empty file because it always expects to have at least one name at the end. If we still want to match empty files successfully, we need to make the whole rule optional:

 rule TOP {
    [ [ <persons_name> ',' ]* <persons_name> ]?
 }

We've grouped the whole rule together in another set of brackets, and put a "?" question mark at the end. The question mark means zero or one of the prior item.

The symbols "*" (zero or more), "+" (one or more) and "?" are called quantifiers, and allow an item in the rule to match a variable number of times. These aren't the only quantifiers, but they are the most common. We will talk about other quantifiers later on.