NAME

rx.ops - Parrot Regular Expression Engine, version 4.0

SYNOPSIS

  # NOTE: This looks a LOT scarier than it really is
  # "zzabbbcdcdcdzz" =~ /ab*[cd]+/

  # This represents an inlined, unoptimized regex

          set S0, "zzabbbcdcdzz"

  RX_0:
          set I0, 0
          set I1, 0
          branch start0
  advance:
          rx_advance S0, I0, fail
          set I1, I0
  start0:
          rx_literal S0, I1, "a", advance
  start1:
          rx_pushmark
  top1:
          rx_literal S0, I1, "b", start2
          rx_pushindex I1
          branch top1
  back1:
          rx_popindex I1, advance

  start2:
          rx_oneof S0, I1, "cd", back1
  top2:
          rx_oneof S0, I1, "cd", succeed
          branch top2

  succeed:
          print "Match"
          end
  fail:
          print "No Match"
          end

DESCRIPTION

The Perl 5 regular expression engine was state-of-the-art. It was the fastest and most featureful implementation available. Everybody used Perl 5's regular expression syntax wherever possible.

The Perl 5 regular expression engine was also a mess.

The engine was like a separate interpreter unto itself. Few understood its dark magic, and fewer worked on its baroque source. It was a black box, sealed off from the outside world with only a couple opcodes to show in other files. It was the slowest part of Perl to adapt to new features--it was one of the last to get threadsafety and full Unicode support--because so few people understood it. Larry Wall once said that three people understood the regex engine, give or take four.

Because of these issues, the design documents for Parrot called for regular expression opcodes to be built in to the interpreter. This group of opcodes, called the Parrot Regular Expression Engine version 4.0 (or simply Rx4), is the result.

Basic Concepts

Perl 5 had one opcode for each operation in the regular expression. For example:

        >perl -mre=debug -e '/ab+[cd]/'
        Compiling REx `ab+[cd]'
        size 15 first at 1
           1: EXACT <a>(3)
           3: PLUS(6)
           4:   EXACT <b>(0)
           6: ANYOF[cd](15)
          15: END(0)
        anchored `ab' at 0 floating `b' at 1..2147483647 (checking anchored) minlen 3
        Freeing REx: `ab+[cd]'

(The re pragma with the 'debug' switch displays the compiled version of the regex. The numbers in parenthesis represent where to jump to on success; 0 is a special value meaning "this part of the regex is done".)

In Rx4, that regular expression would be something like:

        advance:
                rx_advance S0, I0, fail
                set I1, I0
        start:
                rx_literal S0, I1, "ab", advance
                rx_pushmark
        top:
                rx_pushindex I1
                rx_literal S0, I1, "b", next
                branch top
        backtrack:
                rx_popindex I1, advance
        next:
                rx_oneof S0, I1, "cd", backtrack
                branch success

(In Rx4, the last parameter is a label to branch to on failure, not success.)

10 operations in Rx4 to 5 in Perl 5. I can already hear the cynicism: "how could that be BETTER?!?" Well, there's several reasons.

The first is that it frees us to use normal ops, and in fact they're used all the time. branch is a normal op; so is bsr, the normal way to call a subrule. Things like (?{CODE}) can be implemented with relative ease--simply put the normal opcodes in the appropriate place in the regex. If you're debugging a regex, you can simply sprinkle output messages liberally throughout the regex.

The second is opcode dispatch. Parrot has very fast opcode dispatch, and we can use that to our advantage.

Finally, there's the matter of optimizations. As an example, take /a+bc+/. The most efficient way to look for that is probably to look for the constant string 'abc' and expand outwards from there--especially if you use Boyer-Moore or another fast search algorithm. It means that the code generator can decide whether to optimize for success or failure, for compilation or execution speed. You get the idea.

Bottom line is, Rx4 lays out exactly what's going on. This is a feature. It gives the regex compiler total control over what's going on.

The Opcodes

There are two basic rules to how the opcodes operate.

The first rule is that the first argument to each opcode is the string we are matching against, and the second one is the current index in the string.

The second rule pertains to the ops that have an integer constant as their last parameter. For the most part, these ops will branch to that parameter if they 'fail'. For most ops, 'fail' means 'fail to match'.

If the documentation for an op doesn't specifically mention the first or last parameter, that's what they are.

The documentation for each opcode follows.

Preparation

rx_compile(out str, in str, in str): Provides a built-in regular expression compiler. The first parameter is set to the address of the newly-compiled regex, which can then be jsr'ed to; the second parameter is the regex itself; and the third parameter is the modifiers on the regex.; XXX Currently this op has not been implemented.

Stack manipulation ops

rx_pushindex(in int): Pushes the current index onto the stack contained in the info structure.
rx_initstack(): op rx_initstack() { CONTEXT(interpreter->ctx)->intstack = intstack_new(interpreter); goto NEXT(); }
rx_clearstack(): op rx_clearstack () { intstack_free(interpreter, CONTEXT(interpreter->ctx)->intstack); goto NEXT(); }; ########################################
rx_pushmark(): Pushes a 'mark' onto the stack contained in the info structure. Marks are used to indicate where one operation's backtrack information ends and another's begins.
rx_popindex(out int, labelconst int): Pops an index off the stack. If it pops a mark off instead, it branches to the second parameter.

Simple matching ops

This ops match a simple pattern against the string, and branch on their last argument when they fail.

rx_advance(in str, inout int, labelconst int): Increments the start index one character. Branches to the third parameter if it goes past the end of the string.; $2 is the current value of start_index.
rx_literal(in str, inout int, in str, labelconst int): Matches the exact string passed in the third parameter.; XXX Currently there is no way to do case-insensitive matches. The right way would involve either a specialized rx_literal_insensitive op, or some more sophisticated method.
rx_char(in str, inout int, in int, labelconst int): Matches the char $3 at the position $2 of string $1.
rx_is_w(in str, inout int, labelconst int): Matches a word character (usually \w).
rx_is_d(in str, inout int, labelconst int): Matches a number character (usually \d).
rx_is_s(in str, inout int, labelconst int): Matches a whitespace character (usually \s).
rx_oneof(in str, inout int, in str, labelconst int): Matches if the current character is one of the characters in the third parameter.; This op requires that its input be sorted for efficiency. Further, it requires that all ranges (a-z) be expanded by the regex compiler.
rx_oneof_bmp(in str, inout int, in pmc, labelconst int): This op has the exact same behavior as rx_oneof, except that the third parameter is a Pointer to a bitmap generated by rx_makebmp.
rx_dot(in str, inout int, labelconst int): Matches any character. This currently works exactly like rx_advance, but we leave it here in case they have to diverge in the future.
rx_zwa_boundary(in str, in int, labelconst int): Matches if the one of the previous character and the next character is a word character, and the other one is not (usually \b).
rx_zwa_atend(in str, in int, labelconst int): Matches at the end of the string.

Optimization ops: searching at the beginning of the string.

This ops combine rx_advance with other ops, with the goal of increasing the code density, and saving a few repeated computations.

This ops will not increment the value of start_index if the pattern matches at the current position of the string. So when the overall match fails, you\'ll have to increment start_index by hand before calling them.

Additionally, this op can be used to optimize matches of the type:

        .*? <pattern>  # non-greedy repetition of dot, followed by
                       # a known pattern

rx_search(in str, out int, inout int, in str, labelconst int): Searches for the literal $4 on the string $1 starting at $3. Sets $2 to the current index in the string (after the literal), and $3 to start_index.; Branches to $5 if the literal is not found.
rx_search_char (in str, inout int, inout int, in int, labelconst int): Searches for the char $4 on the string $1 starting at $3. Sets $2 to the current index in the string (after the char); Branches to $5 if the char is not found.; The char is expressed as an integer representing its codepoint.

Optimization ops: combining a simple match with a greedy star.

rx_literal_all(in str, inout int, in str): Matches greedily the repetition of the literal passed in the third parameter.; It never fails, and it doesn't save the intermediate points in the stack.; If you need to backtrack over rx_literal_all, you should manage it manually:; This is a bit clumsy, but saves a bunch of stack pushing.
rx_char_all(in str, inout int, in int): Matches the greedy repetition of char $3 at the position $2 of string $1.; It never fails, and it doesn't push intermediate positions in the stack, so if you need backtracking you'll have to manage it manually.
rx_oneof_bmp_all (in str, inout int, in pmc): Matches the greedy repetition of rx_one_of_bmp. Like its cousins, it never fails and it doesn't push the intermediate position in the stack

Other ops

rx_makebmp(out pmc, in str): This op pre-generates bitmaps to be used with rx_oneof_bmp, increasing performance. The first parameter will be set to a Pointer to the bitmap; the second parameter is the string to be bitmapped.; Note that bitmaps are currently NOT compatible with characters above 255 (as defined by whatever character set you're using). This may change in the future.

Using the opcodes

Tutorial

Let's see how simple regexes using the Rx4 engine. This examples will show inlined regexes (i.e., regexes that appear on the middle of perl code, of the kind that was so popular in the old perl5 days).

We won't deal then with named regular expressions (also known as rules) and the conventions used to call them (we expect that some form of the standard calling conventions will be used).

First of all, let's explain the concept behind the Rx4 ops. During the life-time of a match, we keep the state of the match in concrete, well-known (by the compiler) registers. There are at least three registers needed to save the state of match, which we will call S0, I0, I1 (there is no particular reason to use this three registers, the compiler can choose any registers of the right type that are free during the match, but we will use this ones in all our examples).

The purpose of this registers is:

S0:: Saves the string we are matching against
I0:: Saves the "start-of-match" position. The position in the string where the first character of the pattern matched.
I1:: Saves the current position in the string.

As we will see, most of the rx opcodes read or modify at least one of this registers. Sometimes, the compiler can decide to use some other registers, to save temporary information about the match (like the position of the beginning of a group, for example).

Let's start with a really simple regex. Imagine that we want to compile the code

  if (/^foobar/) { print 1 };

Now, this can done in a very simple way. Assuming that we have managed to put the string contents of $_ into S0, the code would be:

  start:
          set I0, 0
          rx_literal S0, I0, "foobar", end
          print 1
  end:

Let's look at this instructions one at-a-time.

  start:
          ## we start matching from the beginning of the match
          set I0, 0

          ## match the string "foobar" at the current position, or branch to end
          rx_literal S0, I0, "foobar", end

          ## we have matched, so let's print 1
          print 1
  end:

This was a very simple kind of pattern. Imagine now that we wanted to search for "foobar" in any point of the string. We would need to add some form of loop that iterates over the characters of the string, looking for the match "foobar". This example shows how:

          set I0, 0
  start:
          set I1, I0
          rx_literal, S0, I1, "foobar", advance
          print "match"
          branch end

  advance:
          rx_advance S0, I0, fail
          branch start

  fail:
          print "no match"

  end:

This looks very complicated, but it is not. The match starts at the position 0. It tries to match "foobar" at this position, and if it fails, is branches to rx_advance, where it will increase the start-of-match position by one and try again. When there are no characters left, rx_advance branches to $fail.

This is the skeleton of how regexes work in general. The main loop iterates over I0 (start-of-match) while the inner ops update I1 (match-position).

For this particular case, Rx4 offers an specialized op that can be used to optimize the computation. rx_search merges the functionality of rx_advance and rx_literal in one op.

Using this new op, the code to match would look like:

          set I0, 0
  start:
          rx_search, S0, I1, I0, "foobar", fail
          print "match"
          branch end
  fail:
          print "nomatch"
  end:

This is very useful, since many matches start with a literal. Rx4 offers also rx_search_char which searches for a single char, and rx_search_bmp which searches for a character class.

From here it is mostly a matter of knowing how to implement the various forms of quantifiers and groups. The following section explains how the most common constructs can be implemented.

Common constructs

The list below gives simple templates for common quantifiers operations.

(This templates could be heavily optimized in the particular case that "x" is a literal. But that's not the point here.)

x*
x+
x?
x*?
x+?
x??
x|y|z
x{m,n}

XXX Finish this documentation.

BUGS

This code currently requires everything to be in an eight-bit encoding compatible with ASCII.

A lot of ops are missing, features such as $$, ^^, etc. are not implemented.

There is not a convenient way of doing case-insensitive matches.

The implementation of perl6 advanced features such as subrules, or hypothetical variables has not been resolved.

There are undoubtedly many more in code this complicated.

AUTHORS

Initial version by Brent Dax <brentdax@cpan.org>; special thanks to Angel Faus <afaus@corp.vlex.com> and Jeff 'japhy' Pinyan <japhy@pobox.com> for major help, especially with decisions on the architecture of the engine.

parrotcode: Parrot Regular Expression Engine, version 4.0
Contents \| Ops