parrotcode: Parrot Regular Expression Engine, version 4.0 | |
Contents | Ops |
rx.ops - Parrot Regular Expression Engine, version 4.0
# NOTE: This looks a LOT scarier than it really is
# "zzabbbcdcdcdzz" =~ /ab*[cd]+/
# This represents an inlined, unoptimized regex
set S0, "zzabbbcdcdzz"
RX_0:
set I0, 0
set I1, 0
branch start0
advance:
rx_advance S0, I0, fail
set I1, I0
start0:
rx_literal S0, I1, "a", advance
start1:
rx_pushmark
top1:
rx_literal S0, I1, "b", start2
rx_pushindex I1
branch top1
back1:
rx_popindex I1, advance
start2:
rx_oneof S0, I1, "cd", back1
top2:
rx_oneof S0, I1, "cd", succeed
branch top2
succeed:
print "Match"
end
fail:
print "No Match"
end
The Perl 5 regular expression engine was state-of-the-art. It was the fastest and most featureful implementation available. Everybody used Perl 5's regular expression syntax wherever possible.
The Perl 5 regular expression engine was also a mess.
The engine was like a separate interpreter unto itself. Few understood its dark magic, and fewer worked on its baroque source. It was a black box, sealed off from the outside world with only a couple opcodes to show in other files. It was the slowest part of Perl to adapt to new features--it was one of the last to get threadsafety and full Unicode support--because so few people understood it. Larry Wall once said that three people understood the regex engine, give or take four.
Because of these issues, the design documents for Parrot called for regular expression opcodes to be built in to the interpreter. This group of opcodes, called the Parrot Regular Expression Engine version 4.0 (or simply Rx4), is the result.
Perl 5 had one opcode for each operation in the regular expression. For example:
>perl -mre=debug -e '/ab+[cd]/'
Compiling REx `ab+[cd]'
size 15 first at 1
1: EXACT <a>(3)
3: PLUS(6)
4: EXACT <b>(0)
6: ANYOF[cd](15)
15: END(0)
anchored `ab' at 0 floating `b' at 1..2147483647 (checking anchored) minlen 3
Freeing REx: `ab+[cd]'
(The re
pragma with the 'debug' switch displays the compiled version of the regex. The numbers in parenthesis represent where to jump to on success; 0 is a special value meaning "this part of the regex is done".)
In Rx4, that regular expression would be something like:
advance:
rx_advance S0, I0, fail
set I1, I0
start:
rx_literal S0, I1, "ab", advance
rx_pushmark
top:
rx_pushindex I1
rx_literal S0, I1, "b", next
branch top
backtrack:
rx_popindex I1, advance
next:
rx_oneof S0, I1, "cd", backtrack
branch success
(In Rx4, the last parameter is a label to branch to on failure, not success.)
10 operations in Rx4 to 5 in Perl 5. I can already hear the cynicism: "how could that be BETTER?!?" Well, there's several reasons.
The first is that it frees us to use normal ops, and in fact they're used all the time. branch
is a normal op; so is bsr
, the normal way to call a subrule. Things like (?{CODE})
can be implemented with relative ease--simply put the normal opcodes in the appropriate place in the regex. If you're debugging a regex, you can simply sprinkle output messages liberally throughout the regex.
The second is opcode dispatch. Parrot has very fast opcode dispatch, and we can use that to our advantage.
Finally, there's the matter of optimizations. As an example, take /a+bc+/
. The most efficient way to look for that is probably to look for the constant string 'abc' and expand outwards from there--especially if you use Boyer-Moore or another fast search algorithm. It means that the code generator can decide whether to optimize for success or failure, for compilation or execution speed. You get the idea.
Bottom line is, Rx4 lays out exactly what's going on. This is a feature. It gives the regex compiler total control over what's going on.
There are two basic rules to how the opcodes operate.
The first rule is that the first argument to each opcode is the string we are matching against, and the second one is the current index in the string.
The second rule pertains to the ops that have an integer constant as their last parameter. For the most part, these ops will branch to that parameter if they 'fail'. For most ops, 'fail' means 'fail to match'.
If the documentation for an op doesn't specifically mention the first or last parameter, that's what they are.
The documentation for each opcode follows.
rx_compile
(out str, in str, in str)jsr
'ed to; the second parameter is the regex itself; and the third parameter is the modifiers on the regex.rx_pushindex
(in pmc)rx_pushmark
()rx_popindex
(out int, labelconst int)This ops match a simple pattern against the string, and branch on their last argument when they fail.
rx_advance
(in str, inout int, labelconst int)rx_literal
(in str, inout int, in str, labelconst int)rx_char
(in str, inout int, in int, labelconst int)rx_is_w
(in str, inout int, labelconst int)\w
).rx_is_d
(in str, inout int, labelconst int)\d
).rx_is_s
(in str, inout int, labelconst int)\s
).rx_oneof
(in str, inout int, in pmc, labelconst int)a-z
) be expanded by the regex compiler.rx_oneof_bmp
(in str, inout int, in pmc, labelconst int)rx_oneof
, except that the third parameter is a Pointer to a bitmap generated by rx_makebmp
.rx_dot
(in str, inout int, labelconst int)rx_zwa_boundary
(in str, in int, labelconst int)\b
).rx_zwa_atend
(in str, in int, labelconst int)This ops combine rx_advance with other ops, with the goal of increasing the code density, and saving a few repeated computations.
This ops will not increment the value of start_index if the pattern matches at the current position of the string. So when the overall match fails, you\'ll have to increment start_index by hand before calling them.
Additionally, this op can be used to optimize matches of the type:
.*? <pattern> # non-greedy repetition of dot, followed by
# a known pattern
rx_search
(in str, out int, inout int, in str, labelconst in)rx_search_char
(in str, out int, inout int, in str, labelconst in)rx_literal_all
(in str, inout int, in str) set I2, I1 # save the start point
rx_literal_all S0, I1, "lit" # lit *
set I3, I1
again: .. do stuff ..
backtrack_rx_literal:
lte I3, I2, fail
dec I1, I3, 3 # 3 is the length of "lit"
fail:
# lit* has failed
rx_char_all
(in str, inout int, in int)rx_oneof_bmp_all
(in str, inout in, in pmc, inconst int)rx_makebmp
(out pmc, in str)rx_oneof_bmp
, increasing performance. The first parameter will be set to a Pointer to the bitmap; the second parameter is the string to be bitmapped.Let's see how simple regexes using the Rx4 engine. This examples will show inlined regexes (i.e., regexes that appear on the middle of perl code, of the kind that was so popular in the old perl5 days).
We won't deal then with named regular expressions (also known as rules) and the conventions used to call them (we expect that some form of the standard calling conventions will be used).
First of all, let's explain the concept behind the Rx4 ops. During the life-time of a match, we keep the state of the match in concrete, well-known (by the compiler) registers. There are at least three registers needed to save the state of match, which we will call S0, I0, I1 (there is no particular reason to use this three registers, the compiler can choose any registers of the right type that are free during the match, but we will use this ones in all our examples).
The purpose of this registers is:
S0
:I0
:I1
:As we will see, most of the rx opcodes read or modify at least one of this registers. Sometimes, the compiler can decide to use some other registers, to save temporary information about the match (like the position of the beginning of a group, for example).
Let's start with a really simple regex. Imagine that we want to compile the code
if (/^foobar/) { print 1 };
Now, this can done in a very simple way. Assuming that we have managed to put the string contents of $_ into S0, the code would be:
start:
set I0, 0
rx_literal S0, I0, "foobar", end
print 1
end:
Let's look at this instructions one at-a-time.
start:
## we start matching from the beginning of the match
set I0, 0
## match the string "foobar" at the current position, or branch to end
rx_literal S0, I0, "foobar", end
## we have matched, so let's print 1
print 1
end:
This was a very simple kind of pattern. Imagine now that we wanted to search for "foobar" in any point of the string. We would need to add some form of loop that iterates over the characters of the string, looking for the match "foobar". This example shows how:
set I0, 0
start:
set I1, I0
rx_literal, S0, I1, "foobar", advance
print "match"
branch end
advance:
rx_advance S0, I0, fail
branch start
fail:
print "no match"
end:
This looks very complicated, but it is not. The match starts at the position 0. It tries to match "foobar" at this position, and if it fails, is branches to rx_advance, where it will increase the start-of-match position by one and try again. When there are no characters left, rx_advance branches to $fail.
This is the skeleton of how regexes work in general. The main loop iterates over I0 (start-of-match) while the inner ops update I1 (match-position).
For this particular case, Rx4 offers an specialized op that can be used to optimize the computation. rx_search
merges the functionality of rx_advance
and rx_literal
in one op.
Using this new op, the code to match would look like:
set I0, 0
start:
rx_search, S0, I1, I0, "foobar", fail
print "match"
branch end
fail:
print "nomatch"
end:
This is very useful, since many matches start with a literal. Rx4 offers also rx_search_char
which searches for a single char, and rx_search_bmp
which searches for a character class.
From here it is mostly a matter of knowing how to implement the various forms of quantifiers and groups. The following section explains how the most common constructs can be implemented.
The list below gives simple templates for common quantifiers operations.
(This templates could be heavily optimized in the particular case that "x" is a literal. But that's not the point here.)
x*
start:
rx_pushmark
loop:
rx_pushindex I1
rx_literal S0, I1, "x", next
branch loop
back:
rx_popindex I1, lastback
x+
start:
rx_literal S0, I1, "x", lastback
rx_pushmark
loop:
rx_pushindex I1
rx_literal S0, I1, next
branch loop
back:
rx_popindex I1, lastback
x?
# this could be done using the stack, but this way
# is slightly faster
# I2 is used to check if we cannot backtrack more
start:
set I2, I1
set I3, 1
rx_literal S0, I1, "x", next
set I3, 0
branch next
back:
if I3, lastback
set I3, 1
set I1, I2
branch next
x*?
start:
set I2, I1
branch next
back:
set I1, I2
rx_literal S0, I1, "x", lastback
branch start
x+?
start:
rx_literal S0, I1, "x", lastback
set I2, I1
branch next
back:
set I1, I2
branch start
x??
start:
set I2, 0 #I2 used to make sure we haven't
# backtracked before
branch next
back:
if I2, lastback
rx_literal S0, I1, "x", lastback
set I2, 1
branch next
x|y|z
start:
set I2, I1 #I2 the beginning of the group
set I3, -6 #I3 next alternation in the group, expressed
#as an offset from the branch point
rx_literal S0, I1, "x", alt2
branch next
alt2:
set I1, I2
set I3, -3
rx_literal S0, I1, "y", alt3
branch next
alt3:
set I1, I2
set I3, 1
rx_literal S0, I1, "z", lastback
back:
branch I3
backlast:
x{m,n}
top_m:
inc I2
eq I2, m, n
rx_literal S0, I1, "x", lastback
branch top_m
n:
rx_pushmark
top_n:
inc I2
eq I2, n, out
rx_literal S0, I1, "x", next
rx_pushindex I1
branch top_n
back:
rx_popindex I1, lastback
branch next
XXX Finish this documentation.
Copyright (C) 2001-2004 The Parrot Team <perl6-internals@perl.org>.
Initial version by Brent Dax <brentdax@cpan.org>; special thanks to Angel Faus <afaus@corp.vlex.com> and Jeff 'japhy' Pinyan <japhy@pobox.com> for major help, especially with decisions on the architecture of the engine.
|