First, you should note that during a non-match, both traditional NFA and POSIX NFA RegularExpressionEngines proceed in the same way. POSIX NFA always do all the processing even when a successful match occurs. So, limiting backtracking in the non-match case also improve POSIX NFA engine when a match occurs.
All examples here has the same objective, matching an empty XML tag. Not all has the same correctness and for sure there efficiency are really varying.
<.+>: an easy crafted but imprecise regex
This example is the first that came to mind, it is almost incorrect as we will discover.
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
<.+/>
|
|
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<.+/>
|
|
Comments
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
__
<.+/>
~~
|
|
Comments
- Match any character repeatedly with success
- Note that the first character is not a backtracking option because
at least one character should be matched, so the choice between
matching or skipping happens only after the first character. In fact,
.+
is equivalent to ..*
.
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^^
_
<.+/>
~~
|
|
Comments
- After .+, we were at the end of the string, so matching
/
fails.
- Backtracking starts and progressively consume saved choice by trying a
/
at each steps.
- At the third saved choice, a
/
match
Fail to match a > after the / |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<.+/>
~~
|
|
Comments
- Hopeless the
>
does not match after the /
, so backtracking should restart
Backtracking to match a / |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^^^^^^^^^^^
_
<.+/>
~~
|
|
Comments
- Backtracking restarts and progressively consume saved choice by trying again to match
/
at each steps.
- Twelve saved choice later, a
/
finally match
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<.+/>
~~
|
|
Comments
- This time we get the
>
after /
and this lead to a whole successful match.
- For a Traditional NFA, this is finished.
- For the POSIX NFA, it has to try all remaining alternatives as show on the next step
POSIX NFA try remaining saved choices |
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^^^^
¯¯
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
¯^
¯^
^^^^^^^^^^^^
__
<.+/>
|
|
Comments
- This step show all the remaining tries done by a POSIX NFA
- One more successful match is found, the one we are trying to get,
but this one is rejected in accordance with the rules: leftmost-longest
match wins.
- As we could see here, our regex is both incorrect and inefficient
<.+?>: Still imprecise but not as lazy as we could thought
Since the previous does not achieve what we really want, lets try this one. We expect it better since we will not suffer of to much greediness. Hopeless, the problem here will be more inefficiency and too much laziness.
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
<.+?/>
|
|
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<.+?/>
|
|
Comments
Match .+? once than save state |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
___
<.+?/>
~~~
|
|
Comments
- Like in the previous example, the choice between matching or
skipping happens only after the first character, so the first character
is matched
- Then, the lazy quantifier decide to skip and saved the choice to match.
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<.+?/>
~~~
|
|
Comments
- Fail to match the
/
, so backtracking occurs
Backtracking, matching, saving |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
___
<.+?/>
~~~
|
|
Comments
- As you probably notice, we restart the same process than step 3 with added backtracking
Repeated failing matching /, matching ., saving, failing, backtracking |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
___
<.+?/>
~~~
|
|
Comments
- Step 4 and 5 repeats eleven time before we have a potential match of
/
Matching / then failling on > |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^
__
<.+?/>
~~~
|
|
Comments
- Matching
/
with success
- Then try to match
>
but fails, so backtracking occurs
Repeated matching/failing /, matching ., saving, failing, backtracking |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
___
<.+?/>
~~~
|
|
Comments
- Step 4, 5 and 6 repeats once
- Then steps 4 and 5 repeats again several times up to another place where
/
could be tried with success
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯
__
<.+?/>
~~~
|
|
Comments
- Matching
/
with success
- Then matching
>
with success
- For traditional NFA, game is over since a successful match was found, hopeless, this is not the one we expect
- For POSIX NFA, we are not over, there is still one alternative to try, so a NFA engines will continue.
POSIX NFA continues with repeated matching, saving, and so on |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
___
<.+?/>
~~~
|
|
Comments
- Again steps 4 and 5 repeats several times up to another place where
/
could be tried with success
POSIX NFA finds the same match than the greedy example |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯
__
<.+?/>
~~~
|
|
Comments
- A POSIX NFA engine founds the same match we got in example one.
But you notice that it was really inefficient this time. Fewer
backtracking has been kept simultaneously on the stack, but a lots of
failed tries and backtracks as occurs.
- Since this match is longer than the previous, it is kept as a better successful match
- However, it is not over, there is still a saved
alternative to explore. A POSIX NFA, will continue step 4 and 5 up to
the end of string before considering this match as the longest-leftmost
one.
<(?:/[^<>]|[^/<>])+/>: More precise but still inefficient
We have understand from the previous examples that we should be more precise. So here, we precise more carefully what could be between our opening and closing delimiter. In English we said, match a non-slash or a slash not followed by angle-brackets in place of just anything between angle brackets. For the sake and simplicity of this example, we do not consider here what starts with < (and unexpectidely ends with >).
We will discover that this approach is inefficient by looping to much on an alternation.
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
<(?:/[^<>]|[^/<>])+/>
|
|
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<(?:/[^<>]|[^/<>])+/>
|
|
Comments
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<(?:/[^<>]|[^/<>])+/>
~~~~~~
|
|
Comments
- There is a greedy quantifier +, as explained before, saving states will start after the first successful match
- There is an alternation, the second alternative is saved for backtracking
- The first alternative is tried without success, so backtracking occurs
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
______
<(?:/[^<>]|[^/<>])+/>
|
|
Comments
- The second alternative match successfully
Repeat the alternation and retry / without success |
|
Text
Regex |
~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- This the second match of the greedy quantifier +, we have to save the skip it alternative
- There is an alternation, the second alternative is saved for backtracking
- The first alternative is tried without success, so backtracking occurs
Try the second alternative with success |
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
______
<(?:/[^<>]|[^/<>])+/>
~~~~~~~~~~~~~~~~~~
|
|
Comments
- It backtracks to the second alternative choice previously saved
- The second alternative match successfully
Match the / of the first alternative with success |
|
Text
Regex |
~~~~~~~~~~~~~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- Steps for 5 and 6 repeat several times
- Than the first alternative success on the first
/
of http://
- At that time, we have several saved choice for the greedy quantifier
+
and one choice for the alternation.
Match the first alternative with success |
|
Text
Regex |
~~~~~~~~~~~~~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_____
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- The characters following the slash is not an angle brackets, so it match successfully
Match the second alternative several more times, then failed |
|
Text
Regex |
~~~~~~~~~~~~~ ~~~~~~~~~~~~~~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
______
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- Inefficient steps 5 and 6 repeats agains several times
- It finally fail to match any of the alternatives on a closing angle brackets, so backtracking occurs
Backtraking and trying the end delimiter |
|
Text
Regex |
~~~~~~~~~~~~~ ~~~~~~~~~~~~~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- It backtracks to the skip it choice of the
+
quantifier
- After skipping the expression in parenthesis, it tries the
/
of the closing delimiter and fails, so backtracking occurs again
Backtraking on the alternative |
|
Text
Regex |
~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
______
<(?:/[^<>]|[^/<>])+/>
~~~~~~~~~~~~~~~~~~
|
|
Comments
- Previous step repeat several times
- A choice of the alternative arise at the top of the stack of saved option
- It tries the second alternative without success, and backtracking continues
All backtracking options has been tried |
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<(?:/[^<>]|[^/<>])+/>
|
|
Comments
- Step 10 repeat several more times
- Finally, all backtracking options has been tried without success
- The bump-along take us to the next character in the input text, and we try the expression again
- The
<
does not match, so backtracking occurs
There are no backtracking options |
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<(?:/[^<>]|[^/<>])+/>
|
|
Comments
- Since, there are no backtracking option, previous step repeat several times
- Finally, the
<
match again
After matching the second alternative, the first one starts matching |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^
______
<(?:/[^<>]|[^/<>])+/>
~~~~~~
~~~~~~~~~~~~~~~~~~
|
|
Comments
- Steps 3 and 4 occurs and then steps 5 and 6 repeats
- We reach similar step to step 7, where the first alternative start matching the
/
- But it finaly fails, so backtracking occurs
The second alternatives also fails |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
______
<(?:/[^<>]|[^/<>])+/>
~~~~~~~~~~~~~~~~~~
|
|
Comments
- The second alternative also fails, and backtracking occurs
The end delimiter is matching now |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯
__
<(?:/[^<>]|[^/<>])+/>
~~~~~~~~~~~~~~~~~~
|
|
Comments
- It backtracks to the skip it choice of the
+
quantifier
- After skipping the expression in parenthesis, it tries the
/
then the ">" of the closing delimiter with success
- It reached the end of the expression, for a traditional NFA, the
regex succeed and that is at long last finished. However, this times,
we have really match what we expect. But what a lot unuseful tries and
backtracks!
- For a POSIX NFA, there is still more works, it have to backtrack
all remaining saved choices, but it will not found any more sucessful
match and will finaly conclude that this one is ok.
<(?:[^/<>]+|/[^<>])+/>: Still more precise and nearly efficient
We now have a better understanding of the content we expect between our opening and closing tag. There is a normal case ([^/<>]+) and a special case (/[^<>]). We can learn from the previous example that a better ordering of our alternation will reduce dramaticaly unuseful tries and backtracks of our special case.
You may have also noticed that it could be interresting to avoid looping the alternation. Why not multiplying our normal case ? Multiplying our special case /[^>] may also seems interesting. However, our sample input shows that it will be really infrequent, so we will keep it as is.
Hopeless for us, we will discover what super-linear regex means. For the sake and simplicity of this example, we do not consider here what starts with < (and unexpectidely ends with >).
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
<(?:[^/<>]+|/[^<>])+/>
|
|
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<(?:[^/<>]+|/[^<>])+/>
|
|
Comments
|
Text
Regex |
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~
|
|
Comments
-
(?:…)+
as to be matched at least one, so no backtrack for now, remember the rules, (?:…)+
is equivalent to (?:…)(?:…)*
- There is an alternation, it saved the second alternative
-
[^/<>]+
as also to be matched at least one, so no backtrack option saved
-
[^/<>]
match successfully
|
Text
Regex |
~
~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
|
|
Comments
- The
[^/<>]+
repeats, and this time, it have to save the skip it alternative
- Then
[^/<>]
match again successfully
Repeated matching of [^/<>] then failed |
|
Text
Regex |
~
~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
|
|
Comments
- Previous step repeat several times
-
[^/<>]
finaly failed on the first /
of http://
, so backtracking occurs
End of the repetion, but retry it again |
|
Text
Regex |
~
~ ~
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- The skip it option of
[^/<>]+
succeed
- The repetition of
(?:…)+
is then involved for the second time, so its skip it choice is saved.
- Then, the alternation is retried, and it has to save its alternative choice.
- Finaly, the
[^/<>]+
is tried. No skip it choice is saved because it is the first time.
- Obviously, the match failed, and backtracking occurs
Match the second alternative |
|
Text
Regex |
~
~
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯
______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- First on the stack of saved states, there is the alternation's second alternative
- This second alternative is matched succesfully
Repeat the alternation and its first alternative until fails |
|
Text
Regex |
~ ~
~ ~
~~~~~~~~~~~~ ~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯¯¯¯¯¯¯¯¯¯¯¯^
_______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- Step 3, 4 and 5 happend again
- then the first alternative failed
Repeat the alternation and its alternatives |
|
Text
Regex |
~ ~ ~
~ ~
~~~~~~~~~~~~ ~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- Step 6 and 7 happend again, but this time, the second alternative does not matched.
Both alternatives having fail, skip the alternation |
|
Text
Regex |
~ ~
~ ~
~~~~~~~~~~~~ ~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
__
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- The saved skip it option of
(?:…)+
obviously succeed
- later, the try to match the closing expression failed, so backtracking occurs
Involve the skip it option of [^/<>]+ |
|
Text
Regex |
~ ~ ~
~ ~ ~
~~~~~~~~~~~~ ~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_______
<(?:[^/<>]+|/[^<>])+/>
~~~~~~~
~~~~~~
~~~~~~~~~~~~~~~~~~~
|
|
Comments
- On top of the backtracking stack, the saved choice is to skip
[^/<>]+
which obviously succeed.
- So the alternation succeed, and another alternation is tried due to the
(?:…)+
. This one save its own skip it option.
- The alternation save its alternative choice.
- Then
[^/<>]+
is tried...
- ...and match !!!
|
____________ ___________ ___________ _ __________ __________ _
www.softec.st, www.softec.st, www.softec.st, www.softec.st, www.softec.st,
¯ ¯¯ ¯ ¯¯¯ ¯¯
__________ __ __________ _ _________ _________ _ _________ __
www.softec.st, www.softec.st, www.softec.st, www.softec.st, www.softec.st,
¯ ¯ ¯ ¯¯¯¯ ¯¯¯ ¯¯
_________ _ _________ ___ _________ __ _________ _ _________ _ _
www.softec.st, www.softec.st, www.softec.st, www.softec.st, www.softec.st,
¯¯ ¯ ¯ ¯ ¯ ¯ ¯¯ ¯ ¯
and so on...
|
|
Comments
- and you got a super-linear match!
- Illustration shows the firsts steps of a match of the
www.softec.st
part using two, then three, and then four [^/<>]+
and so on
- You should note that this only happen when failing for a traditional NFA and this always happen to a POSIX NFA.
<[^/<>]+(?:/+[^/<>]+)*/>: The unrolling the loop technique
The unrolling the loop technique describe in details by Jeffrey Friedl, in his Mastering Regex book, is the most efficient techniques to match an open/close sequence in a regular expression and also avoid super-linear problems.
This technique is based on the idea that in our open/close sequence, the content is composed of a normal case and a special case. The general syntax is:
opening normal* (?:special normal*)* closing
However, applied to our specific example, the first normal* became a normal+ since there must be at least a normal case at the beginning. We explicitly exclude the startup. Again, after our special case, there should be at least a normal case, so we also exclude //>, ///>, and so on. Finally, since our special case is always followed by a normal one, we can safely multiply it using a +. If this was not the case, the (special+ normal*)* could be reduced to (special+)* which leads to a super-linear match.
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
<[^/<>]+(?:/+[^<>/]+)*/>
|
|
Comments
- This is the most efficient solution
- Here, we also consider starts with
//<
(and less usefull ends with //>
) by multiplying the / in /+[^<>/]
. We also exclude it for the following class, to respect rule XXX of the unrolling the loop technique.
- We also consider that
</>
is not valid
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<[^/<>]+(?:/+[^<>/]+)*/>
|
|
Comments
Match the [^/<>] repeatedly |
|
Text
Regex |
~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯¯¯¯¯¯¯¯¯¯¯¯^
_______
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
|
|
Comments
- Repeteadly match
[^/<>]+
(saving skip it options) until it fails.
|
Text
Regex |
~
~~
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯^
__
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
~~
~~~~~~~~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
[^/<>]+
obviously succeed
- Next, the
(?:…)*
is tried and its skip it option is saved
- Then
/+
matches twice (saving their own skip it option) before failing
Match the [^<>/]+ repeatedly</b> |
|
Text
Regex |
~
~
~~~~~~~~~~~~ ~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯¯¯¯¯¯¯¯¯¯¯¯^
_______
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~ ~~~~~~~
~~
~~~~~~~~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
/+
obviously succeed
- Next, the
[^<>/]+
matches repeatedly saving skip it option is saved along the way, until it fails.
|
Text
Regex |
~
~
~~~~~~~~~~~~ ~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~ ~~~~~~~
~~
~~~~~~~~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
[^<>/]+
obviously succeed
- Next, the
(?:…)*
try to repeat but fails, and obviously its skip it option succeed.
- So the matches of
/
is tried and fail
Several backtracking occurs on the skip it of [^<>/]+ |
|
Text
Regex |
~
~
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
~~
~~~~~~~~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
[^<>/]+
obviously succeed
- Next, the
(?:…)*
try to repeat but fails, and obviously its skip it option succeed.
- So the matches of the last
/
is tried again and fails
- These steps repeat several times, until all skip it option of
[^<>/]+
are consumed. Notice that here, super-linear match is avoid since the (?:…)*
does not repeat an alternation and this way [^<>/]+
may not be retried several times like in the previous example.
- And the matches of
/
fails again
And more backtracking occurs for the skip ip of / |
|
Text
Regex |
~
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_______
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
~~~~~~~~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
/+
obviously succeed
- Next, the
[^<>/]+
is tried and obviously does
not succeed. If it have, that would have been shame, since all previous
steps would have repeated for nothing.
Skip the (?:…) and fails on /> |
|
Text
Regex |
~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^
__
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
|
|
Comments
- Backtrack occurs, and the skip it option of
(?:…)*
obvioulsy succeed
- Then, the
/
matches, but the >
fails
Several backtracking occurs on the skip it of [^/<>]+ |
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
^
_
<[^/<>]+(?:/+[^<>/]+)*/>
|
|
Comments
- Backtrack occurs, and the skip it option of
[^/<>]+
obvioulsy succeed
- Then, the
(?:…)*
is retried, but fails on the /+
, and its skip it option succeed
- The final
/
is then tried, and fails
- These steps repeats until there is no more skip it options for
[^/<>]+
Bump along until the next < |
|
Text
Regex |
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯
_
<[^/<>]+(?:/+[^<>/]+)*/>
|
|
Comments
- Since there is no more backtrack option, the current try has completly fails, a bump to the next char occurs
- Previous steps repeats until we reach another
<
Match the [^/<>] repeatedly |
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯^
_______
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
|
|
Comments
- Repeteadly match
[^/<>]+
(saving skip it options) until it fails.
|
Text
Regex |
~
~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯^
_______
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
~~~~~~~~~~~~
|
|
Comments
- Backtracks occur and the skip ip options of
[^/<>]+
succeed
- The (?:…)* is tried, and its skip it option saved
- The
/+
match once than fails and its skip it option succeed
- Then the
[^<>/]+
fails
|
Text
Regex |
~~~~~~~~~~~~~~~~~~~
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
¯¯
__
<[^/<>]+(?:/+[^<>/]+)*/>
~~~~~~~
|
|
Comments
- Backtracks occur and the skip ip options of
(?:…)*
succeed
- Then the final
/>
matches, leading to a successfull match. For a Traditional NFA, the job is over, for a POSIX NFA, all remaining skip it options of [^/<>]+
will be examined without success before this same match is finaly returned
You can see from this example that the unrolling the loop technique present several advantage:
- When matches are successful, backtracking is limited to the minimum
- When matches fails, backtracking is more important but every unusefull tries fails quickly. This could be optimized further by using possessive quantifier or atomic grouping when these features are available.
Compare to previous examples, you can also conclude that the unrolling the loop technique provide the most guided regular expression, and as a general rule, a well guided expression is generally a fast expression under a NFA engine since a NFA engine is driven by the expression similarly to a procedural language.