Backtracking in regards to correctness and efficiency

First, you should note that during a non-match, both traditional NFA and POSIX NFA RegularExpressionEngines proceed in the same way. POSIX NFA always do all the processing even when a successful match occurs. So, limiting backtracking in the non-match case also improve POSIX NFA engine when a match occurs.

All examples here has the same objective, matching an empty XML tag. Not all has the same correctness and for sure there efficiency are really varying.

<.+>: an easy crafted but imprecise regex

 

This example is the first that came to mind, it is almost incorrect as we will discover.

Initial state
Text


Regex
 
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
 
 
<.+/>
 
    Step 1 of 8

<.+?>: Still imprecise but not as lazy as we could thought

 

Since the previous does not achieve what we really want, lets try this one. We expect it better since we will not suffer of to much greediness. Hopeless, the problem here will be more inefficiency and too much laziness.

Initial state
Text


Regex
 
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
 
 
<.+?/>
 
    Step 1 of 11

<(?:/[^<>]|[^/<>])+/>: More precise but still inefficient

 

We have understand from the previous examples that we should be more precise. So here, we precise more carefully what could be between our opening and closing delimiter. In English we said, match a non-slash or a slash not followed by angle-brackets in place of just anything between angle brackets. For the sake and simplicity of this example, we do not consider here what starts with < (and unexpectidely ends with >).

We will discover that this approach is inefficient by looping to much on an alternation.

Initial state
Text


Regex
 
 
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
 
 
<(?:/[^<>]|[^/<>])+/>
 
 
    Step 1 of 16

<(?:[^/<>]+|/[^<>])+/>: Still more precise and nearly efficient

 

We now have a better understanding of the content we expect between our opening and closing tag. There is a normal case ([^/<>]+) and a special case (/[^<>]). We can learn from the previous example that a better ordering of our alternation will reduce dramaticaly unuseful tries and backtracks of our special case.

You may have also noticed that it could be interresting to avoid looping the alternation. Why not multiplying our normal case ? Multiplying our special case /[^>] may also seems interesting. However, our sample input shows that it will be really infrequent, so we will keep it as is.

Hopeless for us, we will discover what super-linear regex means. For the sake and simplicity of this example, we do not consider here what starts with < (and unexpectidely ends with >).

Initial state
Text


Regex
 
 
 
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
 
 
<(?:[^/<>]+|/[^<>])+/>
 
 
 
    Step 1 of 12

<[^/<>]+(?:/+[^/<>]+)*/>: The unrolling the loop technique

 

The unrolling the loop technique describe in details by Jeffrey Friedl, in his Mastering Regex book, is the most efficient techniques to match an open/close sequence in a regular expression and also avoid super-linear problems.

This technique is based on the idea that in our open/close sequence, the content is composed of a normal case and a special case. The general syntax is:

opening normal* (?:special normal*)* closing

However, applied to our specific example, the first normal* became a normal+ since there must be at least a normal case at the beginning. We explicitly exclude the startup. Again, after our special case, there should be at least a normal case, so we also exclude //>, ///>, and so on. Finally, since our special case is always followed by a normal one, we can safely multiply it using a +. If this was not the case, the (special+ normal*)* could be reduced to (special+)* which leads to a super-linear match.

Initial state
Text


Regex
 
 
 
<a href="http://www.softec.st><img src="softec.png"/><br/>SOFTEC sa</a>
 
 
<[^/<>]+(?:/+[^<>/]+)*/>
 
 
 
    Step 1 of 14

Comments
  • This is the most efficient solution
  • Here, we also consider starts with //< (and less usefull ends with //>) by multiplying the / in /+[^<>/]. We also exclude it for the following class, to respect rule XXX of the unrolling the loop technique.
  • We also consider that </> is not valid

You can see from this example that the unrolling the loop technique present several advantage:

  • When matches are successful, backtracking is limited to the minimum
  • When matches fails, backtracking is more important but every unusefull tries fails quickly. This could be optimized further by using possessive quantifier or atomic grouping when these features are available.

Compare to previous examples, you can also conclude that the unrolling the loop technique provide the most guided regular expression, and as a general rule, a well guided expression is generally a fast expression under a NFA engine since a NFA engine is driven by the expression similarly to a procedural language.