Unrolling the loop

Using an NFA engine which is the most common engine today, there are several optimisation that could be introduce by writing your regular expressions differently.  Jeffrey E. F. Friedl explains in details, how to craft an efficient regular expression for NFA engines, in his book «Mastering Regular Expressions». One of the most advanced technique shown by Jeffrey Friedl is what he have called the "Unrolling the loop" technique.

This optimisation thechnique is used to optimize repeated alternation of the form (expr1|expr2|...)*. These expression are not uncommon, and the use of another repetition inside an alternation may also leads to super-linear match. Super-linear match arise from the underterministic expression (a*)*. An exemple of super-linear match could be found in our backtracking example.

The unrolling the loop technique is based on the hypothesis that in most case, you kown in a repeteated alternation, which case should be the most usual and which one is exceptional. We will called the first one, the normal case and the second one, the special case. The general syntax of the unrolling the loop technique could then be written as:

normal* ( special normal* )*

Which could means something like, match the normal case, if you find a special case, matched it than match the normal case again. You notice that part of this syntax could potentialy lead to a super-linear match. To avoid a neverending match to append, the following rules shoud be carefully applied:

  • the start of the special case and the normal case must be mutually exclusive
  • special must always match at least one character
  • the special expression must be atomic: be careful to the fact that ( special normal* )* could be reduced to (special)*, which if special is special*, this became similar to (a*)* which is a undeterministic expression.

So keeping all these rules in mind, we could craft some efficient expressions:
Match C/C++ quotes : "[^"\\]* (\\.[^"\\]*)*" Match VB quotes : "[^"]*(""[^"]*)*" Match a SGML/XML tag taking care of > inside quotes: <[^>"]*("[^"]*"[^>"]*)*>

This is a really brief introduction of this interresting technique, remember to look at the «Mastering Regular Expressions» book for a full discussion.