Regular Expression Engines

Regular expression engines are the base of the implementation of regular expression matching algorithm. There are several well known implementations. The Henry Spencer's NFA engine has been the base of most of them.

Regex implementation are based on two main kind of engine: DFA and NFA. Perl, Java, .NET languages, PHP, Python, Ruby,... and most tools implement a Traditional NFA engines. Some less widespread tools like mawk use a POSIX NFA engine which is a variation of the previous one. Awk, egrep, flex, lex, MySQL,... that mostly needs to verify efficiently the success of an overall match implement the more efficient DFA engine. Finally some tools like GNU awk, GNU egrep and Tcl used the best of both world with an hybrid NFA/DFA engine.

Depending on the implementation and the base engine type, the following element may differ:

  • Syntax
  • What will be matched
  • Look-around features
  • Capture features
  • Efficiency optimization

All engines try to find the longest leftmost match of the pattern in the text. But what is  the longest leftmost may be different depending on the type of engines.

Traditional NFA engine

The Nondeterministic Finite Automaton algorithm used in an NFA engine is an algorithm driven by the expression to be matched. This makes the regular expression like a mini procedural-like language that control the way the engine will try/fail during a match.

The engine will start at the beginning of the expression and try to match it against the beginning of the text. Until both matches, it will bumps-along and attempt to match the start of the regex at the next character in the text.

When both starts to match, the engine will advance in the regex, and try to match the next part of it. If it have to make a choice in the regex, like with an alternation or an optional quantifier, it choose one of the alternative and remember the other(s) as well as the place in the text where the choice was made. If later in the processing, the match failed, and it have left some alternative unexplored, the engine will backtrack where the choice was made and try the other(s) alternative(s). If no more alternatives are available, the overall match fail and the bump-along moves the engine to the next character.

If the engine reach the end of the regex and everything has matched, it takes this expression has the successful match, and eventually drops all alternatives left behind without even exploring them. This has one important consequence: the order choosed to try the alternatives is significant, and the longest-leftmost match could not always be respected.

The engine really follow the regular expression for its matches and let you choose every steps it will take to reach a sucessfull match. You will have to be careful to tell him which alternative to check first to achieve what you expect. And you will also have the opportunity to tune your regular expression to minimize backtracking and match early. Some examples of the method used in an NFA engine could also help you understand how backtracking really works.

POSIX NFA engine

A POSIX NFA engine is similar to a Traditional NFA engine, but when a successful match is found, it will be memorized and other available alternatives will still be tried to see if a longer leftmost match could be found. This remove the important consequence shown in the traditional NFA engine.

Taking this interesting (horribly contrived) exemple from Friedl's book: match /one(self)?(selfsufficient)?/ against 'oneselfsufficient', a Traditionnal NFA will match 'oneself' and a POSIX NFA will match 'oneselfsufficient' because the alternative between matching /(self)?/ or nothing will be tried with nothing and leads to a longer match.

DFA engine

The Deterministic Finite Automaton algorithm used in an DFA engine is an algorithm driven by the text to be searched. For each characters in the searched text, the DFA engine will look at it only once. To achieve that, it will maintains several on-going matches simultaneously during its progression. It will in fact tried all the possible alternative {in parrallel}, and will return the longest leftmost successful match.

This method is really more efficient but it also have important drawbacks:

  • You have no control on the way the expression return a match, it will always return the longuest-leftmost match, no matter how you have crafted your expression.
  • There is no backtracking, so all repeating operators are greedy. Atomic grouping and possessive quantifiers are also non-senses. (See RegularExpressionsBacktracking for more details)
  • Lookarounds are impossibles
  • Capturing and back-references are also impossible
  • Regex pre-compilation is longer and takes more memory

Hybrid NFA/DFA

Since the DFA engine is lightning fast, but features of the NFA engine are also really useful, an hybrid NFA/DFA engine try to take the best of both worlds. (AFAIK) The only fully hybrid implementation available to date is the Henry Spencer's engine used in Tcl which is a complete rewrite of its original implementation. The GNU egrep and awk one are just two separate engines which are choosen depending of the presence of NFA specific features.