Regular expressions
RegularExpressions (regex in short) is a powerful and flexible language for text searching. Regular expressions engines are efficient tools for parsing many kind of data, locating and even replacing data patterns. Regex are now available in most popular language either integrated or as an external library.
Regex are studied here both for their syntax to describe data patterns and their ability to efficiently match them against large amount of data. This document is neither an introduction nor a complete guide to regular expressions. We recommend all interested user to read the excellent O'Reilly's book from Jeffrey E. F. Friedl for really «Mastering Regular Expressions».
Regex's Origins
Regular Expressions is not a recent subject and were initially used in early 1940's to express the algebra of regular sets, a mathematical theory first describe by Stephen Kleene. Its theory has conduct to several study in the field of finite automata for data processing. The first regular expressions compiler was explain by Ken Thompson in article «Programming Techniques: Regular expression search algorithm» published in volume 11, issue 6, June 1968 of the Communication of the ACM.
An implementation of such compiler has been used in early release of ed, a well know *nix editor, and their regex functionality "Global Regular Expression Print" has engendered a still useful and well known tool, grep (which its currently most powerful implementation is GNU grep).
Regular expressions syntaxes and flavors
Beside the mostly unused POSIX standardization effort, their is a large variety of regular expression flavor in use today. Both the syntaxes and the engines are different and could lead to slightly different results for the same regex and input data. The integration of regular expression into the language also vary a lot, from full language integration to procedural and/or object libraries. Some tools and languages like perl and egrep has somewhat greatly enhance regex possibilities and are today's models for most emerging solutions.
All implementations of regular expressions usualy proceed in two steps1:
- lexical analysis and compilation of the regular expression to an appropriate byte code
- parsing of the input text using an regular expression engine
Regular expression engines are the base of the implementation of regular expression matching algorithm. There are several well known implementations, and the Henry Spencer's NFA engine has been the base for most regular expressions engines in use today. To have a pratical experience of the internal processing done by a Traditional NFA engine, read our regular expression examples.
- ^ Perl also provide a sort of compilation of the input text itself. Its casual use may sometimes improve the matching speed.