Any character and character classes
To match any one character from a list of characters, you can construct a character class. A character class is surrounded by square brackets ([]). Between these brackets, you list all the characters of the class. You can define a range of characters with the dash syntax, e.g. [a-z] matches any lowercase letter. A literal dash can be included in a character class by placing it first (most implementations also accept it last), as in [-a-z], which matches a lowercase letter or a dash.
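As a minimal sketch of the two classes described above, the following uses Python's standard `re` module. The variable names and the sample string are illustrative only.

```python
import re

# [a-z] matches exactly one lowercase ASCII letter.
lowercase = re.compile(r"[a-z]")
# A leading dash is taken literally: [-a-z] matches a lowercase letter or a dash.
lower_or_dash = re.compile(r"[-a-z]")

sample = "a-B-c"
print(lowercase.findall(sample))      # ['a', 'c']
print(lower_or_dash.findall(sample))  # ['a', '-', '-', 'c']
```

The uppercase B is skipped by both patterns because neither class lists it.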
Negated character classes
You can also create a character class that matches any character except those listed in the class. Such a class begins with a caret, e.g. [^a-z] matches any character except a lowercase letter.
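A short sketch of a negated class, again using Python's `re` module with an illustrative sample string:

```python
import re

# [^a-z] matches any single character that is NOT a lowercase ASCII letter.
not_lowercase = re.compile(r"[^a-z]")

print(not_lowercase.findall("ab1 C"))  # ['1', ' ', 'C']
# A negated class also matches characters that are easy to overlook, such as a newline.
print(not_lowercase.search("\n") is not None)  # True
```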
Unicode properties
Unicode-aware implementations usually provide shorthands for classes based on Unicode properties. Depending on the implementation, the same Unicode property class may have one or more shorthands. The following table shows the most useful ones:
Shorthand | Synonyms | Properties |
---|---|---|
\p{L} | \p{Letter} | All letters |
\p{Ll} | \p{Lowercase_Letter} | Lowercase letters |
\p{Lu} | \p{Uppercase_Letter} | Uppercase letters |
\p{Lt} | \p{Titlecase_Letter} | Titlecase letters |
\p{L&} | equivalent to [\p{Ll}\p{Lu}\p{Lt}] | Cased letters |
\p{Lm} | \p{Modifier_Letter} | Letter-like special characters |
\p{Lo} | \p{Other_Letter} | Other letters not in \p{L&} or \p{Lm} |
\p{M} | \p{Mark} | Characters usually combined with others, mostly accents |
\p{Mn} | \p{Non_Spacing_Mark} | Accents and similar marks |
\p{Mc} | \p{Spacing_Combining_Mark} | Mostly vowel signs in certain languages |
\p{Me} | \p{Enclosing_Mark} | Circles, squares, ... used to enclose other characters |
\p{Z} | \p{Separator} | Spaces and similar invisible characters |
\p{Zs} | \p{Space_Separator} | Various space characters |
\p{Zl} | \p{Line_Separator} | U+2028 |
\p{Zp} | \p{Paragraph_Separator} | U+2029 |
\p{S} | \p{Symbol} | Mathematical symbols and other dingbats |
\p{Sm} | \p{Math_Symbol} | Mathematical symbols |
\p{Sc} | \p{Currency_Symbol} | Currency symbols |
\p{Sk} | \p{Modifier_Symbol} | Modifier symbols: marks that stand as full characters on their own |
\p{So} | \p{Other_Symbol} | Dingbats, box-drawing and so on |
\p{N} | \p{Number} | Numeric characters |
\p{Nd} | \p{Decimal_Digit_Number} | Decimal digits in various languages |
\p{Nl} | \p{Letter_Number} | Roman numerals |
\p{No} | \p{Other_Number} | Subscript and superscript digits, some numeric symbols, ... |
\p{P} | \p{Punctuation} | Punctuation characters |
\p{Pd} | \p{Dash_Punctuation} | Hyphens and dashes |
\p{Ps} | \p{Open_Punctuation} | Opening parentheses, brackets... |
\p{Pe} | \p{Close_Punctuation} | Closing parentheses, brackets... |
\p{Pi} | \p{Initial_Punctuation} | Opening quotes... |
\p{Pf} | \p{Final_Punctuation} | Closing quotes... |
\p{Pc} | \p{Connector_Punctuation} | Underscore and similar kinds of punctuation... |
\p{Po} | \p{Other_Punctuation} | Other punctuation not in other classes |
\p{C} | \p{Other} | Other characters not in previous classes |
\p{Cc} | \p{Control} | Control characters |
\p{Cf} | \p{Format} | Invisible formatting characters |
\p{Co} | \p{Private_Use} | Private-use area |
\p{Cn} | \p{Unassigned} | Unassigned Unicode code points |
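Python's standard `re` module does not accept the \p{...} syntax (the third-party `regex` package does). As a rough sketch of how the two-letter codes in the table map onto characters, the example below uses `unicodedata.category` from the standard library instead of a regular expression. The helper names `has_property` and `find_all` are hypothetical and exist only for this illustration.

```python
import unicodedata

def has_property(char, prop):
    """Return True if the general category of char starts with prop.

    A one-letter prop such as "L" covers every subcategory ("Lu", "Ll", ...),
    mirroring how \\p{L} includes \\p{Lu} and \\p{Ll}.
    """
    return unicodedata.category(char).startswith(prop)

def find_all(text, prop):
    """Collect the characters of text that carry the given property."""
    return [char for char in text if has_property(char, prop)]

sample = "\u00c9t\u00e9 2024 \u20ac5"  # "Été 2024 €5"

print(find_all(sample, "L"))   # ['É', 't', 'é']
print(find_all(sample, "Lu"))  # ['É']
print(find_all(sample, "Sc"))  # ['€']
print(unicodedata.category("\u2028"))  # Zl
print(unicodedata.category("\u00b2"))  # No (superscript two)
```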
Predefined classes
Most implementations also provide several shorthands for useful character classes:
Shorthand | Class | Unicode remarks |
---|---|---|
\d | [0-9] | May also include all Unicode digits |
\D | [^0-9] | May also exclude all Unicode digits |
\w | [a-zA-Z0-9_] | May also include all Unicode letters or current locale letters. The underscore is sometimes not included |
\W | [^a-zA-Z0-9_] | Opposite of \w |
\s | [ \f\n\r\t\v] | May also include U+0085 and sometimes all Unicode whitespace characters (\p{Z}) |
\S | [^ \f\n\r\t\v] | Opposite of \s |
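The Unicode remarks in the table can be observed in Python, where `str` patterns are Unicode-aware by default and the `re.ASCII` flag restricts the shorthands to the ASCII classes shown in the second column. This is a sketch of one implementation's behaviour, and other engines may differ.

```python
import re

digits = "42 \u0663\u0664"  # ASCII digits, then Arabic-Indic digits three and four

unicode_digits = re.findall(r"\d", digits)          # Unicode-aware by default
ascii_digits = re.findall(r"\d", digits, re.ASCII)  # restricted to [0-9]
print(len(unicode_digits), len(ascii_digits))       # 4 2

word = "caf\u00e9_1"  # "café_1"
print(re.findall(r"\w+", word))            # ['café_1']
print(re.findall(r"\w+", word, re.ASCII))  # ['caf', '_1']

# U+00A0 (no-break space) counts as whitespace only in Unicode mode.
print(re.search(r"\s", "\u00a0") is not None)            # True
print(re.search(r"\s", "\u00a0", re.ASCII) is not None)  # False
```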
POSIX also defines bracket expressions that can be used inside character classes:
POSIX classes | Class |
---|---|
[:alnum:] | [a-zA-Z0-9] |
[:alpha:] | [a-zA-Z] |
[:blank:] | [ \t] |
[:cntrl:] | [\x00-\x1F\x7F] |
[:digit:] | [0-9] |
[:graph:] | Visible characters (not space, control and so on...) |
[:lower:] | [a-z] |
[:print:] | [:graph:] plus the space character |
[:punct:] | Punctuation characters |
[:space:] | [ \f\n\r\t\v] |
[:upper:] | [A-Z] |
[:xdigit:] | [0-9A-Fa-f] |
So, to match alphanumeric and punctuation characters, use [[:alnum:][:punct:]].
POSIX bracket expressions are locale dependent, which may introduce some differences from the definitions stated above. More expressions may also be available depending on the locale. Finally, Unicode-aware implementations may use Unicode properties to define these expressions.
POSIX has also introduced bracket expressions for collating sequences and character equivalence classes. The first is used to match multiple characters as a single one. For example, the Spanish word tortilla contains the collating sequence ll, which can be matched using [[.span-ll.]]. The second is used to create a class containing all the variations of the same character. For example, [[=a=]] is similar to [aàâáã…].
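Python's `re` module does not understand POSIX bracket expressions, so the sketch below expands them into the ASCII equivalents from the table before compiling. The helper `expand_posix` and the `POSIX_CLASSES` mapping are hypothetical, cover only part of the table, and ignore locale, collating sequences and equivalence classes entirely.

```python
import re
import string

# ASCII-only approximations taken from the table above (an assumption:
# a real POSIX engine would consult the current locale instead).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "space": r" \f\n\r\t\v",
    "xdigit": "0-9A-Fa-f",
    "punct": re.escape(string.punctuation),
}

def expand_posix(pattern):
    """Replace each [:name:] inside a character class with its ASCII members."""
    def replace(match):
        name = match.group(1)
        if name not in POSIX_CLASSES:
            raise ValueError("unsupported POSIX class: " + name)
        return POSIX_CLASSES[name]
    return re.sub(r"\[:(\w+):\]", replace, pattern)

alnum_or_punct = re.compile(expand_posix("[[:alnum:][:punct:]]+"))
print(alnum_or_punct.findall("Hi, there! 42"))  # ['Hi,', 'there!', '42']
```

Engines with native POSIX support accept [[:alnum:][:punct:]] directly, so the expansion step is only needed where that syntax is missing.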