Glossary

[Lesson 1] - Regular Expressions

23/02/2017


Special characters

\ the backslash is the escape character.
The backslash gives special meaning to the character following it, or removes the special meaning of a special character. For example, the combination "\n" stands for the newline, one of the control characters. The combination "\w" stands for a "word" character, one of the convenience escape sequences, while "\1" is one of the substitution special characters.
Example: The regex "aa\n" tries to match two consecutive "a"s at the end of a line, including the newline character itself.
Example: "a\+" matches "a+" and not a series of one or more "a"s.
^ the caret is the start of line anchor or the negate symbol.
Example: "^a" matches "a" at the start of a line.
Example: "[^0-9]" matches any non digit.
$ the dollar is the end of line anchor.
Example: "b$" matches a "b" at the end of a line.
Example: "^$" matches the empty line.
{ } the open and close curly brackets are used as range quantifiers.
Example: "a{2,3}" matches "aa" or "aaa".
[ ] the open and close square brackets define a character class to match a single character.
A "^" as the first character following the "[" negates the class, so the match is for the characters not listed. The "-" denotes a range of characters. Inside a "[ ]" character class construction most special characters are interpreted as ordinary characters.
Example: "[d-f]" is the same as "[def]" and matches "d", "e" or "f".
Example: "[a-z]" matches any lowercase letter of the alphabet.
Example: "[^0-9]" matches any character that is not a digit.
Example: A search for "[][()?<>.*?]" in the string "[]()?<>.*?" followed by a replace string "r" has the result "rrrrrrrrrr" (one "r" for each of the ten characters). Here the search string is one character class and all the meta characters are interpreted as ordinary characters without the need to escape them.
( ) the open and close parentheses are used for grouping characters (or other regexes).
The groups can be referenced in both the search and the substitution phase. There also exist some special constructs with parentheses.
Example: "(ab)\1" matches "abab".
. the dot matches any character except the newline.
Example: ".a" matches two consecutive characters where the last one is "a".
Example: ".*\.txt$" matches all strings that end in ".txt".
* the star is the match-zero-or-more quantifier.
Example: "^.*$" matches an entire line.
+ the plus is the match-one-or-more quantifier.
? the question mark is the match-zero-or-one quantifier. The question mark is also used in special constructs with parentheses and in changing match behaviour.
| the vertical pipe separates a series of alternatives.
Example: "(a|b|c)a" matches "aa" or "ba" or "ca".
< > the less-than and greater-than signs are anchors that specify a left or right word boundary.
- the minus indicates a range in a character class (when it is not at the first position after the "[" opening bracket or the last position before the "]" closing bracket).
Example: "[A-Z]" matches any uppercase character.
Example: "[A-Z-]" or "[-A-Z]" match any uppercase character or "-".
& the ampersand is the "substitute complete match" symbol.
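
As a rough illustration of several entries above, the sketch below uses Python's re module. Using Python here is an assumption made for illustration only: its dialect differs in places from the one described in these notes (for example, it has no "<" and ">" word anchors and does not use "&" in replacement strings), so only constructs shared by both are shown. All variable names are hypothetical.

```python
import re

# Escaped plus: "a\+" matches the literal text "a+", not a run of "a"s.
escaped_plus = re.search(r"a\+", "aaa a+ aa").group()

# Anchors: "^a" matches only at the start, "b$" only at the end.
starts_with_a = bool(re.search(r"^a", "abc"))
ends_with_b = bool(re.search(r"b$", "abc"))

# Negated character class: "[^0-9]" matches any non-digit.
non_digits = re.findall(r"[^0-9]", "a1b22c")

# Range quantifier: "a{2,3}" accepts "aa" and "aaa" but not "a".
range_hits = [s for s in ("a", "aa", "aaa") if re.fullmatch(r"a{2,3}", s)]

# Inside a character class the meta characters are ordinary characters.
class_replaced = re.sub(r"[][()?<>.*?]", "r", "[]()?<>.*?")

# Grouping with a back-reference: "(ab)\1" matches "abab".
repeated_group = re.search(r"(ab)\1", "xxababxx").group()

# Alternation: "(a|b|c)a" matches "aa", "ba" or "ca".
alternatives = [m.group() for m in re.finditer(r"(a|b|c)a", "aa ba ca da")]
```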


Quantifiers

* Try to match the preceding regular expression zero or more times.
Example: "(ab)c*" matches "ab" followed by zero or more "c"s, i.e., "ab", "abc", "abcc", "abccc" ...
+ Try to match the preceding regular expression one or more times.
Example: "(ab)c+" matches "ab" followed by one or more "c"s, i.e., "abc", "abcc", "abccc" ...
{m, n} Try to match the preceding regular expression between m and n times.
If you leave m out, it is assumed to be zero. If you leave n out, it is assumed to be infinity. I.e., "{,n}" matches from zero to n times, "{m,}" matches a minimum of m times, "{,}" matches the same as "*" and "{n}" is shorthand for "{n,n}" and matches exactly n times.
Example: "(ab){1,2}" matches "ab" and "abab".
? Try to match the preceding regular expression zero or one time.
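
The quantifiers above can be tried out with a small sketch in Python's re module (an assumption; the notes do not name a particular engine). The candidate strings and the helper whole_matches are hypothetical names introduced only for this illustration.

```python
import re

candidates = ["ab", "abc", "abccc", "abab", "ababab"]

def whole_matches(pattern):
    # Hypothetical helper: keep the candidates that the pattern matches in full.
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = whole_matches(r"(ab)c*")        # "ab" followed by zero or more "c"s
plus = whole_matches(r"(ab)c+")        # "ab" followed by one or more "c"s
optional = whole_matches(r"(ab)c?")    # "ab" followed by zero or one "c"
bounded = whole_matches(r"(ab){1,2}")  # "ab" once or twice
exact = whole_matches(r"(ab){3}")      # "ab" exactly three times
```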

Changing match behaviour
By default, the quantifiers above try to match as much as possible: they are greedy. You can change greedy behaviour to lazy behaviour by adding an extra "?" after the quantifier.
Example: In the string "cabddde", the search "abd{1,2}" matches "abdd", while the search for "abd{1,2}?" matches "abd".
Example: In the string "cabddde", the search "abd+" matches "abddd", while the search for "abd+?" matches "abd".
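
The two examples above can be reproduced with Python's re module, which is assumed here to treat the trailing "?" the same way. The variable names are illustrative only.

```python
import re

text = "cabddde"

# Greedy: take as many "d"s as the quantifier allows.
greedy_range = re.search(r"abd{1,2}", text).group()
greedy_plus = re.search(r"abd+", text).group()

# Lazy: the extra "?" makes the quantifier stop as early as possible.
lazy_range = re.search(r"abd{1,2}?", text).group()
lazy_plus = re.search(r"abd+?", text).group()
```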


Anchors

^ Try to match the (following) regex at the beginning of a line.
Example: "^ab" matches "ab" only at the beginning of a line and not, for example, in the line "cab".
$ Try to match the (following) regex at the end of a line.
< Try to match the regex at the start of a word.
The character class that defines a word can be found in the convenience escape sequences section.
> Try to match the regex at the end of a word.
\B Not a word boundary
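
A minimal sketch of the anchors, again assuming Python's re module. Python does not support the "<" and ">" anchors; placing its "\b" word-boundary escape before or after the text is used below as the closest equivalent, which is an assumption of this example rather than part of the dialect described above.

```python
import re

text = "cab abs tab"

# "^ab" does not match here, because the line starts with "cab".
at_line_start = re.search(r"^ab", text)

# With the MULTILINE flag, "^" also matches right after each newline.
line_starts = re.findall(r"^ab", "cab\nabc", re.MULTILINE)

# Start of word (the role of "<"): a boundary directly before "ab".
word_start = [m.start() for m in re.finditer(r"\bab", text)]

# End of word (the role of ">"): a boundary directly after "ab".
word_end = [m.start() for m in re.finditer(r"ab\b", text)]

# \B: "ab" must not begin at a word boundary.
inside_word = [m.start() for m in re.finditer(r"\Bab", text)]
```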


Special constructs with parenthesis

Some special constructs exist with parentheses.
(?:regex) is a grouping-only construct.
It exists merely for efficiency reasons and facilitates grouping.
(?=regex) is a positive look-ahead.
A match of the regular expression contained in the positive look-ahead construct is attempted. If the match succeeds, the text consumed by the look-ahead construct is first unmatched, and control is then passed to the regex following this construct, which continues from the position where the look-ahead started.
(?!regex) is a negative look-ahead.
Functions like a positive look-ahead, except that the regex must not match.
Example: "abc(?!.*abc.*)" searches for the last occurrence of "abc" in a string.
(?iregex) is a case insensitive regex.
(?Iregex) is a case sensitive regex.
By default, a regex is case sensitive.
Example: "(?iaa)" matches "aa", "aA", "Aa" and "AA".
(?nregex) matches newlines.
(?Nregex) doesn't match newlines.
All the constructs above do not capture text and cannot be referenced, i.e., the parentheses are not counted. However, you can make them capture text by surrounding them with ordinary parentheses.
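
The following sketch shows the grouping-only and look-ahead constructs in Python's re module (an assumed stand-in). Python writes the case-insensitive group as "(?i:regex)" rather than "(?iregex)"; treating the two as equivalent is an assumption of this example. The sample strings are made up for illustration.

```python
import re

# (?:regex) groups without capturing, so "(c)" is group number 1.
grouped = re.search(r"(?:ab)+(c)", "ababc")
grouping_only = (grouped.group(), grouped.groups())

# (?=regex) positive look-ahead: "foo" only when "bar" follows;
# the "bar" itself is not part of the match.
positive = [hit.span() for hit in re.finditer(r"foo(?=bar)", "foobar foobaz")]

# (?!regex) negative look-ahead: the last "abc" is the one
# that has no further "abc" after it.
last_abc = re.search(r"abc(?!.*abc)", "abc-abc-abc").start()

# Case-insensitive group, spelled (?i:regex) in Python.
samples = ("aa", "aA", "Aa", "AA", "ab")
case_insensitive = [s for s in samples if re.fullmatch(r"(?i:aa)", s)]
```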


Special control characters

\a alert (bell)
\b backspace
\e ASCII escape character
\f form feed (new page)
\n newline
\r carriage return
Example: a search for "\r\n" followed by a replace "\r" changes Windows text files to Macintosh text files.
Example: a search for "\r" followed by a replace "\n" changes Macintosh text files to Unix text files.
Example: a search for "\r\n" followed by a replace "\n" changes Windows text files to Unix text files.
\t horizontal tab
\v vertical tab
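
The three line-ending conversions above can be sketched with Python's re.sub (an assumed stand-in for the search-and-replace described here). The sample texts are hypothetical.

```python
import re

windows_text = "one\r\ntwo\r\n"   # CR LF line endings
old_mac_text = "one\rtwo\r"       # CR-only line endings

# Windows -> Macintosh: replace each CR LF pair with a single CR.
windows_to_mac = re.sub(r"\r\n", "\r", windows_text)

# Macintosh -> Unix: replace each CR with LF.
mac_to_unix = re.sub(r"\r", "\n", old_mac_text)

# Windows -> Unix: replace each CR LF pair with a single LF.
windows_to_unix = re.sub(r"\r\n", "\n", windows_text)
```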


Convenience escape sequences

\d matches a digit: [0-9]
Example: "-?\d+" matches any integer
\D not a digit: [^0-9]
\l a letter: [a-zA-Z]
\L not a letter: [^a-zA-Z]
\s whitespace: [ \t\n\r\f\v]
\S not whitespace: [^ \t\n\r\f\v]
\w "word" character: [a-zA-Z0-9_]
Example: "\w+" matches a "word", i.e., a string of one or more characters that may consist of letters, digits and underscores
\W not a "word" character: [^a-zA-Z0-9_]
\B any character that is not a word-delimiter
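
A short sketch of these escapes in Python's re module, under two assumptions: the re.ASCII flag is passed so that "\d", "\w" and "\s" correspond to the ASCII classes listed above, and "[a-zA-Z]" is written out because Python has no "\l" or "\L" escapes. The sample strings are made up.

```python
import re

# "-?\d+" matches integers, with an optional leading minus sign.
integers = re.findall(r"-?\d+", "x=-42, y=7", re.ASCII)

# "\w+" matches runs of letters, digits and underscores.
words = re.findall(r"\w+", "snake_case, word2!", re.ASCII)

# "\s+" matches runs of whitespace, here used to split fields.
fields = re.split(r"\s+", "a \t b\nc", flags=re.ASCII)

# "\W" matches any character that is not a word character.
non_word = re.findall(r"\W", "a-b c", re.ASCII)

# Letters only: the explicit class stands in for "\l".
letters = re.findall(r"[a-zA-Z]+", "abc123DEF")
```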


Octal and hexadecimal escapes

An octal number can be represented by the octal escape "\0" and at most three digits from the digit class [0-7]. The octal number should not exceed \0377.
A hexadecimal number can be represented by the hexadecimal escape "\x" or "\X" and at most two characters from the class [0-9A-F]. The hexadecimal number should not exceed \xFF.
Example: \053 and \X2B both specify the "+" character.
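
As a quick check of the example above, the sketch below assumes Python's re module, which accepts "\053" and the lowercase "\x2B" form in a pattern; the uppercase "\X" spelling is not used because Python does not support it.

```python
import re

# Octal 53 and hexadecimal 2B are both decimal 43, the code of "+".
octal_code = 0o53
hex_code = 0x2B

# Both escapes match a literal plus sign.
octal_matches_plus = re.fullmatch(r"\053", "+") is not None
hex_matches_plus = re.fullmatch(r"\x2B", "+") is not None
```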


Substitution special characters

The substitution string is mostly interpreted as ordinary text except for the special control characters, the octal and hexadecimal escapes and the following character combinations:
\1 ... \9 are backreferences to sub-expressions 1 ... 9 in the match.
Any of the first nine sub-expressions of the match string can be inserted into the replacement string by inserting a "\" followed by a digit from 1 to 9 that represents the string matched by a parenthesized expression within the regular expression. The numbering is left to right.
Example: A search for "(a)(b)" in the string "abc", followed by a replace "\2\1" results in "bac".
& reference to the entire match.
The entire string that was matched by the search operation will be substituted.
Example: a search for "." in the string "abcd" followed by the replace "&&" doubles every character in the result "aabbccdd".
\U \u to uppercase.
The text inserted by "&" or "\1" ... "\9" is converted to uppercase ("\u" only changes the first character to uppercase).
Example: A search for "(aa)" in the string "aabb", followed by a replace "\U\1bc" results in the string "AAbcbb".
\L \l to lowercase.
The text inserted by "&" or "\1" ... "\9" is converted to lowercase ("\l" only changes the first character to lowercase).
Example: A search for "(AA)" with a replace "\l\1bc" in the string "AAbb" results in the string "aAbcbb".
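
The substitution examples above can be approximated in Python's re module, with some assumptions: Python supports "\1" ... "\9" in the replacement, but it has no "&", "\U" or "\l"; the sketch uses "\g<0>" for the entire match and replacement functions for the case changes. These are stand-ins chosen for illustration, not the syntax described in these notes.

```python
import re

# "\2\1" swaps the two captured sub-expressions.
swapped = re.sub(r"(a)(b)", r"\2\1", "abc")

# "\g<0>" is Python's reference to the entire match (the role of "&").
doubled = re.sub(r".", r"\g<0>\g<0>", "abcd")

# Stand-in for "\U\1bc": uppercase group 1, then append "bc".
upper = re.sub(r"(aa)", lambda m: m.group(1).upper() + "bc", "aabb")

# Stand-in for "\l\1bc": lowercase only the first character of group 1.
def lower_first(m):
    captured = m.group(1)
    return captured[:1].lower() + captured[1:] + "bc"

lowered = re.sub(r"(AA)", lower_first, "AAbb")
```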