User:Lvl/Regex course/Chapter 1

From DPWiki
Jump to navigation Jump to search

Chapter 1 -- the bases

Principle

A "regular expression" (sometimes abbreviated "regex") is a way of representing a "pattern" of text, using a particular syntax. The "patterns" are defined character by character, with some constructs that allow specifying various choices, and combinations. For instance, it is possible to express a regular expression representing "one or more digits followed by a comma". Searching using such a regular expression on the following text:

"Item 16, price $12,000,000."

will find three parts in that text corresponding to that pattern. We say that the regular expression will "match" these strings of characters: "16,", "12," and "000,".

There are two kinds of things to learn about regular expressions: thinking in terms of character combinations (e.g. "one or more digits followed by a comma"), and learning the ugly syntax that the computer understands (e.g. '[0-9]+,' or variants).

In this small course, I will represent regular expressions between single quotes 'like this', for clarity (don't include these single quotes in your favorite software!). Other pieces of text will be shown in "double quotes".


String or ordinary characters

The most simple regular expression is a string of ordinary characters. For instance, in the text "Hello to all", the regular expression 'He' matches the two first characters "He". The expression 'o to' also matches a part of this text. But there is no text matched by the regular expression 'hello' (because characters "H" and "h" are considered as different), nor by 'Helloto' (because there is a space in the text).

Exercise: how many places in the text "Welcome to this course" are matched by the regular expression 'co'?

Note on the syntax: actually a regular expression like 'He' is composed of two elements: 'H' which only matches a letter "H", and 'e' which only matches a letter "e". Gluing together these atomic elements creates a more elaborate regular expression 'He' which corresponds to anything matched by the first atom, immediately followed by anything matched by the second atom. It may sound silly to describe such a simple expression like this, but this is true also for more complex regular expressions.

Metacharacters, \

Some characters have a special usage, reserved for the syntax of regular expressions. These are called metacharacters:

{}[]()^$.|*+?\

If we want to search for one of these characters, we put a backslash before it. To match a single dot, we use '\.'; to match a "+" character, we use '\+', etc.

The regular expression '2+2' does not match any string in the text "2+2=4"
'2\+2' matches "2+2".

Note on the syntax: '2\+2' is the juxtaposition of three atoms: '2', '\+' which matches a character "+", and '2'.

Excepting metacharacters, all the other characters are allowed in a regular expression: a, 0, µ, Ñ, §, «, space, etc.

The backslash is also a metacharacter, so to specify a true backslash in a regular expression we put another backslash before it '\\'. Example:

'C:\\autoexec\.bat' matches "C:\autoexec.bat"

Exercise: A regular expression matching the text "G^al"?

Exercise: A regular expression matching "H_{2}O"?

Beginning and end of line ^ $

By default, a regular expression will match strings of characters located at any place in the lines in our text. But we can force the match to take place at the beginning or end of line:

The caret ^ matches the beginning of line.

'^foo' will only find the places in the text where the first three characters in the line are f, o, o in that order.

The dollar $ matches the end of line.

'line\.$' matches the five characters "line." at the end of line.

We can use both:

'^<tb>$' matches a line containing exactly "<tb>", with no characters before or after in its line.
'^$' matches an empty line (nothing between the beginning and the end).

Exercise: find all lines indented by at least two white spaces

Exercise: find lines with a space at the end

Exercise: find lines beginning with "/*"


Note on the syntax: in the regular expression '^\$', we again have two atoms: '^' which only matches at the beginning of the line, and '\$' which matches a "$" character.

Exercise: What does this regular expression mean: '^\\\$'


Character classes [ ] and [^ ]

I said at the beginning that regular expressions allow representing a whole set of possible strings of character. A first way to introduce variety are the character classes.

A character class is a piece of regular expression matching one character chose among a set of possible characters. We note that using square brackets [] with possible characters included inside the brackets.

The best way to show the syntax is using examples:

'[abc123]' means one character among "a", "b", "c", "1", "2", "3".

Inside the brackets it is possible to specify an interval using the minus sign '-':

'[0-9]' one character among "0", "1", …, "9", in the order of character codes; i.e. a digit.
'[A-Za-z]' one character among "A" … "Z", "a" … "z", i.e. one uppercase or lowercase letter (excluding diacritics).

It's also possible to invert the set by placing a ^ immediately after the opening bracket:

'[^abc]' means any character except "a", "b", "c"
'[^0-9]' means any character except a digit.
etc.

So: '[Yy]es' is a regular expression composed of three pieces: '[Yy]', which means one character among "Y" and "y"; 'e' which is one "e"; and 's' which is one "s". '[Yy]es' matches "Yes" and "yes".

'[Nn][Oo]' matches "no", "No", "NO" (but it also matches "nO").


Note on the syntax: I suppose it's not necessary to recall that '[Nn][Oo]' is composed of two regular expression atoms, '[Nn]' followed by '[Oo]'.

Exercise: find words containing two or more uppercase letters followed by a lowercase letter, like "THing", "STuff", "WHAtever".


The metacharacters do not need a backslash inside the character class; but we use the backslash in front of the characters which can cause trouble inside the brackets, which are "-", "^", "]" and "\":

'[\]a]bc' matches "abc" or "]bc".
'[a\-c]d' matches "ad", "-d" or "cd"
'[[\]\\]' means one character among "[", "]", "\".


(Détail: there is no need to put a backslash before - if it is the first or last character inside the brackets, i.e. just before ] or just after [ and [^; Same for ^, if it is not immediately after [ you don't need the backslash:
'[-ab]', '[ab-]' and '[a\-b]' are all equivalent.
'[\^a]', '[a\^]' and '[a^]' are all equivalent.
You can even omit the backslash before ] if it is immediately after [ and [^! (that was used in the first version of regular expressions in the geological ages of computers when the backslash did loose its special meaning inside brackets):
'[]u]' is equivalent to '[\]u]' and '[u\]]'
But that's just pedantry, I advise you to use the backslash for clarity.)


Exercise: Give other ways of writing '[abc123]'?

Exercise: Give a regular expression matching one character among "^" and "_"?

Exercise: Give a regular expression matching any character but "\"?

Exercise: Find a "q" followed by a character other than "u"?

Exercise: Find a punctuation symbol with a space before it

Exercise: Give a regular expression matching a dot followed with a character other than a space, another dot, or a closing parenthesis?

Exercise: Find two consecutive spaces (except spaces at the beginning of the line) Hint: to ignore spaces at the beginning of the line, we look for spaces preceded by a character other than space

Exercise: Find ellipsis with only two dots (except at the beginning or end of line)


Any character .

Last, the dot . is like a character class matching any possible character (except newline). So:

'e..e' matches an "e", followed by two random characters, followed by an "e";
'^...$' matches a line containing exactly three characters

Exercise: Find a line whose third character is a digit

Exercise: Find a line containing whose penultimate character is an "a"


Alternative |

We can write a regular expression matching one thing, or another, by using two sub-expressions separated with a vertical bar |.

'cat|dog' matches "cat" or "dog".

The various branches of the alternative are regular expressions and can use all features of regular expressions. For example

'^a|b.b|[aeiouy]$'

matches an "a" at the beginning of the line (that's '^a'), or two "b" separated by a random character ('b.b'), or a vowel at the end of the line ('[aeiouy]$').

Exercise: What does this regular expression: '^/[*#]$|^[*#]/$'


Grouping ( )

If we want to match "little cat" or "little dog", we can of course use 'little cat|little dog', but it would be easier to limit the span of the alternative to the varying part and to avoid duplicated what is common.

To limit the alternative, we can use round brackets to contain the alternative inside a sub-expression:

'little (cat|dog)' matches "little cat" or "little dog".

Let's decompose: it is made of:

'little ' matches "little" followed with a space;
'()' a sub expression enclosed in round brackets, which is:
'cat|dog' i.e. the alternative between 'cat' or 'dog'

We can combine that and create sub-sub expression at will:

'(little|(very |)large) (cat|dog)' matches "little cat", "very large cat", "large dog", etc.

Exercise: Match a year from 1900 to 2099 included

Exercice: What does this regular expression: '<(/|)([ib]|sc)>'


Replacement

So far we've searched for text matching a given pattern. But is it also possible to perform very powerful replacement, by reusing, in the replacement string, a part of the text matched by the regular expression.

We use grouping for that, putting a part of the regular expression within parentheses. The text that was matched by the first sub-expression can be recalled by putting "$1" in the replacement string, and similarly "$2" is the text matched by the second sub-expression, and so on up to "$9".

Example: 'little (cat)' => "large $1" will replace "little cat" with "large cat".

Example: replace "ahle" and "ihle" with "able" and "ible":

'([ai])hle' => '$1ble'

Example: move "</i>" to the left of a closing parenthesis or guillemet

'([»)])(</i>)' => '$2$1'


Note on the syntax: regular expressions are read from the left to the right. Here, we have first a parenthesis opening the sub-expression (, within this sub-expression a character class [»)], then the closing bracket ), etc. A ) appearing within a character class, or protected by a backslash \) does not end a sub-expression.


Example: Remove a space before the usual punctuation symbols except period

' ([,?;:!])' => "$1"

Exercise: Same, including before ")" and "]"

Exercise: Remove a space before a period not part of an ellipsis

Exercise: Replace a space between two digits with " "


(When creating nested groups, the number of the subexpression corresponds to the order in which the opening parentheses are found in the regular expression, when reading it from left to right. So, $1 will contain the text matched by the sub-expression whose parenthesis is the first, and so on. Here is a silly example with the number indicated below:
   (ab(cd|ef)((gh)|i))
   1  2      34
So, assuming this regular expression matches, $2 will contain "cd" or "ef", $3 will contain "gh" or "i", $4 will contain "gh" or nothing (if the 'i' branch was the one which matched in the second alternative), and $1 will contain "abcdgh", "abcdi", "abefgh" or "abefi".)


Example: '(CHAPTER ([1-9]))' => "<h2><a name="ch_$2">$1</a></h2>"

Quantifiers ? * + { }

The quantifiers ?, *, + and { } allow us to specify the number of possible repeats of a part of regular expression. Quantifiers are put immediately after the regular expression atom (character, backslash-protected character, character class, dot, or sub-expression). Here are the meaning of quantifiers:

  • a? means atom a occurring 0 or 1 time
  • a* means atom a occurring 0 or more times
  • a+ means atom a occurring 1 or more times
  • a{n,m} means atom a occurring at least n times, but no more than m times.
  • a{n,} means atom a occurring at least n times
  • a{n} means atom a occurring exactly n times

It's clearer on an example:

The regular expression 'co*l' is composed of three pieces: 'c', 'o*' meaning an "o" repeated zero times or more, and 'l'. It matches the strings "cool", "col" and "cl" in the following text:

«the cool college club members.»

To match only "cool" and "col" we could use 'co+l', or 'coo*l', or 'co{1,2}l', etc.

Note: what quantifiers repeat actually are the regular expressions, not the particular text matched by them. That means that '[AB]{2}' is equivalent to '[AB][AB]' (i.e. one '[AB]' followed by one '[AB]'), and will match "AA", "AB", "BA" and "BB". It's possible to repeat a particular text (for instance, look for 3 consecutive identical letters) but since that's used very rarely I'll mention it in a next lesson.

More examples:

'[a-z]+ +\.*' one or more lowercase letters, followed with one or more space characters, followed optionnally with any number of dots.
a number greater than zero: '[1-9][0-9]*'
four spaces: ' {4}'
<sc> or </sc>: '</?sc>'
a footnote anchor like "[2]", "[13]", etc.: '\[[0-9]+]'
(note: that will also match strange things like "[0]" ou "[0001]")
a line with at least 73 characters: '.{73,}'

Exercise: Find lines indented with an odd number of spaces

Exercise: Replace ellipses containing four dots or more with three dots "...".

A complex example

Let's build a regular expression matching numbers. We shall start with integer numbers first: 0, 1, 25, etc.

The expression

'[0-9]+'

is a good start, but it also matches "00", which we don't want. The way out is to handle separately the number 0, and to force the first digit to be other than zero for the other numbers, i.e. [1-9] followed with any number of digits.

'0|[1-9][0-9]*'

If now we want to add optional decimals: \.[0-9]+' (and not * because we need at least one digit after the period):

'(0|[1-9][0-9]*)(\.[0-9]+)?'

And if we want also to support negative numbers like -2.5? We could put '-?' in front, but it would also match "-0": so, it looks like this:

'(0|-?[1-9][0-9]*)(\.[0-9]+)?|-0\.[0-9]+'

Note that it still matches "-0.00". If we want to exclude that, we would need to force one digit other than zero in the decimals in the "0\." branch:

'(0|-?[1-9][0-9]*)(\.[0-9]+)?|-0\.[0-9]*[1-9][0-9]*'

As a conclusion, to build our regular expression:

  • we specifyed the task in detail,
  • we identified what was matched in excess, to reduce the scope,
  • we cut the problem into smaller parts
  • we coded the small parts in the regular expression syntax
  • and we combined them.

Note also how being too precise can quickly become very complex. With regular expressions, often it's easier and less error-prone to create a simpler expression that will handle 90% of the problem, and do the other 10% cases by hand.


The danger or greedy quantifiers

Quantifiers are greedy: they try very hard to match as much text as possible. In general in the examples above that was not an issue, when we tried to match specific characters like digits or punctuation symbols. But problems arise when using quantifiers combined with arbitrary character classes.

Let's take an example on the following text

"The cat is fat"

and let's look for '(.*)cat(.*)'

We will obtain $1 = "The " and $2 = " is fat".

But if instead we use the following regular expression: '(.*)at(.*)', we obtain $1 = "The cat is f", and $2 = "". This is because the first quantifier .* swallows the biggest possible part of the text, while allowing the entire expression to match.

What happened exactly? Let's simulate what the computer does when trying to match '(.*)at(.*)'

  1. We start at the beginning, on letter "T".
  2. .* is greedy, we try first with swallowing all the text "The cat is fat".
  3. next, we try to match the remainder of the regular expression: we're now looking for an 'a'. Since there is no "a" after the end of the string, we go backwards one step, at letter "t".
  4. "t" can still not be matched by 'a', so we go on backtracking.
  5. after receading on more step, on the "a", the 'a' can match, and the following 't' can also.
  6. Finally the remaining empty string can be matched by '.*', so we're done.


Avoiding problems with greedy quantifiers

When trying merely to look for an entire line, or for something which can only appear once in each line, the greediness is seldom causing problems; but when on tries replacing something, one must absolutely make sure that only the intended text is matched. To do that one must make sure to give constraints to the greedy quantifiers so as to prevent them from matching too much text.

Example: "between <i>one</i> and <i>two</i>." If we replace '<i>(.*)</i>' into "<em>$1</em>", the .* will swallow all the text "one</i> and <i>two" and after replacing the text will be "between <em>one</i> and <i>two</em>."

So, we must find a way to prevent .* from going past the closing "</i>". The easiest way is to remove the "<" from characters allowed in the repeated pattern:

'<i>([^<]*)</i>'

(Note: that doesn't work if there are nested html tags, like <i>...<sc>...</sc>...</i>. One need more advanced constructs for that, and we will discuss that later.)

There was a simpler method, by the way: We could simply replace '(</?)i>' with "$1em>"... or even simpler, replace 'i>' with "em>"!


Exercise: Remove notes beginning with [** and ending with ]

Exercise: Replace ^ followed by letters and digits, with the same letters and ditits surrounded with ^{…} (e.g. replace "^al," with "^{al},")

Exercise: Replace ^{…} with <sup>…</sup>

Exercise: Replace <a name="truc"> with <a name="truc" id="truc">

Summary

A regular expression is a way of specifying a pattern of text in a concise (and poorly legible) syntax.

We saw how to write atomic expressions:

  • by giving a character directly 'a', or preceded with a backslash in case of metacharacters '\.';
  • by using character classes '[a-z]', '[^abc]' and '.';
  • by mentioning the beginning ^ or end $ of line.

and how to combine these atomic expressions:

  • by putting them one after the other 'ab';
  • by giving alternatives 'a|b';
  • by grouping sub-expressions between parentheses ()
  • repeating atoms and sub-expressions using quantifiers * ? + {}

We learned that sub-expressions are numbered, and how to reuse the text matched by a sub-expression when replacing using $1

Last, we learned that quantifiers are greedy, and that one must be careful to prevent them from matching too much text.