Introduction to Regular Expressions
- 1 Introduction
- 2 Special Characters
- 3 Groups and References
- 4 Other Expressions
- 5 Regular Expressions on Writer
- 6 Regular Expressions on Calc
- 7 Regular Expressions on Impress
- 8 More information
For example, one expression will match any digit, another will match an arbitrary integer, and a third will match an arbitrary number that may or may not have a decimal part separated by a dot.
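The three cases just mentioned can be tried out with a small sketch. It uses Python's re module as a stand-in, which is not the ICU engine used by Apache OpenOffice, and the three patterns are assumptions chosen for illustration rather than the article's own expressions.

```python
import re

# Assumed illustrative patterns (not taken from the article itself).
DIGIT = r"[0-9]"                # any single digit
INTEGER = r"[0-9]+"             # one or more digits
NUMBER = r"[0-9]+(\.[0-9]+)?"   # digits, optionally followed by a dot and more digits


def matches_fully(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


print(matches_fully(DIGIT, "7"))
print(matches_fully(INTEGER, "2024"))
print(matches_fully(NUMBER, "3.14"))
print(matches_fully(NUMBER, "42"))
```

The helper checks the whole string, whereas the Search and Replace dialogue looks for the pattern anywhere in the text.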
As these examples show, several characters have a special meaning in regular expressions (RegExp). This article describes the main features of the ICU-based RegExp engine used by Apache OpenOffice.
Special Characters
As previously mentioned, several characters have a special meaning inside regular expressions: all kinds of brackets, the backslash, the dot, the dollar sign, and so on. To search for one of these special characters literally, we need to "escape" it with a backslash. For example, to search for [ you need to write \[ and, of course, to search for the backslash itself, two backslashes must be written: \\
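As a minimal sketch of escaping, the following uses Python's re module, an assumption made for illustration since it is not the ICU engine. The sample text is hypothetical.

```python
import re

# Hypothetical sample text containing a square bracket and a backslash.
text = r"item[3] lives in C:\temp"

# A backslash before [ makes the engine look for a literal bracket.
bracket = re.search(r"\[", text)

# Two backslashes in the pattern look for one literal backslash.
backslash = re.search(r"\\", text)

print(bracket.group(), bracket.start())
print(backslash.group(), backslash.start())
```

Without the escaping backslash, an unmatched [ would be read as the start of a character class and the pattern would be rejected.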
The Dot .
The dot matches any single character. For example, d.t will match dat, dot, dut... even dXt.
is equivalent to the dot.
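The single-character wildcard can be checked with a short sketch. Python's re module is used as a stand-in for the ICU engine, and the pattern d.t is assumed from the matches listed above.

```python
import re

candidates = ["dat", "dot", "dXt", "dt", "doot"]

# The dot stands for exactly one character, so "dt" and "doot" are rejected.
matched = [word for word in candidates if re.fullmatch(r"d.t", word)]

print(matched)
```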
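The Question Mark ?
The question mark matches zero or one instances of the previous character. For example, math? will match either mat or math, but not the s in mats.
The Plus Sign +
The plus sign matches one or more instances of the previous character.
The Asterisk *
The asterisk matches an arbitrary number of instances of the previous character, even zero: math* will match mat, math, mathhhhhh...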
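The three kinds of repetition described above (zero or one, one or more, any number) can be compared side by side. This is a sketch using Python's re module rather than the ICU engine; the pattern math+ is an assumption added for the "one or more" case, which has no example in the text.

```python
import re

words = ["mat", "math", "mathhh"]


def full_matches(pattern, candidates):
    """Return the candidates that match `pattern` from start to end."""
    return [word for word in candidates if re.fullmatch(pattern, word)]


zero_or_one = full_matches(r"math?", words)   # the h is optional
one_or_more = full_matches(r"math+", words)   # at least one h (assumed example)
any_number = full_matches(r"math*", words)    # any number of h, even none

print(zero_or_one)
print(one_or_more)
print(any_number)
```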
Curly Brackets {}
Curly brackets set the number of times the previous character must be repeated. For example, ou{2,5}ch! will match ouuch!, ouuuch!, ouuuuch!, ouuuuuch!, but not ouch!
Using only one number matches exactly that number of repetitions: ou{3}ch! will only find ouuuch!, while ou{3,}ch! will find the word with at least three u.
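The repetition counts can be verified with a short sketch. It assumes Python's re module in place of the ICU engine, and the patterns are reconstructed from the matches described above.

```python
import re

# Build "ouch!" variants with 1, 2, 3, 5 and 6 letters u.
shouts = ["o" + "u" * count + "ch!" for count in (1, 2, 3, 5, 6)]


def full_matches(pattern):
    """Return the variants that match `pattern` from start to end."""
    return [shout for shout in shouts if re.fullmatch(pattern, shout)]


between_two_and_five = full_matches(r"ou{2,5}ch!")
exactly_three = full_matches(r"ou{3}ch!")
at_least_three = full_matches(r"ou{3,}ch!")

print(between_two_and_five)
print(exactly_three)
print(at_least_three)
```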
The Caret ^
The ^ sign has several meanings depending on the context. On the one hand, it anchors the match to the beginning of a paragraph. For example, in a paragraph like
The text on the book
the expression ^The will match the first The, but not the second.
Inside square brackets, the ^ negates the listed characters. For example, [^a] will match any character that is not an a, while [^ar] will match any character that is neither an a nor an r.
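Both roles of the ^ sign can be sketched as follows. Python's re module stands in for the ICU engine, a single Python string plays the role of one paragraph, and the IGNORECASE flag is an assumption that mimics a search with "Match case" unchecked.

```python
import re

paragraph = "The text on the book"

# Anchored: only the occurrence at the very beginning is found.
anchored = [m.start() for m in re.finditer(r"^The", paragraph, re.IGNORECASE)]

# Not anchored: both occurrences are found.
unanchored = [m.start() for m in re.finditer(r"The", paragraph, re.IGNORECASE)]

# Inside square brackets the ^ negates the listed characters.
not_a = re.findall(r"[^a]", "banana")
not_a_or_r = re.findall(r"[^ar]", "radar")

print(anchored, unanchored)
print(not_a, not_a_or_r)
```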
The Dollar Sign $
This symbol also has different meanings in different contexts.
Used alone, it finds paragraph breaks, but placed after other characters it finds those characters at the end of a paragraph. For example, in the paragraph
Your book is under this book
the expression book$ will find the second instance of book, but not the first one. In the Replace by box, the $ sign has a different meaning, which we will see below.
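The end-of-paragraph anchor can be illustrated with a short sketch. Python's re module is assumed here instead of the ICU engine, and one Python string stands for one paragraph.

```python
import re

paragraph = "Your book is under this book"

# Every occurrence of the word.
all_positions = [m.start() for m in re.finditer(r"book", paragraph)]

# Only the occurrence that sits at the end of the paragraph.
end_positions = [m.start() for m in re.finditer(r"book$", paragraph)]

print(all_positions)
print(end_positions)
```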
The Backslash \
As mentioned before, the backslash can be used to escape special symbols in order to search for them, but it can also give a different meaning to normal characters. For example, \b will search for a word boundary: \bjus will match the first three characters of just, justice, justly... but will not match adjust, while mon\b will find the last three characters of common, summon... but will not match monitor.
will find any character inside a word.
\t will find a tab stop.
\n will find a line break, but in the Replace by box it will insert a paragraph break (the same one you find with $).
The combination \u followed by a four-digit hexadecimal number can be used to search for a specific character by its Unicode code. For example, \u03B4 will find the Greek character delta δ
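The backslash sequences of this section can be tried out with the sketch below. It assumes Python's re module, which is not the ICU engine but treats word boundaries, tab stops and four-digit Unicode escapes in the same way for these inputs. The patterns are reconstructed from the behaviour described above.

```python
import re

# Word boundary before "jus": matches at the start of a word only.
starts_with_jus = [w for w in ["just", "justice", "adjust"]
                   if re.search(r"\bjus", w)]

# Word boundary after "mon": matches at the end of a word only.
ends_with_mon = [w for w in ["common", "summon", "monitor"]
                 if re.search(r"mon\b", w)]

# A tab stop between two fields.
tab_position = re.search(r"\t", "name\tvalue").start()

# Greek small delta by its Unicode code; the capital delta is not matched.
greek_text = "\u0394x and \u03b4x"
delta = re.search(r"\u03B4", greek_text)

print(starts_with_jus, ends_with_mon)
print(tab_position, delta.start())
```

The different meaning of the line break sequence in the replace box is specific to Apache OpenOffice and is not reproduced by this sketch.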
The Vertical Bar |
The vertical bar separates alternatives. For example, mo(t|r)e will find either mote or more. It can also be used with more than two options, like (a|b|c).
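A short sketch of alternation follows, using Python's re module as a stand-in for the ICU engine. The sample sentence is hypothetical, and (?:...) is a non-capturing group used only so that findall returns whole matches. The last two lines show why alternatives belong in parentheses rather than in square brackets, where the vertical bar is an ordinary character.

```python
import re

sentence = "one mote, one more, one mode"

# Alternatives written out in full and with the common part factored out.
written_out = re.findall(r"mote|more", sentence)
factored = re.findall(r"mo(?:t|r)e", sentence)

# In a character class the bar is literal, so it is matched too.
in_brackets = re.findall(r"[a|b|c]", "a|d")
in_parentheses = re.findall(r"(?:a|b|c)", "a|d")

print(written_out, factored)
print(in_brackets, in_parentheses)
```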
Groups and References
Grouping expressions with parentheses makes it possible to refer back to each group later in the same regular expression with a backslash followed by a number: \1 refers to the first group, \2 to the second, and so on.
For example, the regular expression \b(\w+) *\1\b (note the space between (\w+) and the *) can be used to find repeated words: both \b expressions find a word boundary, \w+ finds one or more word characters, and \1 matches the same text that was found by the group (\w+).
To refer to the text found by a group in the Replace by box, use the dollar sign followed by the group number instead of the backslash.
For example, searching for \b(\w+) *\1\b and replacing by $1 will eliminate any duplicated word in the text.
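The duplicated-word search and replace can be sketched as follows. Python's re module is assumed in place of the ICU engine; note that Python writes the group reference in the replacement as \1, where the Replace by box uses the dollar sign. The search pattern is reconstructed from the description above, and STRICT is a hypothetical variant added for comparison.

```python
import re

# Pattern reconstructed from the description: word, optional spaces, same word.
DUPLICATE = r"\b(\w+) *\1\b"

# Hypothetical stricter variant that requires at least one space.
STRICT = r"\b(\w+) +\1\b"


def drop_duplicates(text, pattern=DUPLICATE):
    """Replace each repeated word by a single copy of it."""
    return re.sub(pattern, r"\1", text)


cleaned = drop_duplicates("this is is a test test")
print(cleaned)

# With " *" a word made of two equal halves also counts as a repetition.
print(drop_duplicates("mama"))
print(drop_duplicates("mama", STRICT))
```

Because the space is followed by *, zero spaces are allowed, so a word such as mama is shortened as well; requiring at least one space avoids this.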
Other Expressions
will find any ASCII character, but not accented characters or numbers; for that you need to use
or the dot, as seen before.
\d will find a single digit.
\s will find any kind of space, even non-breaking ones.
The dash can be used inside square brackets to set ranges. For example, [0-8] will find any digit but 9.
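Digits, spaces and ranges can be checked with the sketch below. It uses Python's re module rather than the ICU engine, and the shorthand classes \d and \s are assumed to be the expressions meant in the text.

```python
import re

# A single digit at a time.
digits = re.findall(r"\d", "a1b22")

# Any kind of space: plain space, non-breaking space (U+00A0) and tab.
spaces = re.findall(r"\s", "a b\u00a0c\td")

# A range inside square brackets: every digit except 9.
up_to_eight = re.findall(r"[0-8]", "7890")

print(digits)
print(len(spaces))
print(up_to_eight)
```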
In the Replace by box, & will insert the same string found by the search expression.
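Reusing the whole match in the replacement can be sketched in Python, assuming the re module as a stand-in. Python spells the reference to the whole match as \g<0>, which is its own notation and not the one used in the Replace by box. The sample text is hypothetical.

```python
import re

# Wrap every number in square brackets by reusing the whole match.
wrapped = re.sub(r"\d+", r"[\g<0>]", "rooms 12 and 345")

print(wrapped)
```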
Regular Expressions on Writer
Regular Expressions on Calc
Search and Replace with Search Dialogue
Regular Expressions on Calc Formulas
Regular Expressions on Impress