Difference between revisions of "Documentation/How Tos/Regular Expressions in Writer"

Revision as of 20:18, 31 October 2007

1 Introduction
2 Where regular expressions may be used in OOo
3 A Simple Example
4 The least you need to know about regular expressions
5 How Regular Expressions are applied in OpenOffice.org
6 Literal characters
7 Special characters
8 Single character match . ?
9 Repeating match + * {m,n}
10 Positional match ^ $ \< \>
11 Alternative matches | [...]
12 POSIX bracket expressions [:alpha:] [:digit:] etc..
13 Full list of OOo regular expressions
14 Troubleshooting and Weird Things in OOo regular expressions

Introduction

In simple terms, regular expressions are a clever way to find and replace text. Regular expressions can be both powerful and complex, and it is easy for inexperienced users to make mistakes. We shall describe only some basics here, to allow an inexperienced user to get started. (More experienced users might find the Troubleshooting section at the end worthwhile.)

A typical use for regular expressions is in finding text in a Writer document; for instance to locate all occurrences of man or woman in your document, you could search using a regular expression which would find both words.

Regular expressions are very common in some areas of computing, and are often known as regex or regexp. Not all regex implementations behave the same way, so reading the relevant manual is sensible.

Where regular expressions may be used in OOo

In Writer:

Edit - Find & Replace dialog

Edit - Changes - Accept/reject command (Filter tab)

In Calc:

Edit - Find & Replace dialog

Data - Filter - Standard filter

Functions, such as SUMIF, LOOKUP

In Base:

Find Record command

The dialogs that appear when you use the above commands generally have an option to use regular expressions. You should check the status of this option each time you bring up the dialog, as it defaults to 'off'.

A Simple Example

If you have little or no experience of regular expressions, you may find it easiest to study them in Writer rather than say Calc.

In Writer, bring up the Find and Replace dialog from the Edit menu.

On the dialog, choose More Options and tick the Regular Expressions box.

In the Search box enter r.d - the dot here means 'any single character'.

Clicking the Find All button will now find all the places where an r is followed by another character followed by a d, for instance 'red' or 'hotrod' or 'bride' or 'your dog' (this last example is r followed by a space followed by d - the space is a character).

If you type xxx into the Replace with box, and click the Replace All button, these become 'xxx', 'hotxxx', 'bxxxe', 'youxxxog'

That may not be very useful, but it shows the principle. We'll continue to use the Find and Replace dialog to explain in more detail.
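The same search and replace can be sketched outside the dialog. The block below uses Python's re module purely as a stand-in for experimenting; it is an assumption of this example, not the engine OpenOffice.org uses, and the two may differ in other cases.

```python
import re

# The four sample texts from the example above.
samples = ["red", "hotrod", "bride", "your dog"]

# "r.d": an r, then any single character, then a d.
pattern = re.compile(r"r.d")

# Replace every match with xxx, as Replace All would.
replaced = [pattern.sub("xxx", text) for text in samples]
print(replaced)  # ['xxx', 'hotxxx', 'bxxxe', 'youxxxog']
```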

The least you need to know about regular expressions

If you don't want to find out exactly how regular expressions work, but just want to get a job done, try the HowTo: Common tasks with Regular Expressions (in preparation)

Otherwise, read on.

How Regular Expressions are applied in OpenOffice.org

OpenOffice.org regular expressions appear to divide the text to be searched into portions and examine each portion separately.

In Writer, text appears to be divided into paragraphs. For example x.*z will not match x at the end of a paragraph with z beginning the next paragraph ( x.*z means x then any or no characters then z). Paragraphs seem to be treated separately (although we discuss some special cases at the end of this HowTo).

In addition Writer considers each table cell and each text frame separately. Text frames are examined after all the other text / table cells on all pages have been examined.
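A rough way to picture this portion-by-portion behaviour is to hold each paragraph as a separate string and search them one at a time. This is only a sketch in Python (an assumed stand-in, with made-up paragraph texts), not a description of how Writer is implemented.

```python
import re

# Hypothetical document: each list item stands for one Writer paragraph.
paragraphs = [
    "the last letter here is x",
    "z starts this one",
    "x marks the z",
]

pattern = re.compile(r"x.*z")

# Search each paragraph on its own, so a match can never span two paragraphs.
matches = []
for paragraph in paragraphs:
    found = pattern.search(paragraph)
    if found:
        matches.append(found.group())

print(matches)  # ['x marks the z']
```

The x that ends the first paragraph and the z that begins the second are never seen together, so only the third paragraph produces a match.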

In the Find & Replace dialog, regular expressions may be used in the Search for box. In general they may not be used in the Replace with box. The exceptions are discussed later.

Literal characters

If your regular expression contains characters other than the so-called 'special characters' . ^ $ * + ? \ [ ( { | then those characters are matched literally.

For example: red matches red redraw and Freddie.

OpenOffice.org allows you to choose whether you care if a character is 'UPPER CASE' or 'lower case'. If you tick the box to 'match case' on the Find and Replace dialog, then red will not match Red or FRED; if you un-tick that box then the case is ignored and both will be matched.
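As an illustration of literal matching with and without 'match case', here is a small Python sketch. The re.IGNORECASE flag is used as an assumed equivalent of un-ticking the box; it belongs to Python, not to OpenOffice.org.

```python
import re

words = ["red", "redraw", "Freddie", "Red", "FRED"]

# 'Match case' ticked: only the exact lowercase letters r, e, d match.
case_sensitive = [w for w in words if re.search(r"red", w)]

# 'Match case' un-ticked: re.IGNORECASE stands in for that setting here.
case_insensitive = [w for w in words if re.search(r"red", w, re.IGNORECASE)]

print(case_sensitive)    # ['red', 'redraw', 'Freddie']
print(case_insensitive)  # ['red', 'redraw', 'Freddie', 'Red', 'FRED']
```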

Special characters

The special characters are . ^ $ * + ? \ [ ( { |

They have special meanings in a regular expression, as we're about to describe.

If you wish to match one of these characters literally, place a backslash '\' before it.

For example: to match $100 use \$100 - the \$ is taken to mean $ .
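The effect of the backslash can be tried out in Python's re module, used here only as an assumed stand-in for the Find & Replace dialog. The sample sentence is made up for the illustration.

```python
import re

text = "The ticket costs $100 today"

# Unescaped, Python reads $ as the end-of-text anchor, so nothing is found.
unescaped = re.search(r"$100", text)

# Escaped, \$ stands for a literal dollar sign.
escaped = re.search(r"\$100", text)

print(unescaped)        # None
print(escaped.group())  # $100
```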

Single character match . ?

The dot '.' special character stands for any single character (except newline).

For example: r.d matches 'red' and 'hotrod' and 'bride' and 'your dog'

The question mark '?' special character means 'match zero or one of the preceding character' - or 'match the preceding character if it is found'.

For example: rea?d matches 'red' and 'read' - 'a?' means 'match a single a if there is one'.
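A quick sketch of the question mark, again in Python's re module as an assumed stand-in. fullmatch is used so that each test word must match from start to finish.

```python
import re

# "a?" means: match a single a if there is one.
optional_a = re.compile(r"rea?d")

results = {w: bool(optional_a.fullmatch(w)) for w in ["red", "read", "reaad"]}
print(results)  # {'red': True, 'read': True, 'reaad': False}
```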

Repeating match + * {m,n}

The plus '+' special character means 'match one or more of the preceding character'.

For example: re+d matches 'red' and 'reed' and 'reeeeed' - e+ means match one or more e's.

The star '*' special character means 'match zero or more of the preceding character'.

For example: rea*d matches 'red' and 'read' and 'reaaaaaaad' - 'a*' means match zero or more a's .

A common use for '*' is after the dot character - ie '.*' which means 'any or no characters'.

Use the star '*' with caution; it will grab everything it can:

For example: 'r.*d' matches 'red' but in Writer if your paragraph is actually 'The referee showed him the red card again' the match found is 'referee showed him the red card' - that is, the first 'r' and the last possible 'd'. Regular expressions are greedy by nature.

You may specify how many times you wish the match to be repeated, with curly brackets { }. For example a{1,4}rgh! will match argh!, aargh!, aaargh! and aaaargh! - in other words between 1 and 4 a's then rgh!. Also note that a{3}rgh! will match precisely 3 a's, ie aaargh!, and a{2,}rgh! will match at least 2 a's, for example aargh! and aaaaaaaargh!.
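The greedy behaviour and the curly-bracket counts can be checked with a short Python sketch. Python's re module is an assumption of this example; it happens to treat these particular operators the same way as described above.

```python
import re

sentence = "The referee showed him the red card again"

# r.*d is greedy: it runs from the first r to the last possible d.
greedy = re.search(r"r.*d", sentence).group()
print(greedy)  # referee showed him the red card

# a{1,4} allows between 1 and 4 a's before rgh!
counted = re.compile(r"a{1,4}rgh!")
candidates = ["argh!", "aargh!", "aaaargh!", "aaaaargh!"]
accepted = [w for w in candidates if counted.fullmatch(w)]
print(accepted)  # ['argh!', 'aargh!', 'aaaargh!']
```

The last candidate has five a's, so it is rejected as a whole word.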

Positional match ^ $ \< \>

The circumflex '^' special character means 'match at the beginning of the text'.

The dollar '$' special character means 'match at the end of the text'.

Remember that OpenOffice.org regular expressions divide up the text to be searched - each paragraph in Writer is examined separately.

For example: ^red matches 'red' at the start of a paragraph (red night shepherd's delight).

For example: red$ matches 'red' at the end of a paragraph (he felt himself go red)

For example: ^red$ matches inside a table cell that contains just 'red'

In addition a hard line break (entered by Shift-Enter) is considered the beginning / end of text, and will allow a ^ or $ match.

The backslash '\' special character gives special meaning to the character pairs '\<' and '\>', namely 'match at the beginning of a word', and 'match at the end of a word'

For example: \<red matches red at the beginning of a word (she went redder than he did).

For example: red\> matches red at the end of a word (although neither of them cared much.)

The test used to define the beginning/end of a word seems to be that the previous/next character is a space, underscore (_), tab, newline, paragraph mark or any non-alphanumeric character.

For example: \<red matches 'person@rediton.com'

For example: red\> matches 'I said, "No-one dared" '
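The positional matches can be approximated in Python, with two assumptions to keep in mind. First, each paragraph is passed in as its own string to mimic the way Writer divides the text. Second, Python's re module has no \< or \>; its \b (word boundary) is used as the nearest equivalent, and unlike the description above it counts the underscore as part of a word.

```python
import re

paragraphs = ["red night shepherd's delight", "he felt himself go red", "red"]

starts = [p for p in paragraphs if re.search(r"^red", p)]
ends = [p for p in paragraphs if re.search(r"red$", p)]
whole = [p for p in paragraphs if re.search(r"^red$", p)]

print(starts)  # ["red night shepherd's delight", 'red']
print(ends)    # ['he felt himself go red', 'red']
print(whole)   # ['red']

# \b approximates \< (start of word) and \> (end of word).
word_start = re.search(r"\bred", "person@rediton.com")
word_end = re.search(r"red\b", 'I said, "No-one dared" ')
mid_word = re.search(r"\bred", "No-one dared")

print(word_start is not None)  # True
print(word_end is not None)    # True
print(mid_word)                # None
```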

Alternative matches | [...]

The pipe character '|' is a special character which allows the expression either side of the '|' to match.

For example: red|blue matches 'red' and 'blue'

The open square brackets character [ is a special character. Characters enclosed in square brackets are treated as alternatives - any one of them may match. You can also include ranges of characters, such as a-z or 0-9, rather than typing in abcdefghijklmnopqrstuvwxyz or 0123456789

For example: r[eo]d matches 'red' and 'rod' but not 'rid'

For example: [m-p]ut matches 'mut' and 'nut' and 'out' and 'put'

For example: [hm-p]ut matches 'hut' and 'mut' and 'nut' and 'out' and 'put'
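These alternatives can be tried in Python's re module, used here as an assumed stand-in for the dialog. The extra test words ('rid', 'cut') and the sample phrase are made up to show a non-match.

```python
import re

# The pipe lets either side match.
either = re.findall(r"red|blue", "red sky, blue sea, green hill")
print(either)  # ['red', 'blue']

# Square brackets: any one of the enclosed characters may match.
vowel = [w for w in ["red", "rod", "rid"] if re.fullmatch(r"r[eo]d", w)]
print(vowel)  # ['red', 'rod']

# A single letter and a range can be combined in one set.
ranged = [
    w
    for w in ["hut", "mut", "nut", "out", "put", "cut"]
    if re.fullmatch(r"[hm-p]ut", w)
]
print(ranged)  # ['hut', 'mut', 'nut', 'out', 'put']
```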

Special characters within alternative match square brackets do not have the same special meanings. The only characters which do have special meanings are ], -, ^ and \, and the meanings are:

] - a closing square bracket ends the alternative match set [abcdef]
- - a hyphen indicates a range of characters, as we've seen, eg [0-9]
^ - the caret negates the character or character range. For example [^a-f] matches any character except abcdef.
\ - the backslash is used to allow ], -, ^ and \ to be used literally in square brackets. For example, \] stands for a literal closing square bracket, so [[\]a] will match an opening square bracket [, a closing square bracket ] or an a. \\ stands for a literal backslash.

Just to re-emphasise: these are the meanings of these characters inside square brackets, and any other characters are treated literally. For example [\t ] will match a backslash \, a 't' or a space - not a tab or a space.
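Negation and the escaped closing bracket can be sketched in Python's re module, with one caveat: Python is only an assumed stand-in, and it does not share the [\t ] behaviour just described, because Python reads \t inside square brackets as a tab.

```python
import re

# [^a-f] matches any character except a, b, c, d, e or f.
not_a_to_f = re.findall(r"[^a-f]", "badge")
print(not_a_to_f)  # ['g']

# \] is a literal closing bracket inside the set, so [\]a] matches ] or a.
bracket_or_a = re.findall(r"[\]a]", "x[a]y")
print(bracket_or_a)  # ['a', ']']

# Difference from the text above: in Python, [\t ] means tab or space.
python_tab = re.findall(r"[\t ]", "a\tb c")
print(python_tab)  # ['\t', ' ']
```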

POSIX bracket expressions [:alpha:] [:digit:] etc..

There is much confusion in the OpenOffice.org community about these. The Help itself is also far from clear.

There are a number of 'POSIX bracket expressions' (sometimes called 'POSIX character classes') available in OpenOffice.org regular expressions, of the form [:classname:] which allow a match with any of the characters in that class. For instance [:digit:] stands for any of the digits 0123456789.

These (by definition) may only appear inside the square brackets of an alternative match - so a valid syntax would be [abc[:digit:]], which should match a, b, c, or any digit 0-9. A correct syntax to match just any one digit would be [[:digit:]].

Unfortunately this does not work as it should! The correct syntax does not work at all, but currently an incorrect syntax ([:digit:]) will actually match a digit, as long as it is outside the square brackets of an alternative match. Obviously this is unsatisfactory, and is the subject of issue 64368.

The POSIX bracket expressions available are listed below. Note that the exact definition of each depends on locale - for example in a different language other characters may be considered 'alphabetic letters' in [:alpha:]. The meanings given here apply generally to English-speaking locales (and do not take into account any Unicode issues).

[:digit:]: stands for any of the digits 0123456789. This is equivalent to 0-9.

[:space:]: should stand for any whitespace character, including tab; however as currently implemented it stands simply for a space character. Note that the Help is currently misleading here.

[:print:]: should stand for any printable character; however as currently implemented it does not match the single quote nor the double quote characters ‘ ’ “ ” (and some others such as « »). It matches space, but does not match tab (this latter is expected/defined behaviour).

[:cntrl:]: stands for a control character. As far as a user is concerned, OpenOffice.org documents have very few control characters; tab and hard_line_break are both matched, but paragraph_mark is not.

[:alpha:]: stands for a letter (including a letter with an accent). For example in the phrase (often used in English, and here given with accents as in the original language) 'déjà vu' all 6 letters will match.

[:alnum:]: stands for a character that satisfies either [:alpha:] or [:digit:]

[:lower:]: stands for a lowercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

[:upper:]: stands for an uppercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

There seems to be little consistency in any implementation of POSIX bracket expressions (OOo or elsewhere). One approach is simply to use straightforward character classes - so instead of [[:digit:]] you use [0-9] for example.
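The plain character-class approach is easy to test. The sketch below uses Python's re module, which is an assumption of this example and does not support the [:digit:] style at all, so explicit classes are the only option there. The pattern used for letters, [^\W\d_], is a Python idiom for 'a word character that is neither a digit nor an underscore'; it is not OpenOffice.org syntax.

```python
import re

# [0-9] in place of [[:digit:]].
digits = re.findall(r"[0-9]", "Room 42b")
print(digits)  # ['4', '2']

# A Python stand-in for [:alpha:] that also accepts accented letters.
# The sample is 'deja vu' written with its accents as Unicode escapes.
letters = re.findall(r"[^\W\d_]", "d\u00e9j\u00e0 vu")
print(len(letters))  # 6
```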

Full list of OOo regular expressions

We've discussed some of the things you can do with regular expressions.

Help - OpenOffice.org Help has a full list; however please see Troubleshooting below for some caveats...

Troubleshooting and Weird Things in OOo regular expressions

If you are new to regular expressions, please realise that they can be tricky - if you are not getting the results you expect, you might need to check that you understand well enough. Try to keep regular expressions as simple and unambitious as possible.

On the other hand, there are some 'points of interest' with OOo regular expressions that may surprise experienced users.

There are currently no 'sub-matches' in OOo regular expressions replacements. Most other regex allow a syntax like \1 and \2 to manipulate the first / second group matches found. The target for implementation of sub-matches is OOo2.4. A workaround until then is to use Find all, then immediately use Find/Replace again in the selection. This may (or may not) allow you to do what you want.
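For readers who have not met sub-matches, this is what the \1 and \2 syntax looks like in another engine. The sketch uses Python's re module; it shows the feature that is described above as missing, not something the Find & Replace dialog can do at the time of writing.

```python
import re

# Each pair of parentheses captures a group; \1 and \2 refer back to them.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped)  # world hello
```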

The OOo regular expression behaviour when matching paragraph marks and hard line breaks is 'unusual'. This is partly because regular expressions in other software usually deal with ordinary plain text, whereas OOo regular expressions divide the text at paragraph marks and hard line breaks. For whatever reason, this is what you can do:
- $ on its own will match a paragraph mark - and can be replaced by say a space, or indeed nothing, in order to merge two paragraphs together. Note that red$ will match red at the end of a paragraph, and if you replace it with say a space, you simply get a space where red was - and the paragraphs are unaffected - the paragraph mark is not replaced.
- ^$ will match an empty paragraph, which can be replaced by say nothing, in order to remove the empty paragraph. Note that ^red$ matches a paragraph with only 'red' in it - replacing this with nothing leaves an empty paragraph - the paragraph marks at either end are not replaced.
- \n will match a hard line break (Shift-Enter) if it is entered in the Search box. In this context it is simply treated like a character, and can be replaced by say a space, or nothing. The regular expression red\n will match red followed by a hard line break character - and if replaced simply by say blue the hard line break will also be replaced. The regular expression red$ will match red followed by a hard line break. In this case, replacing with blue will only replace red - and will leave the hard line break intact.
- red\ngreen will match red followed by a hard line break followed by green; replacing with say brown will remove the hard line break. However neither red.green nor red.*green will match here!
- If you wish to replace every hard line break with a paragraph mark, first search for \n with Find All to select the hard line breaks. Then enter \n in the Replace with box, where it stands for a paragraph mark, and choose Replace All. This is somewhat bizarre, but at least now you know. Note that \r is interpreted as a literal 'r', not a carriage return.

@@ Line 102: / Line 102: @@
-== Single character match (.  ?) ==
+== Single character match .  ? ==
 The dot '<b>.</b>' special character stands for any single character (except newline).
@@ Line 115: / Line 115: @@
-== Repeating match (+  *) ==
+== Repeating match +  *  {m,n} ==
 The plus <b>'+'</b> special character means 'match one or more of the preceding character'.
@@ Line 136: / Line 136: @@
-== Positional match (^  $  \<  \>) ==
+You may specify how many times you wish the match to be repeated, with curly brackets <b>{ }</b>. For example <b>a{1,4}rgh!</b> will match <b>argh!</b>, <b>aargh!</b>, <b>aaargh!</b> and <b>aaaargh!</b> - in other words between 1 and 4 <b>a</b>'s then <b>rgh!</b>. Also note that <b>a{3}rgh!</b> will match precisely 3 <b>a</b>'s, ie <b>aaargh!</b>, and  <b>a{2,}rgh!</b> will match at least 2 <b>a</b>'s <b>a</b>'s, for example <b>aargh!</b> and <b>aaaaaaaargh!</b>.
+== Positional match ^  $  \<  \> ==
 The circumflex <b>'^'</b> special character means 'match at the beginning of the text'.
@@ Line 170: / Line 175: @@
-== Alternative matches ( |  [...] ) ==
+== Alternative matches  |  [...]  ==
 The pipe character '<b>|</b>' is a special character which allows the expression either side of the '<b>|</b>' to match.
@@ Line 196: / Line 201: @@
 Just to re-emphasise: these are the meanings of these characters inside square brackets, and any other characters are treated literally. For example <b>[\t ]</b> will match a backslash <b>\</b>, a '<b>t</b>' or a space - <b>not</b> a tab or a space.
-== POSIX bracket expressions ([:alpha:] [:digit:] etc.. ) ==
+== POSIX bracket expressions [:alpha:] [:digit:] etc..  ==
 There is much confusion in the OpenOffice.org community about these. The Help itself is also far from clear.
@@ Line 249: / Line 255: @@
 ** '''\n''' will match a hard line break (Shift-Enter) if it is entered in the Search box. In this context it is simply treated like a character, and can be replaced by say a '''space''', or nothing. The regular expression '''red\n''' will match '''red''' followed by a hard line break character - and if replaced simply by say '''blue''' the hard line break will also be replaced. The regular expression '''red$''' will match '''red''' followed by a hard line break. In this case, replacing with '''blue''' will only replace '''red''' - and will leave the hard line break intact.
 ** '''red\ngreen''' will match '''red''' followed by a hard line break followed by '''green'''; replacing with say '''brown''' will remove the hard line break. However neither '''red.green''' nor '''red.*green''' will match here!
-** If you wish to replace every hard line break with a paragraph mark, firstly you will search for '''\n''' with Find All to select the hard line breaks. Then in the Replace box you enter '''\n, '''which in the Replace box stands for a paragraph mark and choose Replace All. This is somewhat bizarre, but at least now you know. If you wished to replace something with a hard line break, you could presumably enter it as a \xXXXX hexadecimal code. Note that \r is interpreted as a literal 'r', not a carriage return.<br/>
+** If you wish to replace every hard line break with a paragraph mark, firstly you will search for '''\n''' with Find All to select the hard line breaks. Then in the Replace box you enter '''\n, '''which in the Replace box stands for a paragraph mark and choose Replace All. This is somewhat bizarre, but at least now you know. Note that \r is interpreted as a literal 'r', not a carriage return.<br/>