Regular Expressions in Writer

Introduction

Regular expressions are, in short, a clever way to find and replace text (similar to 'wildcards'). They can be both powerful and complex, and it is easy for an inexperienced user to make mistakes. This HowTo describes the use of regular expressions in OpenOffice.org, aiming to be clear enough for novices while detailing the aspects that can confuse more experienced users.

A typical use for regular expressions is finding text in a Writer document; for instance, to locate all occurrences of man or woman in your document, you could search using a single regular expression that finds both words.

Regular expressions are very common in some areas of computing, and are often known as regex or regexp. Not all regex dialects are written the same way - so reading the relevant manual is a sensible choice.

Where regular expressions may be used in OOo

In Writer:

Edit - Find & Replace dialog

Edit - Changes - Accept or Reject command (Filter tab)

In Calc:

Edit - Find & Replace dialog

Data - Filter - Standard Filter and Advanced Filter

Certain functions, such as SUMIF and LOOKUP

In Base:

Find Record command

The dialogs that appear when you use these commands generally offer an option to use regular expressions (the option is off by default). For example:

[Figure: position of the regular expressions check box]

Check the state of the regular expressions option each time you open the dialog, because it defaults to off.

A simple example

If you have little experience of regular expressions, you may find it easiest to practise in Writer rather than in Calc.

In Writer, open the Find & Replace dialog from the Edit menu.

In the dialog, choose More Options and tick the Regular expressions box.

In the Search for box, enter s.g - the dot means 'any single character'.

Clicking the Find All button highlights every place where an s is followed by any one character which is in turn followed by a g, for example in 'sig', 'signore', 'segugio' or 's giovanni' (in the last example the s is followed by a space, which is followed by a g - a space is a character too).

Typing xxx in the Replace with box and clicking the Replace All button turns these into 'xxx', 'xxxnore', 'xxxugio' and 'xxxiovanni'.

That may not seem very useful, but it shows the principle on which regular expressions work. The following sections explain them further, using more Find & Replace examples.
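The walk-through above can also be tried outside Writer. The sketch below uses Python's re module, which is an assumption of this example and not the engine OpenOffice.org uses; for a pattern as simple as s.g the two should behave alike.

```python
import re

# Sample words from the walk-through above
samples = ["sig", "signore", "segugio", "s giovanni"]

# s.g : an 's', then any single character, then a 'g'
pattern = re.compile(r"s.g")

# Rough equivalent of 'Replace All' with xxx in the 'Replace with' box
replaced = [pattern.sub("xxx", s) for s in samples]
print(replaced)
```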

The least you need to know about regular expressions

If you do not want to learn exactly how regular expressions work, but just want to use them, these examples may help. Enter them in the 'Search for' box, and make sure regular expressions are enabled.

colore|colori finds colore and colori
sep.rate finds sep followed by any one character and then rate - for example separate, seperate, and also sepXrate
sep[ae]rate finds separate and seperate - [ae] means that either a or e will match
sapere? finds saper and sapere - the e is optional because it is followed by a question mark
s\> finds an s at the end of a word
\<. finds the first character of a word
^. finds the first character of a paragraph
^$ finds an empty paragraph

How regular expressions are applied in OpenOffice.org

OpenOffice.org regular expressions divide the text to be searched into portions and examine each portion separately.

In Writer, the text is divided into paragraphs. For example, x.*z will not match an x at the end of one paragraph with a z at the beginning of the next (x.*z means x, then any or no characters, then z). Paragraphs seem to be treated separately (although we discuss some special cases at the end of this HowTo).

In addition Writer considers each table cell and each text frame separately. Text frames are examined after all the other text / table cells on all pages have been examined.

In the Find & Replace dialog, regular expressions may be used in the Search for box. In general they may not be used in the Replace with box. The exceptions are discussed later.

Literal characters

If your regular expression contains characters other than the so-called 'special characters' . ^ $ * + ? \ [ ( { | then those characters are matched literally.

For example: red matches red redraw and Freddie.

OpenOffice.org allows you to choose whether you care if a character is 'UPPER CASE' or 'lower case'. If you tick the box to 'match case' on the Find and Replace dialog, then red will not match Red or FRED; if you un-tick that box then the case is ignored and both will be matched.

Special characters

The special characters are . ^ $ * + ? \ [ ( { |

They have special meanings in a regular expression, as we're about to describe.

If you wish to match one of these characters literally, place a backslash '\' before it.

For example: to match $100 use \$100 - the \$ is taken to mean $ .
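As an illustration of escaping, here is a minimal sketch in Python's re module (an assumption of this example; it is not the OpenOffice.org engine, but it treats the backslash and the dollar sign in the same way for this case).

```python
import re

text = "The ticket costs $100, not 100 dollars."

# \$ is a literal dollar sign
escaped = re.search(r"\$100", text)

# An unescaped $ means 'end of text', so nothing can follow it
unescaped = re.search(r"$100", text)

# re.escape adds the backslashes for you (Python-specific helper)
auto = re.escape("$100")

print(escaped.group(0), unescaped, auto)
```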

Single character match . ?

The dot '.' special character stands for any single character (except newline).

For example: r.d matches 'red' and 'hotrod' and 'bride' and 'your dog'

The question mark '?' special character means 'match zero or one of the preceding character' - or 'match the preceding character if it is found'.

For example: rea?d matches 'red' and 'read' - 'a?' means 'match a single a if there is one'.

Special characters can be used in combination with each other. A dot followed by a question mark means 'match zero or one of any single character'.

For example: star.?ing matches 'staring', 'starring', 'starting', and 'starling', but not 'startling'
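The dot and question mark examples above can be checked with a short sketch. It uses Python's re module as a stand-in (an assumption; OpenOffice.org uses its own engine), and the helper name matching is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that contain a match for pattern."""
    return [c for c in candidates if re.search(pattern, c)]

# r.d : r, any single character, d
dot = matching(r"r.d", ["red", "hotrod", "bride", "your dog", "road"])

# rea?d : the a is optional
optional = matching(r"rea?d", ["red", "read", "reaad"])

# star.?ing : at most one character between 'star' and 'ing'
combined = matching(
    r"star.?ing",
    ["staring", "starring", "starting", "starling", "startling"],
)

print(dot, optional, combined)
```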

Repeating match + * {m,n}

The plus '+' special character means 'match one or more of the preceding character'.

For example: re+d matches 'red' and 'reed' and 'reeeeed' - e+ means match one or more e's.

The star '*' special character means 'match zero or more of the preceding character'.

For example: rea*d matches 'red' and 'read' and 'reaaaaaaad' - 'a*' means match zero or more a's .

A common use for '*' is after the dot character - ie '.*' which means 'any or no characters'.

For example: rea.*d matches 'read' and 'reaXd' and 'reaYYYYd' but not 'red' or 'reXd'

Use the star '*' with caution; it will grab everything it can:

For example: 'r.*d' matches 'red' but in Writer if your paragraph is actually 'The referee showed him the red card again' the match found is 'referee showed him the red card' - that is, the first 'r' and the last possible 'd'. Regular expressions are greedy by nature.
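A short sketch of the + and * examples, including the greedy match. Python's re module is used here as an assumption for illustration; it is greedy by default in the same way, but it is not the OpenOffice.org engine.

```python
import re

def whole_matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

plus = whole_matches(r"re+d", ["rd", "red", "reed", "reeeeed"])
star = whole_matches(r"rea*d", ["red", "read", "reaaaaaaad", "rexd"])
dot_star = whole_matches(r"rea.*d", ["read", "reaXd", "reaYYYYd", "red", "reXd"])

# Greedy: from the first 'r' to the last possible 'd' in the paragraph
paragraph = "The referee showed him the red card again"
greedy = re.search(r"r.*d", paragraph).group(0)

print(plus, star, dot_star, greedy)
```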

You may specify how many times you wish the match to be repeated, with curly brackets { }. For example a{1,4}rgh! will match argh!, aargh!, aaargh! and aaaargh! - in other words between 1 and 4 a's then rgh!.

Also note that a{3}rgh! will match precisely 3 a's, ie aaargh!, and a{2,}rgh! (with a comma) will match at least 2 a's, for example aargh! and aaaaaaaargh!.
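The curly bracket counts can be sketched as follows, again assuming Python's re module as a stand-in for the OpenOffice.org engine. fullmatch is used so that each word must match from start to end.

```python
import re

words = ["rgh!", "argh!", "aargh!", "aaargh!", "aaaargh!", "aaaaargh!"]

# Between 1 and 4 a's
one_to_four = [w for w in words if re.fullmatch(r"a{1,4}rgh!", w)]

# Exactly 3 a's
exactly_three = [w for w in words if re.fullmatch(r"a{3}rgh!", w)]

# At least 2 a's
at_least_two = [w for w in words if re.fullmatch(r"a{2,}rgh!", w)]

print(one_to_four, exactly_three, at_least_two)
```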

Positional match ^ $ \< \>

The circumflex '^' special character means 'match at the beginning of the text'.

The dollar '$' special character means 'match at the end of the text'.

Remember that OpenOffice.org regular expressions divide up the text to be searched - each paragraph in Writer is examined separately.

For example: ^red matches 'red' at the start of a paragraph (red night shepherd's delight).

For example: red$ matches 'red' at the end of a paragraph (he felt himself go red)

For example: ^red$ matches inside a table cell that contains just 'red'

In addition a hard line break (entered by Shift-Enter) is considered the beginning / end of text, and will allow a ^ or $ match.

The backslash '\' special character gives special meaning to the character pairs '\<' and '\>', namely 'match at the beginning of a word', and 'match at the end of a word'

For example: \<red matches red at the beginning of a word (she went redder than he did).

For example: red\> matches red at the end of a word (although neither of them cared much.)

The test used to define the beginning/end of a word seems to be that the previous/next character is a space, underscore (_), tab, newline, paragraph mark or any non-alphanumeric character.

For example: \<red matches 'person@rediton.com'

For example: red\> matches 'I said, "No-one dared" '
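The positional matches can be approximated in Python, with several assumptions stated plainly: each paragraph is modelled as a separate string, a hard line break is modelled as "\n" with the MULTILINE flag, and Python's \b is used in place of \< and \> because Python has no separate start-of-word and end-of-word anchors. Unlike the behaviour described above, Python treats an underscore as part of a word.

```python
import re

# One string per paragraph, mirroring how Writer examines paragraphs separately
paragraphs = ["red night shepherd's delight", "he felt himself go red", "red"]

starts = [bool(re.search(r"^red", p)) for p in paragraphs]
ends = [bool(re.search(r"red$", p)) for p in paragraphs]
whole = [bool(re.search(r"^red$", p)) for p in paragraphs]

# Hard line break modelled as "\n"; MULTILINE lets $ match just before it
line_break = bool(re.search(r"red$", "he went red\nthen pale", re.MULTILINE))

# \b approximates \< (before the word) and \> (after the word)
word_start = bool(re.search(r"\bred", "person@rediton.com"))
word_end = bool(re.search(r"red\b", 'I said, "No-one dared" '))
inside_word = bool(re.search(r"\bred", "cared"))

print(starts, ends, whole, line_break, word_start, word_end, inside_word)
```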

Alternative matches | [...]

The pipe character '|' is a special character which allows the expression either side of the '|' to match.

For example: red|blue matches 'red' and 'blue'

Unfortunately, certain expressions are not evaluated when used after a pipe. This is so far known to affect ^ and backreferences, and is the subject of issue 46165.

For example: ^red|blue matches paragraphs beginning with 'red' and any occurrence of 'blue', but blue|^red incorrectly matches only any occurrence of 'blue', failing to match paragraphs beginning with 'red'
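For comparison, the sketch below shows the result one would expect from both orderings. It uses Python's re module, which is an assumption of this example and not the OpenOffice.org engine, so it illustrates the expected behaviour and does not reproduce the issue. Each paragraph is modelled as a separate string.

```python
import re

paragraphs = ["red sky at night", "the blue sea", "a red rose"]

# Anchored alternative first
anchor_first = [bool(re.search(r"^red|blue", p)) for p in paragraphs]

# Anchored alternative after the pipe: same result expected
anchor_last = [bool(re.search(r"blue|^red", p)) for p in paragraphs]

print(anchor_first, anchor_last)
```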

The open square brackets character [ is a special character. Characters enclosed in square brackets are treated as alternatives - any one of them may match. You can also include ranges of characters, such as a-z or 0-9, rather than typing in abcdefghijklmnopqrstuvwxyz or 0123456789

For example: r[eo]d matches 'red' and 'rod' but not 'rid'

For example: [m-p]ut matches 'mut' and 'nut' and 'out' and 'put'

For example: [hm-p]ut matches 'hut' and 'mut' and 'nut' and 'out' and 'put'

Special characters within alternative match square brackets do not have the same special meanings. The only characters which do have special meanings are ], -, ^ and \, and the meanings are:

] - a closing square bracket ends the alternative match set [abcdef]
- - a hyphen indicates a range of characters, as we've seen, eg [0-9]
^ - if the caret is the first character in the square brackets, it negates the search. For example [^a-dxyz] matches any character except abcdxyz.
\ - the backslash is used to allow ], -, ^ and \ to be used literally in square brackets, and to allow hexadecimal codes. For example, \] stands for a literal closing square bracket, so [[\]a] will match an opening square bracket [, a closing square bracket ] or an a. \\ stands for a literal backslash. \x0009 stands for a tab character.

Just to re-emphasise: these are the meanings of these characters inside square brackets, and any other characters are treated literally. For example [\t ] will match a 't' or a space - not a tab or a space. Use [\x0009 ] to match a tab or a space.
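The square bracket examples can be sketched in Python's re module, assumed here purely for illustration. Note one difference between engines: Python does treat \t inside square brackets as a tab, unlike the OpenOffice.org behaviour just described, so patterns should not be copied between the two without checking.

```python
import re

def whole_matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["hut", "mut", "nut", "out", "put", "cut"]

either = whole_matches(r"r[eo]d", ["red", "rod", "rid"])
ranged = whole_matches(r"[m-p]ut", words)
mixed = whole_matches(r"[hm-p]ut", words)

# A leading ^ negates the set: any character except a, b, c, d, x, y, z
negated = whole_matches(r"[^a-dxyz]", ["a", "e", "m", "x"])

# Engine difference: in Python, [\t ] matches a tab as well as a space
python_tab = bool(re.fullmatch(r"[\t ]", "\t"))

print(either, ranged, mixed, negated, python_tab)
```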

POSIX bracket expressions [:alpha:] [:digit:] etc.

There is much confusion in the OpenOffice.org community about these. The Help itself is also far from clear.

There are a number of 'POSIX bracket expressions' (sometimes called 'POSIX character classes') available in OpenOffice.org regular expressions, of the form [:classname:] which allow a match with any of the characters in that class. For instance [:digit:] stands for any of the digits 0123456789.

These (by definition) may only appear inside the square brackets of an alternative match - so a valid syntax would be [abc[:digit:]], which should match a, b, c, or any digit 0-9. A correct syntax to match just any one digit would be [[:digit:]].

Unfortunately this does not work as it should! The correct syntax does not work at all, but currently an incorrect syntax ([:digit:]) will actually match a digit, as long as it is outside the square brackets of an alternative match. (Obviously this is unsatisfactory, and is the subject of issue 64368).

The POSIX bracket expressions available are listed below. Note that the exact definition of each depends on locale - for example in a different language other characters may be considered 'alphabetic letters' in [:alpha:]. The meanings given here apply generally to English-speaking locales (and do not take into account any Unicode issues).

[:digit:]: stands for any of the digits 0123456789. This is equivalent to 0-9.

[:space:]: should stand for any whitespace character, including tab; however as currently implemented it stands simply for a space character. Note that the Help is currently misleading here. (This is the subject of issue 41706).

[:print:]: should stand for any printable character; however as currently implemented it does not match the single quote nor the double quote characters ‘ ’ “ ” (and some others such as « »). It matches space, but does not match tab (this latter is expected/defined behaviour). (This is the subject of issue 83290).

[:cntrl:]: stands for a control character. As far as a user is concerned, OpenOffice.org documents have very few control characters; tab and hard_line_break are both matched, but paragraph_mark is not.

[:alpha:]: stands for a letter (including a letter with an accent). For example in the phrase (often used in English, and here given with accents as in the original language) 'déjà vu' all 6 letters will match.

[:alnum:]: stands for a character that satisfies either [:alpha:] or [:digit:]

[:lower:]: stands for a lowercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

[:upper:]: stands for an uppercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

There seems to be little consistency in any implementation of POSIX bracket expressions (OOo or elsewhere). One approach is simply to use straightforward character classes - so instead of [[:digit:]] you use [0-9] for example.
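The plain character class approach can be sketched in Python, which is an assumption of this example (Python's re module does not offer POSIX bracket expressions at all). The letter class [^\W\d_] is a Python idiom for 'roughly any letter, accented or not' and is not OpenOffice.org syntax.

```python
import re

# [0-9] in place of [[:digit:]]
digits = re.findall(r"[0-9]", "Room 42b")

# Rough Python stand-in for [:alpha:]: word characters minus digits and underscore
letters = re.findall(r"[^\W\d_]", "d\u00e9j\u00e0 vu")

print(digits, len(letters))
```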

Grouping (...) and backreferences \x $x

Round brackets ( ) may be used to group terms.

For example: red(den)? will find 'red' and 'redden'; here (den)? means 'one or zero of den'.

For example: (blue|black)bird will find both 'bluebird' and 'blackbird'.

Each group enclosed in round brackets is also defined as a reference, and can be referred to later in the same expression using a 'backreference'. In the 'Search for' box, backreferences are written '\1', '\2', etc.; in the 'Replace with' box they are written '$1', '$2', etc.

'\1' or '$1' stands for 'whatever matched in the first round brackets'; '\2' or '$2' stands for 'whatever matched in the second round brackets'; and so on.

For example: (blue|black) \1bird in the 'Search for' box will find both 'blue bluebird' and 'black blackbird', because '\1' stands for either blue or black, whichever we found. Therefore 'black bluebird' does not match.

Backreferences in the 'Replace with' box only work from OOo2.4 onwards. The use of $1 rather than \1 is consistent with perl syntax, and more particularly with the ICU regex engine, which may at some time replace the existing OOo regex engine, thus resolving many issues.

For example: (gr..n)(blu.) in the 'Search for' box will find 'greenblue'; if the 'Replace with' box has $2$1 the replacement will be 'bluegreen'.

When regular expressions are selected, to replace text with the literal character '$' you must now use '\$'; similarly for '\' use '\\'.

For example: (1..) in the 'Search for' box and \$$1 in the 'Replace with' box replaces '100' with '$100', and '150' with '$150'.

$0 in the 'Replace with' box replaces with the entire text found.
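The grouping and backreference examples can be sketched in Python's re module, with one syntax difference that is an assumption of this example: Python writes backreferences as \1, \2 in the replacement text as well as in the pattern, where the 'Replace with' box uses $1, $2. A dollar sign in a Python replacement is therefore already literal.

```python
import re

def whole_matches(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional_group = whole_matches(r"red(den)?", ["red", "redden", "reddish"])
birds = whole_matches(r"(blue|black)bird", ["bluebird", "blackbird", "redbird"])

# \1 must repeat whatever the first group matched
same_colour = whole_matches(
    r"(blue|black) \1bird",
    ["blue bluebird", "black blackbird", "black bluebird"],
)

# Search (gr..n)(blu.), replace with $2$1 (written \2\1 in Python)
swapped = re.sub(r"(gr..n)(blu.)", r"\2\1", "greenblue")

# Search (1..), replace with \$$1 (written $\1 in Python)
priced = re.sub(r"(1..)", r"$\1", "100 and 150")

print(optional_group, birds, same_colour, swapped, priced)
```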

Tabs, newlines, paragraphs \t \n $

The character pair '\t' has special meaning - it stands for a tab character.

For example: \tred will match a tab character followed by the word 'red'.

In Writer a newline may be entered by pressing Shift-Enter. A newline character is thereby inserted into the text, and the following text starts on a new line. This is not the same as a new paragraph; click View-Non printing characters to see the difference.

The OOo regular expression behaviour when matching paragraph marks and newline characters is 'unusual'. This is partly because regular expressions in other software usually deal with ordinary plain text, whereas OOo regular expressions divide the text at paragraph marks. For whatever reason, this is what you can do:

\n will match a newline (Shift-Enter) if it is entered in the Search box. In this context it is simply treated like a character, and can be replaced by say a space, or nothing. The regular expression red\n will match red followed by a newline character - and if replaced simply by say blue the newline will also be replaced. The regular expression red$ will match 'red' when it is followed by a newline. In this case, replacing with 'blue' will only replace 'red' - and will leave the newline intact.
red\ngreen will match 'red' followed by a newline followed by 'green'; replacing with say 'brown' will remove the newline. However neither red.green nor red.*green will match here - the dot . does not match newline.
$ on its own will match a paragraph mark - and can be replaced by say a 'space', or indeed nothing, in order to merge two paragraphs together. Note that red$ will match 'red' at the end of a paragraph, and if you replace it with say a space, you simply get a space where 'red' was - and the paragraphs are unaffected - the paragraph mark is not replaced. It may help to regard $ on its own as a special syntax, unique to OOo.
^$ will match an empty paragraph, which can be replaced by say nothing, in order to remove the empty paragraph. Note that ^red$ matches a paragraph with only 'red' in it - replacing this with nothing leaves an empty paragraph - the paragraph marks at either end are not replaced. It may help to regard ^$ on its own as a special syntax, unique to OOo. Unfortunately, because OOo has taken over this syntax, it seems you cannot use ^$ to find empty cells in a table (nor empty Calc cells).
If you wish to replace every newline with a paragraph mark, firstly you will search for \n with Find All to select the newlines. Then in the Replace box you enter \n, which in the Replace box stands for a paragraph mark; then choose Replace. This is somewhat bizarre, but at least now you know. Note that \r is interpreted as a literal 'r', not a carriage return.
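Some of the newline behaviour above has a close analogue in ordinary regex engines. The sketch below uses Python's re module (an assumption; it is not the OpenOffice.org engine), models a newline as "\n" inside one string, and models paragraphs as a list of strings. The special meanings of $ and ^$ on their own are unique to OOo and are only imitated here with list operations.

```python
import re

text = "red\ngreen"  # 'red', a newline (Shift-Enter), then 'green'

# red\n consumes the newline, so the replacement removes it
with_newline = re.sub(r"red\n", "blue", text)

# red$ matches just before the newline and leaves it intact
keep_newline = re.sub(r"red$", "blue", text, flags=re.MULTILINE)

# The dot does not match a newline
dot_match = re.search(r"red.green", text)

# An explicit \n in the pattern does
explicit = re.sub(r"red\ngreen", "brown", text)

# Imitating '$ replaced by a space' and '^$ replaced by nothing'
paragraphs = ["first", "", "second"]
merged = " ".join(p for p in paragraphs if p != "")

print(repr(with_newline), repr(keep_newline), dot_match, explicit, merged)
```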

To replace paragraph marks - as used to give lines a certain length in some html documents, for instance - with "normal" automatically wrapped lines and paragraphs, the following 3 steps should help. Don't forget to choose More Options and tick the Regular Expressions box for this procedure.

1. So as not to lose "normal" paragraph marks at the end of "normal" paragraphs, replace two consecutive paragraph marks using a sequence of characters not occurring anywhere else in the text, like "*****" to replace an empty paragraph - this makes it easy to find and reinstate later. You do this by putting ^$ in the Find box and "*****" in the Replace box. (If you're only dealing with a limited chunk of text, don't forget to check "current selection only" under "more options" in the Find and Replace box.)

2. Search for the remaining line-end paragraph marks by putting $ in the Find box. To replace the mark with a "space" just type a space in the Replace dialogue.

3. Now that the text is ready for normal line-wrapping, put back the "normal" paragraph marks by typing "*****" in the Find box and \n in the Replace box. (Remember to check "current selection only" where appropriate!)

Before you try this, create a test document to practise on.

This is a good sequence to make into a macro. You can find macro suggestions on this OOo forum page: "replacing hard paragraphs".

(This procedure also helps deal indirectly with line-break problems.)
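The three steps above can be sketched on plain data. This is a Python illustration under stated assumptions, not an OOo macro: paragraphs are modelled as a list of strings, and the function name unwrap and its marker parameter are hypothetical.

```python
def unwrap(paragraphs, marker="*****"):
    """Join hard-wrapped lines, keeping empty paragraphs as real breaks."""
    # Step 1: replace each empty paragraph with a marker not used elsewhere
    marked = [marker if p == "" else p for p in paragraphs]
    # Step 2: replace every remaining paragraph mark with a space
    joined = " ".join(marked)
    # Step 3: turn each marker back into a paragraph break
    # (strip() tidies the spaces left around the marker; the manual
    # procedure in Writer would leave them in place)
    return [part.strip() for part in joined.split(marker)]

wrapped = [
    "This line was",
    "wrapped by hand.",
    "",
    "Second paragraph,",
    "also wrapped.",
]
result = unwrap(wrapped)
print(result)
```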

Hexadecimal codes \xXXXX

The character sequence ' \x then a 4 digit hexadecimal number ' stands for the character with that code.

For example: \x002A stands for the star character '*'.

Hexadecimal codes can be seen on the 'Insert-Special Character' dialog.
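To look up or check a code without opening the dialog, a few lines of Python will do. This is only an illustration; Python is an assumption here, and its pattern syntax for a 4-digit code is \uXXXX where OOo uses \xXXXX.

```python
import re

# Character for a given hexadecimal code
star = chr(0x002A)

# 4-digit hexadecimal code for a given character
code = f"{ord('*'):04X}"

# Python spells the 4-digit form \u002A inside a pattern
found = re.findall(r"\u002A", "2 * 3 * 4")

print(star, code, found)
```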

The 'Replace with' box \t \n & $1 $2

Users are sometimes confused about what can be done using the 'Replace with' box in a Find & Replace dialog.

In general, regular expressions do not work in the 'Replace with' box. The characters you type replace the found text literally.

The four constructs that do work are:

\t inserts a tab, replacing the text found.
\n inserts a paragraph mark, replacing the text found. This may be unexpected, because \n in the 'Search for' box means 'newline'! In some operating systems it is possible to use unicode input to directly type a newline character (U+000A) in the 'Replace with' box, providing a workaround, but this is not universal.
$1, $2, etc are backreferences, which (from OOo2.4) insert text groups found. See under Grouping and backreferences. $0 inserts the entire text found.
& also inserts the entire text found.

For example, if you searched for bird|berry, you would find either 'bird' or 'berry'; replacing with black& would then give you either 'blackbird' or 'blackberry'.
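The effect of & can be sketched in Python, where the entire match is written \g<0> in the replacement text. Python's re module is an assumption of this example; it does not treat & specially.

```python
import re

text = "a bird ate a berry"

# Search for bird|berry, replace with black& (written black\g<0> in Python)
result = re.sub(r"bird|berry", r"black\g<0>", text)

print(result)
```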

Troubleshooting OOo regular expressions

If you are new to regular expressions, please realise that they can be tricky - if you are not getting the results you expect, you might need to check that you understand well enough. Try to keep regular expressions as simple and unambitious as possible.

Here are some further points of interest with OOo regular expressions:

If you find an unexpected behaviour, please check in the relevant section in this HowTo - many of the behaviour issues have been documented here.
Regular expressions are 'greedy' - that is they will match as much text as they can. Consider using curly and square brackets; for example [^ ]{1,5}\> matches 1 to 5 non-space characters at the end of a word.
Please be careful when using the Replace All button. There are a few rare occasions when this will give unexpected results. For example to remove the first character of every paragraph you might 'Search for' ^. and 'Replace with' nothing; clicking 'Replace All' now will wipe out *all* your text, instead of just the first character of each paragraph. Issue 82473 discusses this. The workaround is to 'Find All', then 'Replace'; perhaps the safest way is not to use the 'Replace All' button at all with regular expressions.
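For reference, the result one would want from the ^. example is sketched below in Python, with each paragraph modelled as a separate string. Python's re module is an assumption here, so this shows the intended outcome and does not reproduce the Replace All issue.

```python
import re

paragraphs = ["First paragraph.", "Second paragraph."]

# Search for ^. and replace with nothing: only the first character goes
trimmed = [re.sub(r"^.", "", p) for p in paragraphs]

print(trimmed)
```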

Tips and Tricks

Here are some examples that may be useful:

\<([^ ]+)[ ]+\1

finds duplicate words separated by spaces (note that there is a space before each ])

\<[:alpha:]*\>

finds any word in the whole document (note: the Regular expressions check box must be ticked)

\<[1-9][0-9]*\>

finds decimal integers (with no leading zero)

\<0[0-7]*\>

finds octal (base 8) numbers

\<0x[A-Fa-f0-9]+\>

finds hexadecimal (base 16) numbers

[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}

finds most email addresses (there is no perfect regular expression - this is a practical solution)
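These patterns can be tried on sample text with the sketch below. It assumes Python's re module, so \< and \> are rewritten as \b (Python's single word-boundary anchor); the results may differ from OpenOffice.org in edge cases such as underscores.

```python
import re

# Duplicate words separated by spaces
duplicate = re.search(r"\b([^ ]+)[ ]+\1", "it was was fine").group(0)

numbers = "10 007 0x1F 42"
decimal = re.findall(r"\b[1-9][0-9]*\b", numbers)
octal = re.findall(r"\b0[0-7]*\b", numbers)
hexadecimal = re.findall(r"\b0x[A-Fa-f0-9]+\b", numbers)

# Practical (not perfect) email pattern from the list above
email = re.findall(
    r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}",
    "Contact person@rediton.com today",
)

print(duplicate, decimal, octal, hexadecimal, email)
```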
