Regular Expressions in Writer

1 Introduction
2 Where regular expressions may be used in OpenOffice.org
3 A simple example
4 The least you need to know about regular expressions
5 How regular expressions are applied in OpenOffice.org
6 Literal characters
7 Special characters
8 Single character match . ?
9 Repeating match + * {m,n}
10 Positional match ^ $ \< \>
11 Alternative matches | [...]
12 POSIX bracket expressions [:alpha:] [:digit:] etc.
13 Grouping (...) and backreferences \x
14 Tabs, newlines, paragraphs \t \n $
15 Hexadecimal codes \xXXXX
16 The 'Replace with' box \t \n &
17 Troubleshooting OpenOffice.org regular expressions
18 Tips and tricks
19 External links

Introduction

In simple terms, regular expressions are a clever way to find and replace text (similar to 'wildcards'). Regular expressions can be both powerful and complex, and it is easy for inexperienced users to make mistakes. This HowTo describes the use of OpenOffice.org regular expressions, aiming to be clear enough for the novice, while the finer details, which could confuse a beginner, are left to more experienced users.

A typical use for regular expressions is finding text in a Writer document; for instance, to locate all occurrences of the words gospod or gospa in your document, you could search with a single regular expression that finds both words.

Regular expressions are very common in some areas of computing, and are often known as regex or regexp (short for 'regular expression'). Not all implementations are the same, so it makes sense to read the manual for the one you are using.

Where regular expressions may be used in OpenOffice.org

In Writer:

Edit - Find & Replace dialog

Edit - Changes - Accept or Reject command (Filter tab)

In Calc:

Edit - Find & Replace dialog

Data - Filter - Standard Filter and Advanced Filter

Certain functions, such as SUMIF and LOOKUP

In Base:

Find Record command

The dialogs that appear when you use the above commands have an option to use regular expressions (which is off by default). For example:

Figure: location of the regular expressions checkbox

Check the status of the regular expressions option each time you open the dialog, as it defaults to 'off'.

A simple example

If you have little or no experience of regular expressions, you will find it easiest to study them in Writer rather than in Calc.

In Writer, open the Find & Replace dialog from the Edit menu.

In the dialog, choose More Options and tick the Regular expressions box.

In the Search for box, enter r.k - the dot here means 'any single character'.

Clicking the Find All button will now find all the places where an r is followed by another character and then a k, for example 'rik' or 'burek' or 'korakali' or 'moker kamen' (in this last case the r is followed by a space and then a k - the space counts as a character).

If you enter xxx in the Replace with box and click the Replace All button, these become 'xxx', 'buxxx', 'koxxxali', 'mokexxxamen'.

That may not be very useful, but it shows the principle. The following examples explain the use of the Find & Replace dialog in more detail.
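
The same search and replace can be tried outside Writer as a rough check. The sketch below uses Python's re module, which is an assumption made purely for illustration: it is not the engine OpenOffice.org uses, although the dot behaves the same way in this simple case.

```python
import re

# Sample texts from the example above.
samples = ["rik", "burek", "korakali", "moker kamen"]

# r.k: an r, then any single character, then a k.
pattern = re.compile(r"r.k")

# Rough equivalent of typing xxx in 'Replace with' and clicking Replace All.
replaced = [pattern.sub("xxx", text) for text in samples]

for before, after in zip(samples, replaced):
    print(f"{before!r} -> {after!r}")
```

Each printed pair shows the original text and the text after replacement, so you can compare it with what Writer produces.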

The least you need to know about regular expressions

If you don't care how regular expressions actually work and just want to get the job done, these common examples may help. Enter them in the 'Search for' box and make sure regular expressions are enabled.

krom|Cr finds krom and Cr
obl.ka finds obl, followed by any single character, then ka - for example obleka, oblika and even oblXka
kr[ae]ma finds krema and krama - [ae] means a or e
spremenjeni? finds spremenjen and spremenjeni - the i is optional because it is followed by a question mark.
s\> finds an s at the end of a word.
\<. finds the first character of a word.
^. finds the first character of a paragraph.
^$ finds an empty paragraph.
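
If you would like to experiment with these patterns away from a document, here is a minimal sketch in Python. Python's re module is an assumption chosen for illustration and differs from OpenOffice.org in places: it has no \< or \>, so \b (a word boundary) stands in for them, and \<. is approximated by \b\w. The sample words and the paragraphs list are made up for the test.

```python
import re

text = "krom Cr obleka oblika krema krama spremenjen spremenjeni"

alternation = re.findall(r"krom|Cr", text)      # krom or Cr
any_char = re.findall(r"obl.ka", text)          # obl, any one character, ka
char_set = re.findall(r"kr[ae]ma", text)        # [ae] is a or e
optional = re.findall(r"spremenjeni?", text)    # the trailing i is optional

# Stand-ins for s\> and \<. using Python's \b word boundary.
word_final_s = re.findall(r"s\b", "kos stol las")
word_first = re.findall(r"\b\w", "kos stol las")

# Writer searches each paragraph separately, so ^. and ^$ are tried per paragraph.
paragraphs = ["Prvi odstavek.", "", "Tretji odstavek."]
hits = (re.search(r"^.", p) for p in paragraphs)
first_chars = [m.group() for m in hits if m]
empty_paragraphs = [i for i, p in enumerate(paragraphs) if re.search(r"^$", p)]
```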

How regular expressions are applied in OpenOffice.org

OpenOffice.org regular expressions divide the text to be searched into portions and examine each portion separately.

In Writer, text appears to be divided into paragraphs. For example x.*z will not match an x at the end of one paragraph together with a z beginning the next paragraph (x.*z means x, then any or no characters, then z). Paragraphs are treated separately (although we discuss some special cases at the end of this HowTo).

In addition, Writer treats each table cell and each text frame separately. Text frames are examined after all the other text and all table cells on all pages have been examined.

In the Find & Replace dialog, regular expressions may be used in the Search for box. In general they may not be used in the Replace with box. The exceptions are discussed later.
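
The paragraph-by-paragraph behaviour can be imitated with a small Python sketch. Two assumptions are made here: Python's re module stands in for the OpenOffice.org engine, and a paragraph mark is modelled as a plain "\n" in a string, which is not how Writer stores text.

```python
import re

# Two paragraphs: the first ends in x, the second starts with z.
document = "abc x\nz def"
paragraphs = document.split("\n")

pattern = re.compile(r"x.*z")

# Searching each paragraph on its own finds nothing,
# because no single paragraph holds both the x and the z.
per_paragraph = [bool(pattern.search(p)) for p in paragraphs]

# Inside one paragraph the same pattern does match.
same_paragraph = pattern.search("abc x and z def")
```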

Literal characters

If your regular expression contains characters other than the 'special characters' . ^ $ * + ? \ [ ( { | then those characters are matched literally.

For example: red matches red, redraw and Freddie.

OpenOffice.org lets you choose whether the case of a character ('UPPER CASE' or 'lower case') matters. If you tick the 'Match case' box in the Find & Replace dialog, then red will not match Red or FRED; if you leave the box unticked, case is ignored and both will be matched.
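
As a rough illustration of literal matching and of the 'Match case' box, here is a sketch in Python. Treating re.IGNORECASE as the counterpart of an unticked 'Match case' box is an assumption made for the example, not a statement about how OpenOffice.org implements the option.

```python
import re

words = ["red", "redraw", "Freddie", "Red", "FRED"]

# 'Match case' ticked: the letters must match exactly.
case_sensitive = [w for w in words if re.search(r"red", w)]

# 'Match case' unticked: upper and lower case are treated alike.
case_insensitive = [w for w in words if re.search(r"red", w, re.IGNORECASE)]
```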

Special characters

The special characters are . ^ $ * + ? \ [ ( { |

They have special meanings in a regular expression, as described later.

If you wish to match one of these characters literally, place a backslash '\' before it.

For example: to match $100 use \$100 - here \$ stands for the literal character $.
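
The effect of the backslash can be seen in a short Python sketch. Python's re module is assumed here only as a convenient test bed; re.escape is a Python helper and has no counterpart in the Find & Replace dialog.

```python
import re

text = "The price is $100, not 100$."

# Unescaped, $ is an end-of-text anchor, so $100 can never match.
unescaped = re.findall(r"$100", text)

# Escaped, \$ stands for a literal dollar sign.
escaped = re.findall(r"\$100", text)

# re.escape adds the backslashes for you.
auto_escaped = re.escape("$100")
```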

Single character match . ?

The dot '.' special character stands for any single character (except a newline).

For example: r.d matches 'red' and 'hotrod' and 'bride' and 'your dog'

The question mark '?' special character means 'match zero or one of the preceding character' - or 'match the preceding character if it is found'.

For example: rea?d matches 'red' and 'read' - 'a?' means 'match a single a if there is one'.

Special characters can be combined with each other. A dot followed by a question mark means 'match zero or one of any single character'.

For example: star.?ing matches 'staring', 'starring', 'starting' and 'starling', but not 'startling'
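
These examples can be checked with a few lines of Python. This is a sketch under the assumption that Python's re module treats the dot and the question mark the same way as OpenOffice.org does in these cases; the helper name matches is made up for the example.

```python
import re

def matches(pattern, words):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

dot = matches(r"r.d", ["red", "hotrod", "bride", "your dog", "rd"])
optional_a = matches(r"rea?d", ["red", "read", "reaad"])
dot_optional = matches(
    r"star.?ing",
    ["staring", "starring", "starting", "starling", "startling"],
)
```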

Repeating match + * {m,n}

The plus '+' special character means 'match one or more of the preceding character'.

For example: re+d matches 'red' and 'reed' and 'reeeeed' - 'e+' means 'match one or more e's'.

The star '*' special character means 'match zero or more of the preceding character'.

For example: rea*d matches 'red' and 'read' and 'reaaaaaaad' - 'a*' means 'match zero or more a's'.

A common use for '*' is after the dot character - i.e. '.*', which means 'any or no characters'.

For example: rea.*d matches 'read' and 'reaXd' and 'reaYYYYd' but not 'red' or 'reXd'

Use the star '*' with caution; it will grab everything it can:

For example: 'r.*d' matches 'red', but in Writer, if your paragraph is actually 'The referee showed him the red card again', the match found is 'referee showed him the red card' - that is, the first 'r' and the last possible 'd'. Regular expressions are greedy by nature.

You may specify how many times you wish the match to be repeated, with curly brackets { }. For example a{1,4}rgh! will match argh!, aargh!, aaargh! and aaaargh! - in other words between 1 and 4 a's then rgh!.

Note also that a{3}rgh! will match exactly 3 a's, i.e. aaargh!, and a{2,}rgh! (with a comma) will match at least 2 a's, for example aargh! and aaaaaaaargh!.
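
The repetition operators and the greedy behaviour can be explored with the sketch below. It assumes Python's re module as a stand-in for the OpenOffice.org engine, and it uses re.fullmatch to test whole words, which is stricter than a search in Writer. The last line shows one way to rein in a greedy match with a square-bracket class instead of the dot.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

plus = full(r"re+d", ["rd", "red", "reed", "reeeeed"])
star = full(r"rea*d", ["red", "read", "reaaaaaaad"])
dot_star = full(r"rea.*d", ["read", "reaXd", "reaYYYYd", "red", "reXd"])
one_to_four = full(r"a{1,4}rgh!", ["rgh!", "argh!", "aaaargh!", "aaaaargh!"])
exactly_three = full(r"a{3}rgh!", ["aargh!", "aaargh!", "aaaargh!"])
two_or_more = full(r"a{2,}rgh!", ["argh!", "aargh!", "aaaaaaaargh!"])

paragraph = "The referee showed him the red card again"

# Greedy: from the first r to the last possible d.
greedy = re.search(r"r.*d", paragraph).group()

# [^ ]* cannot cross a space, so the match stays inside one word.
narrow = re.search(r"r[^ ]*d", paragraph).group()
```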

Positional match ^ $ \< \>

The circumflex '^' special character means 'match at the beginning of the text'.

The dollar '$' special character means 'match at the end of the text'.

Remember that OpenOffice.org regular expressions divide up the text to be searched - each paragraph in Writer is examined separately.

For example: ^red matches 'red' at the start of a paragraph (red night shepherd's delight).

For example: red$ matches 'red' at the end of a paragraph (he felt himself go red)

For example: ^red$ matches inside a table cell that contains just 'red'

In addition a hard line break (entered with Shift-Enter) is considered the beginning / end of text, and will allow a ^ or $ match.

The backslash '\' special character gives special meaning to the character pairs '\<' and '\>', namely 'match at the beginning of a word', and 'match at the end of a word'.

For example: \<red matches red at the beginning of a word (she went redder than he did).

For example: red\> matches red at the end of a word (although neither of them cared much.)

The test for the beginning or end of a word is that the previous or next character is a space, underscore (_), tab, newline, paragraph mark, or any other non-alphanumeric character.

For example: \<red matches 'person@rediton.com'

For example: red\> matches 'I said, "No-one dared" '
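
The positional matches can be imitated in Python, with some care. The sketch assumes Python's re module, which has no \< or \>. Its own \b treats the underscore as part of a word, which differs from the rule given above, so two hypothetical stand-ins (WORD_START and WORD_END) are built from lookarounds instead. They only know about unaccented letters and digits, so treat them as an approximation.

```python
import re

# Writer searches one paragraph at a time, so each string is one paragraph.
paragraphs = ["red night shepherd's delight", "he felt himself go red", "red"]

at_start = [p for p in paragraphs if re.search(r"^red", p)]
at_end = [p for p in paragraphs if re.search(r"red$", p)]
whole = [p for p in paragraphs if re.search(r"^red$", p)]

# Hypothetical stand-ins for \< and \>.
WORD_START = r"(?<![A-Za-z0-9])"  # previous character is not a letter or digit
WORD_END = r"(?![A-Za-z0-9])"     # next character is not a letter or digit

def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None

starts = [
    found(WORD_START + "red", s)
    for s in ["she went redder than he did", "person@rediton.com", "snake_red", "cared"]
]
ends = [
    found("red" + WORD_END, s)
    for s in ["although neither of them cared much.", 'I said, "No-one dared"', "redder"]
]
```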

Alternative matches | [...]

The pipe character '|' is a special character which allows the expression either side of the '|' to match.

For example: red|blue matches 'red' and 'blue'

Unfortunately, certain expressions when used after a pipe are not evaluated. This is so far known to affect ^ and backreferences, and is the subject of issue 84828

For example: ^red|blue matches paragraphs beginning with 'red' and any occurrence of 'blue', but blue|^red incorrectly matches only any occurrence of 'blue', failing to match paragraphs beginning with 'red'

The open square brackets character [ is a special character. Characters enclosed in square brackets are treated as alternatives - any one of them may match. You can also include ranges of characters, such as a-z or 0-9, rather than typing in abcdefghijklmnopqrstuvwxyz or 0123456789

For example: r[eo]d matches 'red' and 'rod' but not 'rid'

For example: [m-p]ut matches 'mut' and 'nut' and 'out' and 'put'

For example: [hm-p]ut matches 'hut' and 'mut' and 'nut' and 'out' and 'put'

Special characters within alternative match square brackets do not have the same special meanings. The only characters which do have special meanings are ], -, ^ and \, and the meanings are:

] - a closing square bracket ends the alternative match set [abcdef]
- - a hyphen indicates a range of characters, as we've seen, eg [0-9]
^ - if the caret is the first character in the square brackets, it negates the search. For example [^a-dxyz] matches any character except abcdxyz.
\ - the backslash is used to allow ], -, ^ and \ to be used literally in square brackets, and to allow hexadecimal codes. For example, \] stands for a literal closing square bracket, so [[\]a] will match an opening square bracket [, a closing square bracket ] or an a. \\ stands for a literal backslash. \x0009 stands for a tab character.

Just to re-emphasise: these are the meanings of these characters inside square brackets, and any other characters are treated literally. For example [\t ] will match a 't' or a space - not a tab or a space. Use [\x0009 ] to match a tab or a space.
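
The alternative matches can be tried in Python as a sketch, with Python's re module assumed as a test bed. Three differences from the behaviour described above are worth knowing: Python evaluates ^ correctly after a pipe, it reads \t inside square brackets as a tab, and it spells the four-digit code \x0009 as \u0009 (in Python \x takes only two hexadecimal digits). Python also warns about an unescaped [ inside a set, so the bracket example escapes both brackets.

```python
import re

def full(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["hut", "mut", "nut", "out", "put", "qut"]

either = re.findall(r"red|blue", "a red and blue flag")
one_of = full(r"r[eo]d", ["red", "rod", "rid"])
in_range = full(r"[m-p]ut", words)
letter_or_range = full(r"[hm-p]ut", words)
negated = re.findall(r"[^a-dxyz]", "abcdefxyz")

# Opening bracket, closing bracket or a.
brackets = re.findall(r"[\[\]a]", "[a]b")

# Tab or space, using the Python spelling of the hexadecimal code.
tab_or_space = re.findall(r"[\u0009 ]", "a\tb c")

# In Python the order of the alternatives does not disable ^.
paragraphs = ["red sky", "the blue sea", "not red"]
anchored = [p for p in paragraphs if re.search(r"blue|^red", p)]
```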

POSIX bracket expressions [:alpha:] [:digit:] etc.

There is much confusion in the OpenOffice.org community about these. The Help itself is also far from clear.

There are a number of 'POSIX bracket expressions' (sometimes called 'POSIX character classes') available in OpenOffice.org regular expressions, of the form [:classname:] which allow a match with any of the characters in that class. For instance [:digit:] stands for any of the digits 0123456789.

These (by definition) may only appear inside the square brackets of an alternative match - so a valid syntax would be [abc[:digit:]], which should match a, b, c, or any digit 0-9. A correct syntax to match just any one digit would be [[:digit:]].

Unfortunately this does not work as it should! The correct syntax does not work at all, but currently an incorrect syntax ([:digit:]) will actually match a digit, as long as it is outside the square brackets of an alternative match. (Obviously this is unsatisfactory, and is the subject of issue 64368).

The POSIX bracket expressions available are listed below. Note that the exact definition of each depends on locale - for example in a different language other characters may be considered 'alphabetic letters' in [:alpha:]. The meanings given here apply generally to English-speaking locales (and do not take into account any Unicode issues).

[:digit:]: stands for any of the digits 0123456789. This is equivalent to 0-9.

[:space:]: should stand for any whitespace character, including tab; however as currently implemented it stands simply for a space character. Note that the Help is currently misleading here. (This is the subject of issue 41706).

[:print:]: should stand for any printable character; however as currently implemented it does not match the single quote nor the double quote characters ‘ ’ “ ” (and some others such as « »). It matches space, but does not match tab (this latter is expected/defined behaviour). (This is the subject of issue 83290).

[:cntrl:]: stands for a control character. As far as a user is concerned, OpenOffice.org documents have very few control characters; tab and hard_line_break are both matched, but paragraph_mark is not.

[:alpha:]: stands for a letter (including a letter with an accent). For example in the phrase (often used in English, and here given with accents as in the original language) 'déjà vu' all 6 letters will match.

[:alnum:]: stands for a character that satisfies either [:alpha:] or [:digit:]

[:lower:]: stands for a lowercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

[:upper:]: stands for an uppercase letter (including a letter with an accent). The case matching does not work unless the Match case box is ticked; if this box is not ticked this expression is equivalent to [:alpha:].

There seems to be little consistency in any implementation of POSIX bracket expressions (OOo or elsewhere). One approach is simply to use straightforward character classes - so instead of [[:digit:]] you use [0-9] for example.
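
The fallback to straightforward character classes can be sketched in Python. Note the assumptions: Python's re module is used only for illustration and has no POSIX bracket expressions at all, so the patterns below are stand-ins rather than OpenOffice.org syntax, and the [^\W\d_] idiom for 'any letter' is a Python convention.

```python
import re

phrase = "d\u00e9j\u00e0 vu"  # 'deja vu' written with its accents

# [0-9] in place of [[:digit:]].
digits = re.findall(r"[0-9]", "Room 42")

# A plain A-Z range misses the accented letters.
ascii_letters = re.findall(r"[A-Za-z]", phrase)

# [^\W\d_]: a word character that is neither a digit nor an underscore,
# which includes accented letters.
letters = re.findall(r"[^\W\d_]", phrase)
```

The plain ranges are the most portable choice, at the price of missing accented letters.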

Grouping (...) and backreferences \x

Round brackets ( ) are used to group terms.

For example: red(den)? matches 'red' and 'redden'; here (den)? means 'one or zero of den'.

For example: (blue|black)bird matches both 'bluebird' and 'blackbird'.

Each group enclosed in round brackets is also defined as a reference, and can be referred to later in the same expression using a 'backreference'. The backreference '\1' stands for 'whatever matched in the first round brackets'; '\2' stands for 'whatever matched in the second round brackets'; and so on.

For example: (blue|black) \1bird will find both 'blue bluebird' and 'black blackbird', because '\1' stands for either blue or black, whichever we found. Therefore 'black bluebird' does not match.
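
Grouping and backreferences in the search pattern can be checked with this Python sketch. Python's re module is assumed as a stand-in; for these particular patterns its syntax happens to be the same as that shown above.

```python
import re

# (den)? makes the whole group optional.
optional_group = [
    w for w in ["red", "redden", "redd"] if re.fullmatch(r"red(den)?", w)
]

# group(0) is the whole match, not just the part in round brackets.
birds = [
    m.group(0)
    for m in re.finditer(r"(blue|black)bird", "bluebird blackbird redbird")
]

# \1 must repeat whatever the first group matched.
backref = re.compile(r"(blue|black) \1bird")
phrases = ["blue bluebird", "black blackbird", "black bluebird"]
repeated = [bool(backref.fullmatch(p)) for p in phrases]
```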

Please note that backreferences may only be used in the 'Search for' box at present, not in the 'Replace with' box.

The target for implementation of backreferences in the 'Replace with' box is OOo2.4. A workaround until then is to use Find all, then immediately use Find/Replace again in the current selection only. This may (or may not) allow you to do what you want.

(technical note: issue 15666 covers this. Backreferences in the 'Replace with' box will be $1, $2, $3 etc. This is consistent with perl syntax, and more particularly with the ICU regex engine, which may at some time replace the existing OOo regex engine, thus resolving many issues.)

Tabs, newlines, paragraphs \t \n $

The character pair '\t' has special meaning - it stands for a tab character.

For example: \tred will match a tab character followed by the word 'red'.

In Writer, a newline may be entered by pressing Shift-Enter. A newline character is then inserted into the text, and the following text starts on a new line. This is not the same as a new paragraph; click View - Nonprinting Characters to see the difference.

The OOo regular expression behaviour when matching paragraph marks and newline characters is 'unusual'. This is partly because regular expressions in other software usually deal with ordinary plain text, whereas OOo regular expressions divide the text at paragraph marks. For whatever reason, this is what you can do:

\n will match a newline (Shift-Enter) if it is entered in the Search box. In this context it is simply treated like a character, and can be replaced by say a space, or nothing. The regular expression red\n will match red followed by a newline character - and if replaced simply by say blue the newline will also be replaced. The regular expression red$ will match 'red' when it is followed by a newline. In this case, replacing with 'blue' will only replace 'red' - and will leave the newline intact.
red\ngreen will match 'red' followed by a newline followed by 'green'; replacing with say 'brown' will remove the newline. However neither red.green nor red.*green will match here - the dot . does not match newline.
$ on its own will match a paragraph mark - and can be replaced by say a 'space', or indeed nothing, in order to merge two paragraphs together. Note that red$ will match 'red' at the end of a paragraph, and if you replace it with say a space, you simply get a space where 'red' was - and the paragraphs are unaffected - the paragraph mark is not replaced. It may help to regard $ on its own as a special syntax, unique to OOo.
^$ will match an empty paragraph, which can be replaced by say nothing, in order to remove the empty paragraph. Note that ^red$ matches a paragraph with only 'red' in it - replacing this with nothing leaves an empty paragraph - the paragraph marks at either end are not replaced. It may help to regard ^$ on its own as a special syntax, unique to OOo. Unfortunately, because OOo has taken over this syntax, it seems you cannot use ^$ to find empty cells in a table (nor empty Calc cells).
If you wish to replace every newline with a paragraph mark, firstly you will search for \n with Find All to select the newlines. Then in the Replace box you enter \n, which in the Replace box stands for a paragraph mark; then choose Replace. This is somewhat bizarre, but at least now you know. Note that \r is interpreted as a literal 'r', not a carriage return.

Hexadecimal codes \xXXXX

A character sequence consisting of '\x' followed by a 4-digit hexadecimal number stands for the character with that code.

For example: \x002A stands for the star character '*'.

Hexadecimal codes can be seen in the Insert - Special Character dialog.
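
If you prefer to work out a code without opening the dialog, a few lines of Python will do it. This is a sketch: ooo_hex_code is a hypothetical helper written for this example, and it only yields four digits for characters up to U+FFFF. Note that Python's own regular expressions spell the same code as \u002A, because in Python \x takes only two hexadecimal digits.

```python
import re

def ooo_hex_code(char):
    """Format a character in the \\xXXXX form used in the Search for box."""
    return "\\x{:04X}".format(ord(char))

star_code = ooo_hex_code("*")
tab_code = ooo_hex_code("\t")

# Going the other way: the character behind code 002A.
decoded = chr(0x002A)

# Python's spelling of the same code, matching literal stars.
found = re.findall(r"\u002A", "2 * 3 * 4")

print(star_code, tab_code)
```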

The 'Replace with' box \t \n &

Users are sometimes confused about what they can do with the 'Replace with' box in the Find & Replace dialog.

In general, regular expressions do not work in the 'Replace with' box. The characters you type replace the found text literally.

Three constructs do work, however:

\t inserts a tab, replacing the found text.
\n inserts a paragraph mark, replacing the found text. This may be unexpected, because \n in the 'Search for' box means 'newline'! On some operating systems it is possible to use Unicode input to type a newline character (U+000A) directly into the 'Replace with' box, which provides a workaround, but this is not a universal solution.
& inserts the whole of the found text.

For example: if you search for bird|berry, you will find either 'bird' or 'berry'; replacing with black& would then give you either 'blackbird' or 'blackberry'.
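
For comparison, here is how the same replacements look in Python. This is a sketch under the assumption that Python's re module is an acceptable stand-in; its replacement syntax differs from the 'Replace with' box, most visibly in that the whole match is written \g<0> rather than &.

```python
import re

# \g<0> plays the role of & : it inserts the whole of the found text.
with_match = re.sub(r"bird|berry", r"black\g<0>", "bird and berry")

# \t in the replacement inserts a tab in place of the found text.
with_tab = re.sub(r", ", r"\t", "a, b, c")
```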

Note that backreferences or submatches such as \1, \2 are not yet available in the 'Replace with' box - see the note in the Grouping and backreferences section above.

Troubleshooting OpenOffice.org regular expressions

If you are new to regular expressions, be aware that they can be tricky - if you are not getting the results you expect, check that you understand them well enough. Try to keep regular expressions as simple and unambitious as possible.

Here are some additional points of interest about OpenOffice.org regular expressions:

If you find an unexpected behaviour, please check in the relevant section in this HowTo - many of the behaviour issues have been documented here.
Regular expressions are 'greedy' - that is they will match as much text as they can. Consider using curly and square brackets; for example [^ ]{1,5}\> matches 1 to 5 non-space characters at the end of a word.
Take care when using the Replace All button. There are a few rare occasions when this will give unexpected results. For example to remove the first character of every paragraph you might 'Search for' ^. and 'Replace with' nothing; clicking 'Replace All' now will wipe out *all* your text, instead of just the first character of each paragraph. Issue 82473 discusses this. The workaround is to 'Find All', then 'Replace'; perhaps the safest way is not to use the 'Replace All' button at all with regular expressions.

Tips and tricks

Here are some examples that may come in handy:

\<([^ ]+)[ ]+\1

finds doubled words separated by spaces (note that there is a space before each ])

\<[1-9][0-9]*\>

finds decimal numbers

\<0[0-7]*\>

finds octal (base 8) numbers

\<0x[A-Fa-f0-9]+\>

finds hexadecimal (base 16) numbers

[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}

finds most e-mail addresses (there is no perfect regular expression for this - this is a practical solution)
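
These patterns can be tried out in Python before you use them on a real document. The sketch assumes Python's re module and replaces \< and \> with \b, since Python has no \< or \>; \b is close but not identical, for instance in how it treats underscores. The sample texts are made up for the test.

```python
import re

numbers = "42 007 0x1F 100"

# Doubled words separated by spaces; group(0) is the whole match.
doubled = [
    m.group(0) for m in re.finditer(r"\b([^ ]+)[ ]+\1", "this is is a test")
]

decimal = re.findall(r"\b[1-9][0-9]*\b", numbers)
octal = re.findall(r"\b0[0-7]*\b", numbers)
hexadecimal = re.findall(r"\b0x[A-Fa-f0-9]+\b", numbers)

email = re.findall(
    r"[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-z]{2,6}",
    "write to person@rediton.com today",
)
```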

External links

The ICU regular expression package, a candidate to replace the existing OpenOffice.org regular expression engine (see: Regexp).
Regular expression examples (OpenOffice.org Ninja)
Backreferences in substitutions (OpenOffice.org Ninja)
Guide to regular expressions in OpenOffice.org (OpenOffice.org Ninja)