Difference between revisions of "Documentation/UserGuide/Advanced/RegExp"

From Apache OpenOffice Wiki
(Regular Expressions on Writer)
m (The Dollar Sign $)
 
(22 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{Documentation/DraftPage}}
+
{{DraftPage|EN}}
  
 
{{DISPLAYTITLE:Introduction to Regular Expressions}}
 
{{DISPLAYTITLE:Introduction to Regular Expressions}}
Line 5: Line 5:
 
= Introduction =
 
= Introduction =
  
[[Documentation/UserGuide/SearchReplace#A_note_about_Regular_Expressions|We already mentioned]] what [http://en.wikipedia.org/wiki/Regular_expressions Regular Expressions] are. With Regular Expressions (from now, RegExp), you can "match" non fixed text strings. For example, the expression  
+
[[Documentation/UserGuide/SearchReplace#A_note_about_Regular_Expressions|We already mentioned]] what [https://en.wikipedia.org/wiki/Regular_expressions Regular Expressions] are. With Regular Expressions (from now, RegExp), you can "match" non fixed text strings. For example, the expression  
  
 
   [:number:]  
 
   [:number:]  
Line 19: Line 19:
 
will match an arbitrary number that may (or may not) have a decimal part separated by a dot.
 
will match an arbitrary number that may (or may not) have a decimal part separated by a dot.
  
As it is possible to see from these examples, on RegExp several characters have a particular meaning. On this article the main characteristics of the [http://en.wikipedia.org/wiki/International_Components_for_Unicode ICU] based RegExp engine used by Apache OpenOffice will be commented.  
+
RegExp expressions are case-insensitive by default, that is, they match upper and lower case letters alike. You enable case-sensitivity by using the „Match case“ check box.
 +
 
 +
As it is possible to see from these examples, on RegExp several characters have a particular meaning. On this article the main characteristics of the [https://en.wikipedia.org/wiki/International_Components_for_Unicode ICU] based RegExp engine used by {{AOo}} will be commented.
  
 
= Special Characters =  
 
= Special Characters =  
  
As previously mentioned, several characters have a particular meaning inside regular expressions. These characters are all kind of brackets, the back slash, the dot, the dollar sign, etcetera. In order to search for those special characters we need to "scape" them with a back slash: for example, to search for [ it is needed to write
+
As previously mentioned, several characters have a particular meaning inside regular expressions. These characters are all kind of brackets, the backslash, the dot, the dollar sign, etcetera. In order to search for those special characters we need to "escape" them with a backslash: for example, to search for [ it is needed to write
  
 
   \[
 
   \[
  
and of course, in order to search for the back slash, ''two'' back slashes must be written
+
and of course, in order to search for the backslash, ''two'' backslashes must be written
  
 
   \\
 
   \\
Line 51: Line 53:
 
   math?
 
   math?
  
Will match either mat or math.
+
will match either mat or math.
  
 
== The Plus ==
 
== The Plus ==
  
 
Used to match one or more instances.
 
Used to match one or more instances.
 +
 +
  math+
 +
 +
Will match math, mathh, mathhh...
  
 
== The Asterisk ==
 
== The Asterisk ==
Line 69: Line 75:
 
They can be used to set the number of times the previous character must be repeated. For example
 
They can be used to set the number of times the previous character must be repeated. For example
  
ou{2,5}ch!  
+
  ou{2,5}ch!  
  
 
will match ouuch!, ouuuch!, ouuuuch!, ouuuuuch!, but not ouch!
 
will match ouuch!, ouuuch!, ouuuuch!, ouuuuuch!, but not ouch!
Line 81: Line 87:
 
   ou{3,}ch!  
 
   ou{3,}ch!  
  
Will find the word with ''at least'' three u.  
+
Will find the word with ''at least'' three u.
  
 
== Circumflex ==
 
== Circumflex ==
Line 107: Line 113:
 
== The Dollar Sign $ ==
 
== The Dollar Sign $ ==
  
This symbol have different meanings on different contexts too.  
+
This symbol has different meanings on different contexts too.  
  
 
Used alone, it find paragraph breaks, but used with other characters it will find those characters at the end of the paragraph. For example, on the paragraph
 
Used alone, it find paragraph breaks, but used with other characters it will find those characters at the end of the paragraph. For example, on the paragraph
Line 118: Line 124:
  
 
will find the second instance of book, but not the first one.  
 
will find the second instance of book, but not the first one.  
On the Replace by box, the $ sign have a different meaning that we will see below.  
+
On the Replace by box, the $ sign has a different meaning, that we will see below.
  
== The Back Slash \ ==
+
== The Backslash \ ==
  
As mentioned before, the back slash can be used to scape special symbols in order to search for them, but it can also give a different meaning to normal characters. For example,
+
As mentioned before, the backslash can be used to escape special symbols in order to search for them, but it can also give a different meaning to normal characters. For example,
  
 
   \b
 
   \b
Line 141: Line 147:
  
 
will find any character inside a word.  
 
will find any character inside a word.  
 +
 +
  \W
 +
 +
(with a cap W) will find a character that is ''not'' a "word element" (a space, a question mark... etcetera)
  
 
   \t
 
   \t
Line 150: Line 160:
 
will ''find'' a line break, but on the replace box will insert a paragraph break (the same you ''find'' with $).  
 
will ''find'' a line break, but on the replace box will insert a paragraph break (the same you ''find'' with $).  
  
The combination \u followed by an exadecimal number can be used to search for specific characters with its unicode code. For example
+
The combination \u followed by an hexadecimal number can be used to search for specific characters with its Unicode code point. For example
  
 
   \u03b4
 
   \u03b4
Line 170: Line 180:
 
   (expression1)(expression2)
 
   (expression1)(expression2)
  
gives the possibility to ''call'' each expression later on the RegExp formula with a back slash followed by a number: \1 will call the first expression, \2 the second, etcetera.  
+
gives the possibility to ''call'' each expression later on the RegExp formula with a backslash followed by a number: \1 will call the first expression, \2 the second, etcetera.  
  
 
For example, the regular expression
 
For example, the regular expression
Line 178: Line 188:
 
(note the space between (w+) and the *) can be used to find repeated words: both \b expressions find a word boundary, \w+ find one or more word elements while the \1 call the same elements found on with the group (\w+).
 
(note the space between (w+) and the *) can be used to find repeated words: both \b expressions find a word boundary, \w+ find one or more word elements while the \1 call the same elements found on with the group (\w+).
  
In order to call the text found by the group on the Replace by box, you need to use the dollar sign followed by the expression number instead of the back slash.  
+
{{Note|An alternative that also searches for duplicated words with a punctuation sign between them could be
 +
 
 +
  (\b\w+\b)\W+\1\b
 +
 
 +
or
 +
 
 +
  \b(\w+)[^[:alpha:]]*\1\b
 +
 
 +
Note that the first expression will fail on AOO 3.4.1 because of a bug solved on 4.0.}}
 +
 
 +
In order to call the text found by the group on the Replace by box, you need to use the dollar sign followed by the expression number instead of the backslash.  
  
 
For example, searching by  
 
For example, searching by  
Line 188: Line 208:
 
   $1  
 
   $1  
  
will eliminate any duplicated word on the text.  
+
will eliminate any duplicated word on the text.
  
 
= Other Expressions =
 
= Other Expressions =
Line 210: Line 230:
 
   [:space:]
 
   [:space:]
  
will find any kind of space, even non breaking ones.  
+
will find any kind of space, even non-breaking ones.  
  
 
It is possible to use the dash together with the square brackets to set ranges. For example,  
 
It is possible to use the dash together with the square brackets to set ranges. For example,  
Line 226: Line 246:
 
= Limits on the Use of Regular Expressions =
 
= Limits on the Use of Regular Expressions =
  
RegExp can only search ''inside'' a paragraph: you cannot use them for example to find two paragraph looking for a particular end on the first one and a particular beginning on the second one.  
+
RegExp can only search ''inside'' a paragraph: you cannot use them for example to find two paragraphs looking for a particular end on the first one and a particular beginning on the second one.  
  
 
Paragraph marks cannot be found in combination with other text. For example, on a paragraph like
 
Paragraph marks cannot be found in combination with other text. For example, on a paragraph like
Line 238: Line 258:
 
will only find final dot, ''not the paragraph break''.  
 
will only find final dot, ''not the paragraph break''.  
  
Beside the already indicated expressions (&, \n, the call for groups with the back slash followed by a number), the "Replace by" box do not accept regular expression.  
+
Beside the already indicated expressions (&, \n, the call for groups with the dollar sign followed by a number), the "Replace by" box does not accept regular expression.
  
 
= Regular Expressions on Writer =  
 
= Regular Expressions on Writer =  
  
TODO
+
On Writer, going to {{Menu|Edit|Find & Replace}} will open the Find & Replace menu. There, with a click on {{Button|More Options}} you'll find a checkbox to enable the RegExp tool:
 +
 
 +
[[File:AOO-RegExpWriter.png]]
 +
 
 +
As you can see from the screenshot, it is possible to combine RegExp with [[Documentation/UserGuide/SearchReplace#More_options|other options]] like {{Button|Format}}.
  
 
= Regular Expressions on Calc =  
 
= Regular Expressions on Calc =  
Line 262: Line 286:
 
= More information =
 
= More information =
  
http://userguide.icu-project.org/strings/regexptex
+
https://unicode-org.github.io/icu/userguide/
  
 
[[Category:Documentation]]
 
[[Category:Documentation]]

Latest revision as of 12:08, 2 October 2021

This page is in a DRAFT stage.



Introduction

We already mentioned what Regular Expressions are. With Regular Expressions (from now on, RegExp), you can "match" text strings that are not fixed. For example, the expression

  [:number:] 

will match any digit,

  [:number:]+ 

will match an arbitrary integer and

  [:number:]+\.?[:number:]* 

will match an arbitrary number that may (or may not) have a decimal part separated by a dot.
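
The three expressions above can be tried outside the office suite. The following is a minimal sketch in Python, whose `re` module is not the ICU engine used by Apache OpenOffice: it has no `[:number:]` shorthand, so `[0-9]` is assumed here as a stand-in. The helper name `matches_whole` is made up for this illustration.

```python
import re

# Assumption: [0-9] stands in for the [:number:] shorthand,
# which Python's re module does not provide.
DIGIT = r"[0-9]"
INTEGER = r"[0-9]+"
DECIMAL = r"[0-9]+\.?[0-9]*"


def matches_whole(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(matches_whole(DIGIT, "7"))       # True: a single digit
print(matches_whole(INTEGER, "2021"))  # True: one or more digits
print(matches_whole(DECIMAL, "3.14"))  # True: digits, a dot, more digits
print(matches_whole(DECIMAL, "42"))    # True: the decimal part is optional
print(matches_whole(DECIMAL, ".5"))    # False: at least one leading digit is required
```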

Regular expressions are case-insensitive by default, that is, they match upper and lower case letters alike. You enable case-sensitivity with the "Match case" check box.

As these examples show, several characters have a particular meaning in RegExp. This article describes the main characteristics of the ICU based RegExp engine used by Apache OpenOffice.

Special Characters

As previously mentioned, several characters have a particular meaning inside regular expressions. These characters include all kinds of brackets, the backslash, the dot, the dollar sign, and others. In order to search for those special characters we need to "escape" them with a backslash: for example, to search for [ you need to write

  \[

and of course, in order to search for the backslash, two backslashes must be written

  \\
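
Escaping works the same way in most regular expression engines. The sketch below uses Python's `re` module (an assumption for illustration, not the ICU engine of Apache OpenOffice) to search for a literal bracket and a literal backslash. `re.escape` is a Python convenience that adds the backslashes for you; the Find & Replace dialog has no equivalent.

```python
import re

# A raw string keeps the single backslash in the sample text as typed.
text = r"Use [brackets] and a \ backslash"

# \[ matches a literal opening bracket.
brackets = re.findall(r"\[", text)

# \\ matches a single literal backslash.
backslashes = re.findall(r"\\", text)

# re.escape puts a backslash in front of every special character.
escaped = re.escape("[a.b]")

print(len(brackets), len(backslashes))
print(escaped)
```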

The Dot

The dot matches any single character.

  d.t

will match dat, dot, dut... even dXt.

The expression

  [:any:]

is equivalent to the dot.
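
As a quick check of the dot, here is a small Python sketch (Python's `re` module is assumed as a stand-in for the ICU engine; the candidate words are made up). It shows that the dot stands for exactly one character, no fewer and no more.

```python
import re

candidates = ["dat", "dot", "dut", "dXt", "dt", "doot"]

# fullmatch requires the whole word to fit the pattern d.t
matched = [word for word in candidates if re.fullmatch(r"d.t", word)]

print(matched)  # ['dat', 'dot', 'dut', 'dXt']
```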

Question Mark

Used to match zero or one instances of the previous character. For example

  math?

will match either mat or math.

The Plus

Used to match one or more instances of the previous character. For example

  math+

will match math, mathh, mathhh...

The Asterisk

Used to match an arbitrary number of instances of the previous character, even zero:

  math*

will match mat, math, mathhhhhh...
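
The question mark, the plus and the asterisk differ only in how many repetitions of the previous character they allow. The sketch below compares them side by side in Python (an assumption for illustration; these three quantifiers behave the same way in the ICU engine). The helper `whole_matches` is a made-up name.

```python
import re

words = ["mat", "math", "mathh", "mathhh"]


def whole_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [word for word in candidates if re.fullmatch(pattern, word)]


optional = whole_matches(r"math?", words)      # zero or one h
one_or_more = whole_matches(r"math+", words)   # at least one h
zero_or_more = whole_matches(r"math*", words)  # any number of h, even none

print(optional)      # ['mat', 'math']
print(one_or_more)   # ['math', 'mathh', 'mathhh']
print(zero_or_more)  # ['mat', 'math', 'mathh', 'mathhh']
```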

Curly Bracket

Curly brackets set the number of times the previous character must be repeated. For example

  ou{2,5}ch! 

will match ouuch!, ouuuch!, ouuuuch!, ouuuuuch!, but not ouch!

Using only one number will match exactly that number of repetitions:

  ou{3}ch!

will only find ouuuch!, while

  ou{3,}ch! 

will find the word with at least three u.
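
The three forms of the curly bracket quantifier can be compared by counting how many u each one accepts. This is a minimal Python sketch (Python's `re` is assumed as a stand-in for the ICU engine; the helper `accepted_u_counts` is a made-up name).

```python
import re


def accepted_u_counts(pattern):
    """Return the numbers of u (from 1 to 7) for which 'o' + u... + 'ch!' matches."""
    return [n for n in range(1, 8) if re.fullmatch(pattern, "o" + "u" * n + "ch!")]


print(accepted_u_counts(r"ou{2,5}ch!"))  # [2, 3, 4, 5]
print(accepted_u_counts(r"ou{3}ch!"))    # [3]
print(accepted_u_counts(r"ou{3,}ch!"))   # [3, 4, 5, 6, 7]
```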

Circumflex

The ^ sign has several meanings depending on the context. On its own, it anchors a match to the beginning of a paragraph. For example, in a paragraph like

  The text on the book

the expression

  ^the

will match the first The, but not the second.

Inside square brackets, a leading ^ negates the listed characters. For example,

  [^a]

will match any character that is not an a, while

  [^ar] 

will match a character that is neither an a nor an r.
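
Both uses of the circumflex can be sketched in Python. Two assumptions apply: Python's `re` module stands in for the ICU engine, and the `re.IGNORECASE` flag imitates the case-insensitive default of the Find & Replace dialog, since Python is case-sensitive unless told otherwise. The sample words "banana" and "radar" are made up.

```python
import re

paragraph = "The text on the book"

# ^ anchors the match at the start, so only the first "The" is found.
starts = [m.start() for m in re.finditer(r"^the", paragraph, re.IGNORECASE)]

# Inside square brackets, a leading ^ negates the set.
not_a = re.findall(r"[^a]", "banana")
not_a_or_r = re.findall(r"[^ar]", "radar")

print(starts)      # [0]
print(not_a)       # ['b', 'n', 'n']
print(not_a_or_r)  # ['d']
```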

The Dollar Sign $

This symbol also has different meanings in different contexts.

Used alone, it finds paragraph breaks, but used after other characters it finds those characters at the end of the paragraph. For example, in the paragraph

  Your book is under this book

the expression

  book$

will find the second instance of book, but not the first one. In the Replace by box, the $ sign has a different meaning, which we will see below.
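
The end-of-paragraph anchor can be checked with a short Python sketch. Python's `re` module is assumed here as a stand-in for the ICU engine, and a plain string plays the role of the paragraph.

```python
import re

paragraph = "Your book is under this book"

# book$ only matches where "book" is followed by the end of the text.
positions = [m.start() for m in re.finditer(r"book$", paragraph)]

# For comparison: the plain word is found twice.
all_positions = [m.start() for m in re.finditer(r"book", paragraph)]

print(positions)
print(all_positions)
```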

The Backslash \

As mentioned before, the backslash can be used to escape special symbols in order to search for them, but it can also give a different meaning to normal characters. For example,

  \b

will search for a word boundary:

  \bjus

will match the first three characters of just, justice, justly... but not adjust, while

  mon\b

will find the last three characters of common, summon... but not monitor.

The expression

  \w 

will find any character inside a word.

  \W

(with a capital W) will find a character that is not a "word element" (a space, a question mark, and so on).
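
The word boundary and the word and non-word classes can be tried in Python, whose `re` module is assumed here as a stand-in for the ICU engine. The sample sentence "Hi, you?" is made up.

```python
import re

words = ["just", "justice", "adjust", "common", "summon", "monitor"]

# \bjus needs a word boundary right before "jus".
starts_with_jus = [w for w in words if re.search(r"\bjus", w)]

# mon\b needs a word boundary right after "mon".
ends_with_mon = [w for w in words if re.search(r"mon\b", w)]

# \w picks the word characters, \W everything else.
word_chars = re.findall(r"\w", "Hi, you?")
non_word_chars = re.findall(r"\W", "Hi, you?")

print(starts_with_jus)  # ['just', 'justice']
print(ends_with_mon)    # ['common', 'summon']
print(word_chars)       # ['H', 'i', 'y', 'o', 'u']
print(non_word_chars)   # [',', ' ', '?']
```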

  \t

will find a tab stop.

  \n

will find a line break, but in the Replace by box it will insert a paragraph break (the same one you find with $).

The combination \u followed by a hexadecimal number can be used to search for a specific character by its Unicode code point. For example

  \u03b4

will find the Greek character delta (δ).
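
Python's `re` module accepts the same four-digit \u notation in its patterns, so the delta example can be sketched as follows (Python is an assumption here, not the ICU engine; the sample sentence is made up).

```python
import re

text = "The symbol δ denotes a small change"

# \u03b4 is the code point of the Greek small letter delta.
match = re.search(r"\u03b4", text)

print(match.group())  # δ
print(match.start())
```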

The Vertical Bar |

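This can be used to set alternatives. For example

  mo(t|r)e

will find either mote or more. This can be used in expressions with more options, like (a|b|c). Note that the alternatives are grouped with parentheses: inside square brackets the vertical bar is an ordinary character, so [t|r] would also match a literal |. For single characters, [tr] is an equivalent and shorter alternative.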
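
The sketch below, in Python (assumed as a stand-in for the ICU engine), compares alternatives written inside parentheses with the same characters written inside square brackets. Both find mote and more, but inside square brackets the vertical bar counts as one more character of the set. The candidate words are made up.

```python
import re

candidates = ["mote", "more", "mole", "mo|e"]

# Parentheses group the alternatives t and r.
grouped = [w for w in candidates if re.fullmatch(r"mo(t|r)e", w)]

# Square brackets list single characters; the | is taken literally.
bracketed = [w for w in candidates if re.fullmatch(r"mo[t|r]e", w)]

print(grouped)    # ['mote', 'more']
print(bracketed)  # ['mote', 'more', 'mo|e']
```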

Groups and References

Grouping expressions with parentheses, like

  (expression1)(expression2)

makes it possible to refer back to each group later in the same RegExp with a backslash followed by a number: \1 refers to the text matched by the first group, \2 to the second, and so on.

For example, the regular expression

  \b(\w+) *\1\b

(note the space between (\w+) and the *) can be used to find repeated words: both \b expressions find a word boundary, \w+ finds one or more word elements, and \1 matches the same text that was found by the group (\w+).

Note: An alternative that also finds duplicated words with a punctuation sign between them could be

  (\b\w+\b)\W+\1\b

or

  \b(\w+)[^[:alpha:]]*\1\b

Note that the first expression will fail on AOO 3.4.1 because of a bug that was fixed in 4.0.

To insert the text found by a group in the Replace by box, you need to use the dollar sign followed by the group number instead of the backslash.

For example, searching for

  \b(\w+) *\1\b

and replacing with

  $1 

will eliminate any duplicated word in the text.
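
The same search and replace can be sketched in Python. Two differences from the Find & Replace dialog are assumed here: Python's `re` module is not the ICU engine, and it writes the group reference in the replacement as \1 where the Replace by box uses $1. The sample sentence is made up.

```python
import re

text = "This is is a test test of the the tool"

# Search pattern from the text: a word, optional spaces, then the same word again.
pattern = r"\b(\w+) *\1\b"

# Assumption: \1 in Python's replacement plays the role of $1 in the Replace by box.
cleaned = re.sub(pattern, r"\1", text)

print(cleaned)  # This is a test of the tool
```

The "is" inside "This" is left alone because the leading \b requires the match to start at a word boundary.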

Other Expressions

  [:alpha:]

will find any ASCII letter, but neither accented characters nor digits; for those you need to use

  [:any:]

or the dot, as seen before.

  [:number:]

or

  [:digit:]

will find a single digit.

  [:space:]

will find any kind of space, even non-breaking ones.

It is possible to use the dash together with the square brackets to set ranges. For example,

  [0-8]

will find any digit but 9.
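
Ranges and the space class can be tried in Python as well. Python's `re` module is assumed as a stand-in for the ICU engine, and since it does not provide the `[:space:]` shorthand, the roughly comparable `\s` is used instead. The sample strings are made up.

```python
import re

# [0-8] accepts every digit from 0 to 8, so the 9s are skipped.
digits_but_nine = re.findall(r"[0-8]", "2019-08-09")

# Assumption: \s stands in for [:space:]; it also matches the
# non-breaking space U+00A0 in Python string patterns.
spaces = re.findall(r"\s", "a b\u00a0c")

print(digits_but_nine)  # ['2', '0', '1', '0', '8', '0']
print(len(spaces))      # 2
```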

In the Replace by box,

  &

will insert the whole string matched by the search expression.
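
Python has no & shorthand in replacements; its closest equivalent is `\g<0>`, which also stands for the whole match. The following minimal sketch (Python assumed for illustration, with `[0-9]+` as a made-up search expression) wraps every number found in square brackets.

```python
import re

text = "price: 42, discount: 7"

# Assumption: \g<0> plays the role of & in the Replace by box.
wrapped = re.sub(r"[0-9]+", r"[\g<0>]", text)

print(wrapped)  # price: [42], discount: [7]
```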

Limits on the Use of Regular Expressions

RegExp can only search inside a paragraph: for example, you cannot use a single expression to match across two paragraphs by looking for a particular ending on the first one and a particular beginning on the second one.

Paragraph marks cannot be found in combination with other text. For example, on a paragraph like

 A short example. This is the end of the text.

the expression

 \.$

will only find the final dot, not the paragraph break.
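
A comparable behaviour can be observed in Python, where $ is a zero-width anchor: it checks a position but adds nothing to the matched text. This sketch only assumes Python's `re` module and a plain string in place of the paragraph; how Apache OpenOffice treats the paragraph break internally is not shown here.

```python
import re

paragraph = "A short example. This is the end of the text."

match = re.search(r"\.$", paragraph)

# The matched text is only the final dot; the anchor itself is empty.
print(repr(match.group()))  # '.'
print(match.end() == len(paragraph))  # True
```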

Besides the expressions already indicated (&, \n, and the references to groups with the dollar sign followed by a number), the "Replace by" box does not accept regular expressions.

Regular Expressions on Writer

In Writer, going to Edit → Find & Replace will open the Find & Replace dialog. There, with a click on More Options, you'll find a checkbox to enable regular expressions:

AOO-RegExpWriter.png

As you can see from the screenshot, it is possible to combine RegExp with other options like Format.

Regular Expressions on Calc

TODO

Search and Replace with Search Dialogue

TODO

Regular Expressions on Calc Formulas

TODO

Regular Expressions on Impress

TODO

More information

https://unicode-org.github.io/icu/userguide/
