Talk:Documentation/How Tos/Regular Expressions in Writer
- 1 References
- 2 Examples
- 3 Workarounds
- 4 Versioning of regex howto in future
- 5 Backreferences
- 6 External Links
- 7 OOo 3.0 Help links
- 8 ICU
- 9 Special characters, pairs, & hexadecimal codes inside square brackets
- 10 Treatment of hard paragraph marks & soft hard line breaks in the text body
- 11 Lazy matching
- 12 supported subset
- 13 Improved RegEx for matching duplicate words
Some of the matters arising with regex can be found in the OOo archives:
http://qa.openoffice.org/issues/show_bug.cgi?id=15666 (now RESOLVED FIXED)
This is probably too arcane, but here's a discussion on the black art of finding octal, decimal, & hexadecimal numbers in Writer http://www.oooforum.org/forum/viewtopic.phtml?t=66319
Octal: \<0[0-7]*\>
Decimal: \<[1-9][0-9]*\>
Hex: \<0x[A-Fa-f0-9]+\>
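For anyone who wants to try these three patterns outside Writer, here is a minimal sketch in Python. It assumes that Python's `\b` (word boundary) is an acceptable stand-in for `\<` and `\>`, which Python's `re` module does not have. The sample string is made up for illustration.

```python
import re

# Python has no \< and \> anchors; \b (word boundary) stands in for both here.
OCTAL = re.compile(r"\b0[0-7]*\b")
DECIMAL = re.compile(r"\b[1-9][0-9]*\b")
HEX = re.compile(r"\b0x[A-Fa-f0-9]+\b")

# Made-up sample text, not taken from the forum thread.
sample = "perms 0755, count 42, mask 0x1F, zero 0"

octal_hits = OCTAL.findall(sample)      # ['0755', '0']
decimal_hits = DECIMAL.findall(sample)  # ['42']
hex_hits = HEX.findall(sample)          # ['0x1F']

print(octal_hits, decimal_hits, hex_hits)
```

Note that the octal pattern also accepts a lone `0`, and that the word boundaries stop the decimal pattern from matching the `755` inside `0755`.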
In at least some versions of Linux it is possible to use Unicode input to type a newline (line feed / soft line break / U+000A) directly into the "Replace with" input box. There are no reports so far of this working on any other OS.
This means that for some people it is possible to insert a newline using Find & Replace.
Versioning of regex howto in future
See my comment under the same heading in Talk:Documentation/How Tos/Regular Expressions in Calc --Hgreenhough 12:05, 13 December 2007 (CET)
Thanks - I responded there --Drking 10:00, 18 January 2008 (GMT)
http://www.openoffice.org/issues/show_bug.cgi?id=15666#desc100 describes a glitch in the new backreference feature. It does not seem to have been reported as a separate issue, so it may not get picked up. One to test for!
Capitalize words beginning with h:
s/\<h([a-z]+)/
r/H$1/
Match case = Yes
Starting text: He heard quiet steps behind him.
Expected result: He Heard quiet steps behind Him.
Actual result: He H$1 quiet steps behind H$1
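As a cross-check of the expected result, the same substitution can be sketched in Python. This is only a comparison with another engine, not a statement about how OOo behaves: Python writes the backreference as `\1` rather than `$1`, and `\b` is assumed here as a substitute for `\<`.

```python
import re

start = "He heard quiet steps behind him."

# Case-sensitive by default, so the capital H in "He" is left alone.
# \b replaces \< and \1 replaces $1 (Python syntax).
result = re.sub(r"\bh([a-z]+)", r"H\1", start)

print(result)  # He Heard quiet steps behind Him.
```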
http://qa.openoffice.org/issues/show_bug.cgi?id=84922 describes a situation where backreferences do not work in find, although I can't follow it myself.
I've commented in issue 84922 - I don't think this is a bug in OOo
--Drking 7:00, 26 January 2008 (GMT)
Someone (Andrewz) has added a couple of external links. Absolutely great that people are getting involved, but I'm not sure that external links are a good idea - thoughts welcome...
The application Help is planned to be wiki based in the future, so that people like us can contribute easily. This page links from the Calc function Help (in preparation) and probably will link from Writer Help in the long term. It may or may not be included in Help, depending on space. I hope at the very least that it will be brought up in a browser on clicking in the Help.
As it stands clicking on the external links in this page takes the user away from the Help system and the Wiki altogether - the only way back is via the back button. That's a bad thing.
Another thing is that in my view the purpose of this HowTo is to give the user the information - not to present him with stuff within which information can be found. So if the info on the linked pages is useful it should be in the HowTo.
And a third thing is that there are plenty of external pages that could be linked to - everyone has their own favourite - but this shouldn't be a directory for them.
Interested to know any other views....
--Drking 05:00, 23 January 2008 (GMT)
Agreed. 'Concise, precise, complete', is best. I think the links would be better here in Talk, under a heading such as 'External Links', or 'Further Reading', so they could be used for research by editors (and anyone else interested).
--Hgreenhough 11:03, 23 January 2008 (CET)
Thanks - can't tackle it right now, but doubtless one of us will...
--Drking 8:45, 24 January 2008 (GMT)
I've changed my mind. How about an 'External Links' section at the end of the article, like on Wikipedia? That way they're easy to find (very few visitors will ever look at the discussion), all together (for easier maintenance), but clearly separate.
--Hgreenhough 13:52, 25 January 2008 (CET)
Still thinking about it - not quite convinced - at the moment our content is (I hope) pretty definitive. External pages might not be (for instance, although the Andrewz links are pretty good, there are some things that I'd take issue with, like the e-address regex needing to be case-insensitive). Perhaps if our 'External Links' section made it clear they were simply additional pages, and took you away from the wiki?
I cannot find a way to open an external link in a new browser - which really ought to be possible. Any ideas?
--Drking 7:00, 25 January 2008 (GMT)
Is OpenOffice.org not as much a community as it is a product? While a link dump is bad, a moderated selection of links may benefit the end user. Wikipedia, which has an offline edition, includes a few links to external sites. Also, for what it's worth, even Microsoft Office's built-in help includes links.
Taking the argument further and generalizing it to the whole Documentation section of the wiki: if you remove all links, what else would you remove now or prevent from being included in the future?
Drking: about the email address regex. OOo does case insensitive matching by default. You suggest I change a-z to a-zA-Z?
--Andrewz 01:43, 28 January 2008 (CET)
I intended to contact you directly to alert you to this thread - my apologies.
> Is OpenOffice.org not as much a community as it is a product?
I think it aims to be a product, supported by the community. I don't think it exists to create and nurture a community.
> what else would you remove now or prevent from being included in the future?
That allows quite a good illustration: the unstructured community approach has allowed either 4 or 5 Calc FAQs to be written; you can reach them all via different routes from the Doc front page. None of them are complete. I think all of them are out of date. A couple of them as I recall are un-indexed. A real mess. More is not better. But of course a lot of worthy people poured effort into writing them, and they did it because they enjoyed contributing.
My view is that good documentation has to be absolutely focused on the user - not the person writing the documentation as it often has been. It is a pernickety business; I go back through my stuff and reduce the word count / clarify if I possibly can, always reading it as a user. (You can tell by the way I write here that this does not come easily;) ).
'Concise, precise, complete' says it better.
Actually, yourself and Hgreenhough have convinced me that we should have an external links section, so that it is clear which links are external. A user should expect a mid-article link to lead to a page in the same style, with the same style of information. Pernickety, but that makes the documentation good.
Now, I've done a lot of work to get this page up and running, but I have no rights over content, and it's a thrill to see other contributors taking an interest. Do we now have a consensus that we introduce an external links section?
> email regex ... You suggest I change a-z to a-zA-Z
Yes that would be my take, for the info to be 'complete'. Or say "turn on case sensitive".
I'd also allow for the .museum domain - I think the limit on domain size is really intended for data entry, not searching within a document.
And I'd also point out that it is a (good) practical but not a perfect solution - doesn't catch every e-address.
Bit pernickety, that...
--Drking 20:30, 28 January 2008 (GMT)
Well, I'm now agreed with Drking regarding external links, i.e. to be included, in a section at the end.
On the other subject... I can no longer see any blogspot pages because they are now blocked by Websense at our corporate firewall, so I can't refer to Andrewz's pages. But talk of .museum and postcodes reminded me it must be regarding the subject of this old thread:
I remember thinking there probably wasn't a perfect solution using regex.
--Hgreenhough 11:51, 29 January 2008 (CET)
I added the External Links section.
Re sorting by email addresses - an interesting challenge :). Defeated me so far. Probably one of those things that spreadsheets shouldn't be used for.
(edit) Incidentally, Andrewz, I'm afraid your email regex doesn't match 'firstname.lastname@example.org' - it finds 'email@example.com' only. OOo seems to treat '-' as a word boundary, although '+' is not treated as a word boundary. Hm.
--Drking 06:30, 02 February 2008 (GMT)
The email regex doesn't match 'firstname.lastname@example.org' because the '-' inside the square brackets was un-escaped. I also can't find a reason to include \< and \> in OOo. I've added my version to our tips and tricks list now.
--Drking 06:30, 07 February 2008 (GMT)
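To illustrate the points raised in this thread (hyphen handling inside square brackets, case-insensitive matching, and long domains such as .museum), here is a rough Python sketch. Both patterns and the sample address are hypothetical and made up for this example. Neither is the regex from the external page or the one added to the tips and tricks list, and like any short email regex they will not catch every valid address.

```python
import re

# Hypothetical patterns for illustration only.
# The first has no hyphen in either character class.
WITHOUT_HYPHEN = re.compile(r"[a-z0-9._]+@[a-z0-9.]+\.[a-z]{2,}", re.IGNORECASE)
# The second adds an escaped hyphen; {2,} leaves room for long domains.
WITH_HYPHEN = re.compile(r"[a-z0-9._\-]+@[a-z0-9.\-]+\.[a-z]{2,}", re.IGNORECASE)

# Made-up sample text.
text = "Write to Mary-Ann.Smith@example.org today"

partial = WITHOUT_HYPHEN.search(text).group()  # stops at the hyphen
full = WITH_HYPHEN.search(text).group()        # whole address

print(partial)  # Ann.Smith@example.org
print(full)     # Mary-Ann.Smith@example.org
```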
Flepennu has tried to add a link to French translations (so far unfinished) at the end of both the Writer & Calc How To. The links are not formed correctly and so do not appear on the article. Being as they are not external, should we rename the section 'Links' instead of 'External Links', or should the French translations be linked to elsewhere in the article?
On the subject of email addresses: http://www.regular-expressions.info/email.html
--Hgreenhough 10:43, 19 February 2008 (CET)
Just checked and the note about / link to the French translation appears on the left side of the page - 'in other languages'. That's rather nice :). As you say it hasn't got very far though. Anyway it means we don't need to consider renaming the section to 'Links' yet. Don't have a particular view on this actually, so feel free to choose...
I've just published an Arrays HowTo by the way - should you feel inclined to peruse/refine it...
--Drking 06:00, 18 March 2008 (GMT)
I see, the fr: prefix to the link is wiki-code. I have moved the link to the start of the source, for clarity. Likewise the Calc howto.
I couldn't do much on the arrays page, just a few links, since I'm not that hot on them. I have added a comment in the discussion there.
--Hgreenhough 11:44, 18 March 2008 (CET)
Issue 87670 calls for external links to be removed from online help - not directly relevant here, but worth considering in case content is reused.
--Hgreenhough 10:28, 1 April 2008 (CEST)
The application help of OOo 3.0 links to this Wiki page and the one for Calc. The links are below the list of regular expressions:
We would like to improve the application help by inserting many more links leading to Wiki pages.
(noting the above comment from 'Ufi')
Excellent! thank you Uwe
--Drking 8:45, 24 January 2008 (GMT)
Could a reference be added next to this statement, for authority and research?
The ICU regular expression package, a candidate to replace the existing OOo regular expression engine
The ICU question above was from Andrewz on 9/Feb/2008 (please could everyone use the 'sig+timestamp' button).
Do you mean where is it said that the ICU regex engine should replace the current one? If so, see: Regexp.
--Hgreenhough 10:04, 11 February 2008 (CET)
Special characters, pairs, & hexadecimal codes inside square brackets
... suggests ^[^\x0009] will select the first character of any paragraph not beginning with a tab, which indeed it seems to. I can't reconcile that behaviour with what is currently written in the 'Alternative matches' section of the How To. What is actually happening?
--Hgreenhough 13:04, 31 March 2008 (CEST)
For a start the posters haven't read the HowTo well enough: "For example [\t ] will match a backslash \, a 't' or a space - not a tab or a space."
So \t should not work inside [ ]. But neither should \x0009. I'll ask the guys at Sun whether this is reliable behaviour.
--drking 06:00, 07 April 2008 (GMT)
Yes, looks like this bit needs a re-write. I'd obviously thought that OOo did Posix behaviour here, but evidently not. Many thanks for finding it :). Interestingly [/x9] will find tabs (the ] seems to cut short the hexadecimal number), but in [t/] the ] is assumed escaped (doesn't cut short) so it finds nothing. Don't think that should go in though...
--drking 21:25, 07 April 2008 (GMT)
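For comparison, the two readings of `[\t ]` discussed above can be shown side by side in Python. This is only a sketch of how a Perl-style engine differs from the POSIX reading quoted from the HowTo. It says nothing definite about what the OOo engine does, and the sample text is invented.

```python
import re

# Made-up sample: a, TAB, b, space, c, backslash, d
text = "a\tb c\\d"

# Python's re module interprets \t inside brackets as a tab.
python_style = re.findall(r"[\t ]", text)   # ['\t', ' ']

# The POSIX-style reading of the same class: backslash, 't' or space.
# It has to be spelled out with a doubled backslash in Python.
posix_style = re.findall(r"[\\t ]", text)   # [' ', '\\']

# A two-digit hexadecimal escape inside brackets also finds the tab in Python.
tab_by_hex = re.findall(r"[\x09]", text)    # ['\t']

print(python_style, posix_style, tab_by_hex)
```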
hard paragraph marks & soft hard line breaks in the text body
- I think it is fairly well known that OpenOffice REs cannot search across a paragraph mark (hard line break), but I can't see where, if at all, this information appears in the How To. Soft hard line breaks (Shift+Enter in Writer) don't seem to be 'counted' by regular expressions: a.c does not find 'a followed by a soft hard line break followed by c'. It does find 'a followed by a tab followed by c', and also 'abc' of course. Also ^. selects the second character of a paragraph if the first is a soft line break. More confusingly, ^[^a] will select a soft hard line break at the beginning of a paragraph, and most confusingly of all, it will select both the soft hard line break and the following character if that is not a.
--Hgreenhough 14:23, 31 March 2008 (CEST)
In the "How regular expressions are applied in OpenOffice.org" section. I *think* it's clear - but you are a better judge than me, and if you missed it then perhaps it could be changed? Feel free...
>a.c does not find 'a followed by a soft hard line break followed by c'.
In the "Single character match . ? " section: "The dot '.' special character stands for any single character (except newline)." I think that's OK?
>Also ^. selects the second character of a paragraph if the first is a soft hard line break.
We have a terminology difference (soft v hard line break) but that aside:
"Positional match ^ $ \< \> " section
"In addition a hard line break (entered by Shift-Enter) is considered the beginning / end of text, and will allow a ^ or $ match."
So I think that's OK?
I think this is OK - it's simply the regex engine doing what it's been told to - sometimes it considers newline a character and sometimes a start/end of text marker. And in the second case both I think ;). I won't defend the logic - it's an entirely mis-thought out concept and the sooner we get a new regex engine the better.
Maybe we need some more info in the Troubleshooting section to clarify this...?
--drking 22:00, 7 April 2008 (GMT)
(a) There it is! That's fine, I just wasn't seeing it.
(b) I remember now - "soft line break" is word wrapping applied by the display engine, and "hard line break" is Shift+Enter. My mistake (now strikethrough). I agree with your changes.
--Hgreenhough 11:37, 8 April 2008 (CEST)
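The dot and ^ behaviour discussed in this thread has a close parallel in other engines. Here is a small Python sketch, assuming that a plain `\n` is a fair stand-in for the Shift+Enter line break. That is an assumption made for illustration, not a description of Writer's internals.

```python
import re

# "\n" stands in for the Shift+Enter line break (an assumption).
broken = "a\nc"
tabbed = "a\tc"

dot_default = re.search(r"a.c", broken)         # None: '.' skips newline
dot_tab = re.search(r"a.c", tabbed)             # matches: '.' accepts a tab
dot_all = re.search(r"a.c", broken, re.DOTALL)  # matches once DOTALL is set

# With MULTILINE, ^ also matches just after a newline, so "^." picks the
# character following a leading line break rather than the break itself.
starts = re.findall(r"^.", "\nxy", re.MULTILINE)  # ['x']

print(dot_default, bool(dot_tab), bool(dot_all), starts)
```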
* and + match greedily by default. Is there a way to force them to match lazily (that is, matching as few characters as possible)? If not, is it on a wishlist somewhere? Curtmack 20:07, 1 April 2010 (UTC)
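To make the greedy versus lazy distinction concrete, here is a short Python sketch. The `*?` syntax shown is Python's. Whether Writer's engine accepts it is exactly the open question above, so the last pattern shows a common workaround that needs no lazy quantifier: a negated character class. The sample text is invented.

```python
import re

# Made-up sample text.
text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b> in the text.
greedy = re.findall(r"<b>.*</b>", text)

# Lazy: .*? stops at the first </b> it can.
lazy = re.findall(r"<b>.*?</b>", text)

# Workaround without lazy quantifiers: forbid '<' inside the match.
negated = re.findall(r"<b>[^<]*</b>", text)

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(negated)  # ['<b>one</b>', '<b>two</b>']
```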
I do miss a table or other kind of comparison with all those features which are not supported here, although they are very common in other implementations.
What comes to my mind are e.g. many escaped characters such as
- \b (and \B), where I do see \< and \> only for word boundaries
- \d for [:digit:]
- \w for [:word:]
- \s for [:space:]
- ... and always their negations, such as \D for [^0-9]
These are very common e.g. in perl, but available in many other regex engines as well.
And how about \r in OOo? There's just newline \n, but not the \r for carriage return?
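For readers unfamiliar with the Perl-style shorthands listed above, this Python sketch shows each one next to a longhand equivalent. It only illustrates what the shorthands mean in Python and makes no claim about what OOo supports. The `re.ASCII` flag is used so that `\d` and `\w` are limited to ASCII, which keeps them equivalent to the bracketed forms. The sample text is invented.

```python
import re

# Made-up sample text ending in carriage return + newline.
text = "Room 42,\tfloor 3\r\n"

digits_short = re.findall(r"\d+", text, re.ASCII)     # \d shorthand
digits_long = re.findall(r"[0-9]+", text)             # longhand equivalent
non_digit_runs = re.findall(r"\D+", "a1b22c")         # \D is the negation
words = re.findall(r"\b\w+\b", text, re.ASCII)        # \b word boundaries
whitespace = re.findall(r"\s", text)                  # space, tab, CR, LF
has_cr = re.search(r"\r", text) is not None           # \r finds the CR

print(digits_short, digits_long, non_digit_runs, words, whitespace, has_cr)
```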
Improved RegEx for matching duplicate words
I just posted an improved version of the RegEx for finding duplicate words in the forum: http://www.oooforum.org/forum/viewtopic.phtml?p=418058#418058 , including detailed description of what that RegEx is capable of and examples. Should the entry for it in the Tips and Tricks section also be updated? If you think so, feel free to copy it in here, including the description and examples. They are all public domain.
Greetings --Frog23 11:46, 13 March 2011 (UTC)
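As a general illustration of the duplicate-word technique, here is a minimal Python sketch built on a backreference. It is not the regex from the forum post, whose details are not reproduced here. The pattern, the sample sentence and the choice of case-insensitive matching are all assumptions made for this example.

```python
import re

# Hypothetical pattern: a word, then one or more repeats of the same word
# separated by whitespace. \1 refers back to the first captured word.
DUPLICATE = re.compile(r"\b(\w+)(\s+\1\b)+", re.IGNORECASE)

# Made-up sample text.
text = "This is is a test of the the the regex."

# Replace each run of repeated words with a single copy.
fixed = DUPLICATE.sub(r"\1", text)

print(fixed)  # This is a test of the regex.
```

The leading `\b` keeps the "is" inside "This" from pairing with the following "is", and the trailing `\b` prevents a match on a longer word that merely starts with the repeated one.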