The Structure of Text Documents
A text document can essentially contain four types of information:
- The actual text
- Templates for formatting characters, paragraphs, and pages
- Non-text elements such as tables, graphics and drawing objects
- Global settings for the text document
This section concentrates on the text and associated formatting options.
Paragraphs and Paragraph Portions
The core of a text document consists of a sequence of paragraphs. These are neither named nor indexed, so individual paragraphs cannot be accessed directly. They can, however, be traversed sequentially with the help of the Enumeration object described in Introduction to the API, which allows the paragraphs to be edited.
When working with the Enumeration object, note one special case: it returns not only paragraphs but also tables (strictly speaking, in Apache OpenOffice Writer, a table is a special type of paragraph). Before accessing a returned object, you should therefore check whether it supports the com.sun.star.text.Paragraph service for paragraphs or the com.sun.star.text.TextTable service for tables.
The following example traverses the contents of a text document in a loop and displays a message for each element, telling the user whether it is a paragraph or a table.
```basic
Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object

' Create document object
Doc = ThisComponent
' Create enumeration object
Enum = Doc.Text.createEnumeration
' loop over all text elements

While Enum.hasMoreElements
  TextElement = Enum.nextElement

  If TextElement.supportsService("com.sun.star.text.TextTable") Then
    MsgBox "The current block contains a table."
  End If

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    MsgBox "The current block contains a paragraph."
  End If
Wend
```
The example creates a Doc document object which references the current Apache OpenOffice document. With the aid of Doc, the example then creates an Enumeration object that traverses the individual parts of the text (paragraphs and tables) and assigns the current element to the TextElement object. The example uses the supportsService method to check whether TextElement is a paragraph or a table.
Paragraphs
The com.sun.star.text.Paragraph service grants access to the content of a paragraph. The text in the paragraph can be retrieved and modified using the String property:
```basic
Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object

Doc = ThisComponent
Enum = Doc.Text.createEnumeration

While Enum.hasMoreElements
  TextElement = Enum.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.String = Replace(TextElement.String, "you", "U")
    TextElement.String = Replace(TextElement.String, "too", "2")
    TextElement.String = Replace(TextElement.String, "for", "4")
  End If
Wend
```
The example takes the current text document and passes through it with the help of the Enumeration object. For every paragraph, it reads the text through the TextElement.String property and replaces the strings you, too and for with U, 2 and 4. The Replace function used here is not part of the standard language scope of Apache OpenOffice Basic; it is an instance of the example function described in Search and Replace.
There is no direct counterpart in Apache OpenOffice Basic for the Characters, Sentences and Words lists provided in VBA. You do, however, have the option of switching to a TextCursor which allows for navigation at the level of characters, sentences and words.
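As a rough illustration of word-level navigation, the following sketch creates a TextCursor at the start of the document and selects the first word. It assumes that the document starts with a word rather than with whitespace or a table, and that the cursor returned by createTextCursor supports the word navigation methods gotoEndOfWord and gotoNextWord.

```basic
Dim Doc As Object
Dim Cursor As Object

Doc = ThisComponent
' Create a cursor on the main text and move it to the start
Cursor = Doc.Text.createTextCursor()
Cursor.gotoStart(False)

' Expand the selection to the end of the first word (True = select)
Cursor.gotoEndOfWord(True)
MsgBox "First word: '" & Cursor.String & "'"

' Collapse the selection, jump to the next word and select it
Cursor.gotoNextWord(False)
Cursor.gotoEndOfWord(True)
MsgBox "Second word: '" & Cursor.String & "'"
```

The Boolean argument decides whether the cursor expands its selection (True) or simply moves (False).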
Paragraph Portions
The previous example may change the text as requested, but it may sometimes also destroy the formatting.
This is because a paragraph in turn consists of individual sub-objects. Each of these sub-objects contains its own formatting information. If the center of a paragraph, for example, contains a word printed in bold, then it will be represented in Apache OpenOffice by three paragraph portions: the portion before the bold type, then the word in bold, and finally the portion after the bold type, which is again depicted as normal.
If the text of the paragraph is now changed using the paragraph's String property, then Apache OpenOffice first deletes the old paragraph portions and then inserts a single new paragraph portion. The formatting of the previous portions is lost in the process.
To prevent this effect, the user can access the associated paragraph portions rather than the entire paragraph. Paragraphs provide their own Enumeration object for this purpose. The following example shows a double loop which passes over all paragraphs of a text document and the paragraph portions they contain and applies the replacement processes from the previous example:
```basic
Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object

Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration

' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all paragraph portions

    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
      MsgBox "'" & TextPortion.String & "'"
      TextPortion.String = Replace(TextPortion.String, "you", "U")
      TextPortion.String = Replace(TextPortion.String, "too", "2")
      TextPortion.String = Replace(TextPortion.String, "for", "4")
    Wend

  End If
Wend
```
The example runs through a text document in a double loop. The outer loop iterates over the paragraphs of the text. The inner loop processes the paragraph portions within each paragraph. The example code modifies the content of each paragraph portion using its String property, just as the previous example did for whole paragraphs. Since the paragraph portions are edited directly, however, their formatting information is retained when the string is replaced.
Formatting
There are various ways of formatting text. The easiest way is to assign the format properties directly to the text sequence. This is called direct formatting. Direct formatting is used in particular with short documents because the formats can be assigned by the user with the mouse. You can, for example, highlight a certain word within a text using bold type or center a line.
In addition to direct formatting, you can also format text using templates. This is called indirect formatting. With indirect formatting, the user assigns a pre-defined template to the relevant text portion. If the layout of the text is changed at a later date, the user only needs to change the template. Apache OpenOffice then changes the way in which all text portions which use this template are depicted.
The formatting properties can be found in each object (Paragraph, TextCursor, and so on) and can be applied directly.
Character Properties
Those format properties that refer to individual characters are described as character properties. These include bold type and the font type. Objects that allow character properties to be set have to support the com.sun.star.style.CharacterProperties service. Apache OpenOffice recognizes a whole range of services that support this service. These include the previously described com.sun.star.text.Paragraph services for paragraphs as well as the com.sun.star.text.TextPortion services for paragraph portions.
The com.sun.star.style.CharacterProperties service does not provide any interfaces, but instead offers a range of properties through which character properties can be defined and called. A complete list of all character properties can be found in the Apache OpenOffice API reference. The following list describes the most important properties:
- CharFontName (String): name of font type selected.
- CharColor (Long): text color.
- CharHeight (Float): character height in points (pt).
- CharUnderline (Constant group): type of underscore (constants in accordance with com.sun.star.awt.FontUnderline).
- CharWeight (Constant group): font weight (constants in accordance with com.sun.star.awt.FontWeight).
- CharBackColor (Long): background color.
- CharKeepTogether (Boolean): suppression of automatic line break.
- CharStyleName (String): name of character template.
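A minimal sketch of how some of these properties might be set directly: it formats the first paragraph of the current document as bold, red, underlined 14 pt text. The chosen values are illustrative only, and the sketch assumes the document contains at least one paragraph.

```basic
Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object

Doc = ThisComponent
Enum = Doc.Text.createEnumeration

' Format only the first paragraph found, then leave the loop
Do While Enum.hasMoreElements
  TextElement = Enum.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.CharWeight = com.sun.star.awt.FontWeight.BOLD
    TextElement.CharHeight = 14                  ' points
    TextElement.CharColor = RGB(255, 0, 0)       ' red
    TextElement.CharUnderline = com.sun.star.awt.FontUnderline.SINGLE
    Exit Do
  End If
Loop
```

Because the properties are assigned to the paragraph itself, this is direct formatting and applies to all characters of that paragraph.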
Paragraph Properties
Formatting information that does not refer to individual characters, but to the entire paragraph is considered to be a paragraph property. This includes the distance of the paragraph from the edge of the page as well as line spacing. The paragraph properties are available through the com.sun.star.style.ParagraphProperties service.
The paragraph properties are likewise available in various objects. All objects that support the com.sun.star.text.Paragraph service also provide support for the paragraph properties in com.sun.star.style.ParagraphProperties.
A complete list of the paragraph properties can be found in the Apache OpenOffice API reference. The most common paragraph properties are:
- ParaAdjust (enum): horizontal alignment of the paragraph text (constants in accordance with com.sun.star.style.ParagraphAdjust).
- ParaLineSpacing (struct): line spacing (structure in accordance with com.sun.star.style.LineSpacing).
- ParaBackColor (Long): background color.
- ParaLeftMargin (Long): left margin in 100ths of a millimeter.
- ParaRightMargin (Long): right margin in 100ths of a millimeter.
- ParaTopMargin (Long): top margin in 100ths of a millimeter.
- ParaBottomMargin (Long): bottom margin in 100ths of a millimeter.
- ParaTabStops (Array of struct): type and position of tabs (array with structures of the type com.sun.star.style.TabStop).
- ParaStyleName (String): name of the paragraph template.
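The following sketch shows one way these properties could be applied: it centers every paragraph of the current document, sets a left margin of 10 mm, and applies proportional line spacing of 150 percent. The values are illustrative assumptions, not recommendations.

```basic
Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object
Dim Spacing As New com.sun.star.style.LineSpacing

' Proportional line spacing: Height is a percentage in PROP mode
Spacing.Mode = com.sun.star.style.LineSpacingMode.PROP
Spacing.Height = 150

Doc = ThisComponent
Enum = Doc.Text.createEnumeration

While Enum.hasMoreElements
  TextElement = Enum.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.ParaAdjust = com.sun.star.style.ParagraphAdjust.CENTER
    TextElement.ParaLeftMargin = 1000     ' 1000 hundredths of a mm = 10 mm
    TextElement.ParaLineSpacing = Spacing
  End If
Wend
```

Struct-valued properties such as ParaLineSpacing are filled in on a local variable first and then assigned as a whole.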
Example: simple HTML export
The following example demonstrates how to work with formatting information. It iterates through a text document and creates a simple HTML file. Each paragraph is recorded in its own HTML element <P> for this purpose. Paragraph portions displayed in bold type are marked using a <B> HTML element when exporting.
```basic
Dim FileNo As Integer, Filename As String, CurLine As String
Dim Doc As Object
Dim Enum1 As Object, Enum2 As Object
Dim TextElement As Object, TextPortion As Object

Filename = "c:\text.html"
FileNo = Freefile
Open Filename For Output As #FileNo
Print #FileNo, "<HTML><BODY>"
Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration

' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    CurLine = "<P>"

    ' loop over all paragraph portions
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement

      If TextPortion.CharWeight = com.sun.star.awt.FontWeight.BOLD Then
        CurLine = CurLine & "<B>" & TextPortion.String & "</B>"
      Else
        CurLine = CurLine & TextPortion.String
      End If

    Wend

    ' output the line
    CurLine = CurLine & "</P>"
    Print #FileNo, CurLine
  End If

Wend

' write HTML footer
Print #FileNo, "</BODY></HTML>"
Close #FileNo
```
The basic structure of the example follows the earlier examples that run through the paragraph portions of a text. Two things have been added: the functions for writing the HTML file, and test code that checks the font weight of each paragraph portion and wraps portions in bold type in the corresponding HTML tag.
Default values for character and paragraph properties
Direct formatting always takes priority over indirect formatting. In other words, formatting using templates is assigned a lower priority than direct formatting in a text.
Establishing whether a section of a document has been directly or indirectly formatted is not easy. The toolbars provided by Apache OpenOffice show the common text properties such as font type, weight and size. However, they do not reveal whether the corresponding settings come from a template or from direct formatting in the text.
Apache OpenOffice Basic provides the getPropertyState method, with which programmers can check how a certain property was formatted. As a parameter, this takes the name of the property and returns a constant that provides information about the origin of the formatting. The following responses, which are defined in the com.sun.star.beans.PropertyState enumeration, are possible:
- com.sun.star.beans.PropertyState.DIRECT_VALUE: the property is defined directly in the text (direct formatting).
- com.sun.star.beans.PropertyState.DEFAULT_VALUE: the property is defined by a template (indirect formatting).
- com.sun.star.beans.PropertyState.AMBIGUOUS_VALUE: the property is unclear. This status arises, for example, when querying the bold type property of a paragraph, which includes both words depicted in bold and words depicted in normal font.
The following example shows how format properties can be edited in Apache OpenOffice. It searches through a text for paragraph portions which have been depicted as bold type using direct formatting. If it encounters a corresponding paragraph portion, it deletes the direct formatting using the setPropertyToDefault method and assigns a MyBold character template to the corresponding paragraph portion.
```basic
Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object

Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration

' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all paragraph portions

    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement

      If TextPortion.CharWeight = _
        com.sun.star.awt.FontWeight.BOLD AND _
        TextPortion.getPropertyState("CharWeight") = _
        com.sun.star.beans.PropertyState.DIRECT_VALUE Then
        TextPortion.setPropertyToDefault("CharWeight")
        TextPortion.CharStyleName = "MyBold"
      End If
    Wend
  End If
Wend
```
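To inspect the property states without changing the document, a sketch along the following lines could be used. It queries the state of CharWeight for each whole paragraph and reports which of the three PropertyState values was returned; the message texts are illustrative.

```basic
Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object
Dim State As Variant

Doc = ThisComponent
Enum = Doc.Text.createEnumeration

While Enum.hasMoreElements
  TextElement = Enum.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    State = TextElement.getPropertyState("CharWeight")

    If State = com.sun.star.beans.PropertyState.DIRECT_VALUE Then
      MsgBox "CharWeight: direct formatting"
    ElseIf State = com.sun.star.beans.PropertyState.AMBIGUOUS_VALUE Then
      MsgBox "CharWeight: mixed values within the paragraph"
    Else
      MsgBox "CharWeight: defined by a template"
    End If
  End If
Wend
```

A paragraph that mixes bold and normal words reports AMBIGUOUS_VALUE here, which is why the preceding example checks the individual paragraph portions instead.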
Content on this page is licensed under the Public Documentation License (PDL).