The Structure of Text Documents

From Apache OpenOffice Wiki


A text document can essentially contain four types of information:

  • The actual text
  • Templates for formatting characters, paragraphs, and pages
  • Non-text elements such as tables, graphics and drawing objects
  • Global settings for the text document

This section concentrates on the text and associated formatting options.

Paragraphs and Paragraph Portions

The core of a text document consists of a sequence of paragraphs. These are neither named nor indexed, so individual paragraphs cannot be accessed directly. The paragraphs can, however, be traversed sequentially with the help of the Enumeration object described in Introduction to the API, which allows them to be edited.

When working with the Enumeration object, note one special case: it returns not only paragraphs but also tables (strictly speaking, in Apache OpenOffice Writer, a table is a special type of paragraph). Before accessing a returned object, you should therefore check whether it supports the com.sun.star.text.Paragraph service for paragraphs or the com.sun.star.text.TextTable service for tables.

The following example traverses the contents of a text document in a loop and uses a message in each instance to inform the user whether the object in question is a paragraph or table.

Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object
 
' Create document object   
Doc = ThisComponent
' Create enumeration object 
Enum = Doc.Text.createEnumeration
' loop over all text elements
 
While Enum.hasMoreElements
  TextElement = Enum.nextElement
 
  If TextElement.supportsService("com.sun.star.text.TextTable") Then
    MsgBox "The current block contains a table."
  End If
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    MsgBox "The current block contains a paragraph."
  End If
 
Wend

The example creates a Doc document object which references the current Apache OpenOffice document. With the aid of Doc, the example then creates an Enumeration object that traverses the individual parts of the text (paragraphs and tables) and assigns the current element to the TextElement object. The example uses the supportsService method to check whether TextElement is a paragraph or a table.
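Because paragraphs cannot be addressed by number, a position-based lookup has to be built on top of the enumeration. The following is a minimal sketch of one way to do this. The function name GetParagraphByNumber is hypothetical and not part of the Apache OpenOffice API. It skips tables and returns Nothing if the document contains fewer paragraphs than requested.

```basic
' Hypothetical helper: returns the N-th paragraph (counting from 1) of a
' text document, skipping tables. Returns Nothing if there is no such paragraph.
Function GetParagraphByNumber(Doc As Object, N As Long) As Object
  Dim ParEnum As Object
  Dim TextElement As Object
  Dim Count As Long

  GetParagraphByNumber = Nothing
  ParEnum = Doc.Text.createEnumeration
  Count = 0

  While ParEnum.hasMoreElements
    TextElement = ParEnum.nextElement
    ' count paragraphs only, not tables
    If TextElement.supportsService("com.sun.star.text.Paragraph") Then
      Count = Count + 1
      If Count = N Then
        GetParagraphByNumber = TextElement
        Exit Function
      End If
    End If
  Wend
End Function
```

A call such as Par = GetParagraphByNumber(ThisComponent, 2) would then return the second paragraph. The result should be checked with IsNull before it is used. Each call traverses the document from the start, so for processing many paragraphs a single enumeration loop remains the better choice.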

Paragraphs

The com.sun.star.text.Paragraph service grants access to the content of a paragraph. The text in the paragraph can be retrieved and modified using the String property:

Dim Doc As Object
Dim Enum As Object
Dim TextElement As Object
 
Doc = ThisComponent
Enum = Doc.Text.createEnumeration
 
While Enum.hasMoreElements
  TextElement = Enum.nextElement
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.String = Replace(TextElement.String, "you", "U") 
    TextElement.String = Replace(TextElement.String, "too", "2")
    TextElement.String = Replace(TextElement.String, "for", "4") 
  End If
 
Wend

The example passes through the current text document with the help of the Enumeration object. For each paragraph, it reads the text through the TextElement.String property, replaces the strings you, too and for with U, 2 and 4, and writes the result back. The Replace function used here is not part of the standard Apache OpenOffice Basic language; it is the example function described in Search and Replace.

Note (VBA): The procedure described here for accessing the paragraphs of a text is comparable with the Paragraphs collection used in VBA, which is provided by the Range and Document objects available there. Whereas in VBA the paragraphs are accessed by their number (for example, by the Paragraphs(1) call), in Apache OpenOffice Basic, the Enumeration object described previously should be used.


There is no direct counterpart in Apache OpenOffice Basic for the Characters, Sentences and Words collections provided in VBA. You do, however, have the option of switching to a TextCursor, which allows navigation at the level of characters, sentences and words.
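As a rough illustration of word-level navigation, the following sketch creates a TextCursor at the start of the document and expands it over the first word. It assumes that the document begins with a paragraph containing text. The variable names are chosen for this sketch only.

```basic
Dim Doc As Object
Dim Cursor As Object

Doc = ThisComponent
' create a text cursor and move it to the start of the text
Cursor = Doc.Text.createTextCursor()
Cursor.gotoStart(False)
' expand the selection to the end of the first word
Cursor.gotoEndOfWord(True)
MsgBox "First word: '" & Cursor.String & "'"
```

The Boolean argument controls whether the cursor expands its selection (True) or simply moves (False).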

Paragraph Portions

The previous example may change the text as requested, but it may sometimes also destroy the formatting.

This is because a paragraph in turn consists of individual sub-objects, the paragraph portions. Each of these sub-objects contains its own formatting information. If the middle of a paragraph, for example, contains a word printed in bold, then the paragraph is represented in Apache OpenOffice by three paragraph portions: the portion before the bold type, then the word in bold, and finally the portion after the bold type, which is again formatted as normal text.

If the text of the paragraph is now changed using the paragraph's String property, then Apache OpenOffice first deletes the old paragraph portions and inserts a single new paragraph portion. The formatting of the previous portions is then lost.

To prevent this effect, the user can access the associated paragraph portions rather than the entire paragraph. Paragraphs provide their own Enumeration object for this purpose. The following example shows a double loop which passes over all paragraphs of a text document and the paragraph portions they contain and applies the replacement processes from the previous example:

Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object
 
Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration
 
' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all paragraph portions
 
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
      MsgBox "'" & TextPortion.String & "'"
      TextPortion.String = Replace(TextPortion.String, "you", "U") 
      TextPortion.String = Replace(TextPortion.String, "too", "2")
      TextPortion.String = Replace(TextPortion.String, "for", "4") 
    Wend
 
  End If
Wend

The example runs through a text document in a double loop. The outer loop refers to the paragraphs of the text. The inner loop processes the paragraph portions in these paragraphs. The example code modifies the content of each paragraph portion using its String property, as the previous example did for whole paragraphs. Because the paragraph portions are edited directly, however, their formatting information is retained when the string is replaced. Note that a search string that spans two portions (for example, a word that is only partly bold) is not found, because each portion is searched separately.
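Paragraph portions do not always contain plain text. They can also stand for elements such as text fields or footnote anchors. The following sketch shows one possible way to restrict processing to plain text, using the TextPortionType property of the com.sun.star.text.TextPortion service. It only displays the matching portions and leaves the document unchanged.

```basic
Dim Doc As Object
Dim Enum1 As Object, Enum2 As Object
Dim TextElement As Object, TextPortion As Object

Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration

' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement

  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration

    ' loop over all paragraph portions
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
      ' skip portions that are not plain text (fields, footnotes, and so on)
      If TextPortion.TextPortionType = "Text" Then
        MsgBox "Text portion: '" & TextPortion.String & "'"
      End If
    Wend

  End If
Wend
```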

Formatting

There are various ways of formatting text. The easiest way is to assign the format properties directly to the text sequence. This is called direct formatting. Direct formatting is used in particular with short documents because the formats can be assigned by the user with the mouse. You can, for example, highlight a certain word within a text using bold type or center a line.

In addition to direct formatting, you can also format text using templates (called styles in the Apache OpenOffice user interface). This is called indirect formatting. With indirect formatting, the user assigns a pre-defined template to the relevant text portion. If the layout of the text is changed at a later date, the user only needs to change the template. Apache OpenOffice then changes the appearance of all text portions that use this template.
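As a brief illustration of indirect formatting, the following sketch assigns a paragraph template to the first paragraph of the document through the ParaStyleName property. The template name "Heading 1" is an assumption. Any paragraph template that exists in the document could be used instead.

```basic
Dim Doc As Object
Dim ParEnum As Object
Dim TextElement As Object

Doc = ThisComponent
ParEnum = Doc.Text.createEnumeration

If ParEnum.hasMoreElements Then
  TextElement = ParEnum.nextElement
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    ' "Heading 1" is assumed to exist in the document
    TextElement.ParaStyleName = "Heading 1"
  End If
End If
```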

Note (VBA): In VBA, the formatting properties of an object are usually spread over a range of sub-objects (for example, Range.Font, Range.Borders, Range.Shading, Range.ParagraphFormat). The properties are accessed by means of cascading expressions (for example, Range.Font.AllCaps). In Apache OpenOffice Basic, on the other hand, the formatting properties are available directly on the relevant objects (TextCursor, Paragraph, and so on). You will find an overview of the character and paragraph properties available in Apache OpenOffice in the following two sections.


Note: The formatting properties can be found in each object (Paragraph, TextCursor, and so on) and can be applied directly.

Character Properties

Format properties that refer to individual characters are described as character properties. These include bold type and the font type. Objects that allow character properties to be set have to support the com.sun.star.style.CharacterProperties service. Apache OpenOffice provides a whole range of services that support this service. These include the previously described com.sun.star.text.Paragraph service for paragraphs as well as the com.sun.star.text.TextPortion service for paragraph portions.

The com.sun.star.style.CharacterProperties service does not provide any interfaces, but instead offers a range of properties through which character properties can be defined and called. A complete list of all character properties can be found in the Apache OpenOffice API reference. The following list describes the most important properties:

CharFontName (String)
name of the selected font.
CharColor (Long)
text color.
CharHeight (Float)
character height in points (pt).
CharUnderline (Constant group)
type of underline (constants in accordance with com.sun.star.awt.FontUnderline).
CharWeight (Constant group)
font weight (constants in accordance with com.sun.star.awt.FontWeight).
CharBackColor (Long)
background color.
CharKeepTogether (Boolean)
suppression of automatic line break.
CharStyleName (String)
name of character template.
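The following sketch shows how some of these properties might be set on the first paragraph of a document. The font name "Arial" and the chosen values are assumptions for illustration. Because the properties are set on the paragraph object, they apply to all characters of the paragraph.

```basic
Dim Doc As Object
Dim ParEnum As Object
Dim TextElement As Object

Doc = ThisComponent
ParEnum = Doc.Text.createEnumeration

If ParEnum.hasMoreElements Then
  TextElement = ParEnum.nextElement
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.CharFontName = "Arial"       ' assumed to be installed
    TextElement.CharHeight = 14              ' height in points
    TextElement.CharColor = RGB(255, 0, 0)   ' red
    TextElement.CharWeight = com.sun.star.awt.FontWeight.BOLD
    TextElement.CharUnderline = com.sun.star.awt.FontUnderline.SINGLE
  End If
End If
```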

Paragraph Properties

Formatting information that does not refer to individual characters, but to the entire paragraph is considered to be a paragraph property. This includes the distance of the paragraph from the edge of the page as well as line spacing. The paragraph properties are available through the com.sun.star.style.ParagraphProperties service.

The paragraph properties, too, are available in various objects. All objects that support the com.sun.star.text.Paragraph service also provide support for the paragraph properties in com.sun.star.style.ParagraphProperties.

A complete list of the paragraph properties can be found in the Apache OpenOffice API reference. The most common paragraph properties are:

ParaAdjust (enum)
horizontal alignment of the paragraph (constants in accordance with com.sun.star.style.ParagraphAdjust).
ParaLineSpacing (struct)
line spacing (structure in accordance with com.sun.star.style.LineSpacing).
ParaBackColor (Long)
background color.
ParaLeftMargin (Long)
left margin in 100ths of a millimeter.
ParaRightMargin (Long)
right margin in 100ths of a millimeter.
ParaTopMargin (Long)
top margin in 100ths of a millimeter.
ParaBottomMargin (Long)
bottom margin in 100ths of a millimeter.
ParaTabStops (Array of struct)
type and position of tabs (array with structures of the type com.sun.star.style.TabStop).
ParaStyleName (String)
name of the paragraph template.
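A short sketch of how these properties might be applied to the first paragraph of a document follows. The values are arbitrary examples. The line spacing is passed as a com.sun.star.style.LineSpacing structure, here in proportional mode with a height of 150 percent.

```basic
Dim Doc As Object
Dim ParEnum As Object
Dim TextElement As Object
Dim Spacing As New com.sun.star.style.LineSpacing

Doc = ThisComponent
ParEnum = Doc.Text.createEnumeration

' proportional line spacing of 150 percent
Spacing.Mode = com.sun.star.style.LineSpacingMode.PROP
Spacing.Height = 150

If ParEnum.hasMoreElements Then
  TextElement = ParEnum.nextElement
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    TextElement.ParaAdjust = com.sun.star.style.ParagraphAdjust.CENTER
    TextElement.ParaLeftMargin = 1000     ' 10 mm
    TextElement.ParaRightMargin = 1000    ' 10 mm
    TextElement.ParaBottomMargin = 500    ' 5 mm
    TextElement.ParaLineSpacing = Spacing
  End If
End If
```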

Example: simple HTML export

The following example demonstrates how to work with formatting information. It iterates through a text document and creates a simple HTML file. Each paragraph is recorded in its own HTML element <P> for this purpose. Paragraph portions displayed in bold type are marked using a <B> HTML element when exporting.

Dim FileNo As Integer, Filename As String, CurLine As String
Dim Doc As Object   
Dim Enum1 As Object, Enum2 As Object
Dim TextElement As Object, TextPortion As Object
 
Filename = "c:\text.html"
FileNo = Freefile
Open Filename For Output As #FileNo   
Print #FileNo, "<HTML><BODY>"
Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration
 
' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    CurLine = "<P>"
 
    ' loop over all paragraph portions
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
 
      If TextPortion.CharWeight = com.sun.star.awt.FontWeight.BOLD Then
        CurLine = CurLine & "<B>" & TextPortion.String & "</B>"
      Else
        CurLine = CurLine & TextPortion.String
      End If
 
    Wend
 
    ' output the line
    CurLine = CurLine & "</P>"
    Print #FileNo, CurLine
  End If
 
Wend
 
' write HTML footer 
Print #FileNo, "</BODY></HTML>"
Close #FileNo

The basic structure of the example follows the earlier examples for running through the paragraph portions of a text. The functions for writing the HTML file have been added, together with a test that checks the font weight of each paragraph portion and wraps portions in bold type in the corresponding HTML tag.
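The export writes the text of each portion unchanged, so characters such as < or & in the document would produce invalid HTML. One possible remedy is a small escaping function, sketched below. The name HtmlEscape is hypothetical. It is not part of Apache OpenOffice Basic and covers only the three characters shown.

```basic
' Hypothetical helper: replaces the characters &, < and > with HTML entities
Function HtmlEscape(Source As String) As String
  Dim Result As String
  Dim Ch As String
  Dim I As Long

  Result = ""
  ' copy the string character by character, substituting where needed
  For I = 1 To Len(Source)
    Ch = Mid(Source, I, 1)
    Select Case Ch
      Case "&"
        Result = Result & "&amp;"
      Case "<"
        Result = Result & "&lt;"
      Case ">"
        Result = Result & "&gt;"
      Case Else
        Result = Result & Ch
    End Select
  Next I

  HtmlEscape = Result
End Function
```

In the export loop, TextPortion.String would then be passed through HtmlEscape before being appended to CurLine.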

Default values for character and paragraph properties

Direct formatting always takes priority over indirect formatting. In other words, formatting using templates is assigned a lower priority than direct formatting in a text.

Establishing whether a section of a document has been directly or indirectly formatted is not easy. The toolbars provided by Apache OpenOffice show the common text properties such as font type, weight and size. However, they do not show whether the corresponding settings are based on a template or on direct formatting in the text.

Apache OpenOffice provides the getPropertyState method, with which programmers can check how a certain property was formatted. As a parameter, this takes the name of the property and returns a constant that provides information about the origin of the formatting. The following values, which are defined in the com.sun.star.beans.PropertyState enumeration, are possible:

com.sun.star.beans.PropertyState.DIRECT_VALUE
the property is defined directly in the text (direct formatting)
com.sun.star.beans.PropertyState.DEFAULT_VALUE
the property is defined by a template (indirect formatting)
com.sun.star.beans.PropertyState.AMBIGUOUS_VALUE
the state of the property is ambiguous. This status arises, for example, when querying the bold type property of a paragraph that contains both words in bold and words in normal font.
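As a simple illustration of these three states, the following sketch queries the state of the CharWeight property for the first paragraph of a document and reports the result. It does not change the document. The variable names are chosen for this sketch only.

```basic
Dim Doc As Object
Dim ParEnum As Object
Dim TextElement As Object
Dim State As Variant

Doc = ThisComponent
ParEnum = Doc.Text.createEnumeration

If ParEnum.hasMoreElements Then
  TextElement = ParEnum.nextElement
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    ' ask where the font weight of this paragraph comes from
    State = TextElement.getPropertyState("CharWeight")

    If State = com.sun.star.beans.PropertyState.DIRECT_VALUE Then
      MsgBox "CharWeight is set by direct formatting."
    ElseIf State = com.sun.star.beans.PropertyState.DEFAULT_VALUE Then
      MsgBox "CharWeight is defined by a template."
    ElseIf State = com.sun.star.beans.PropertyState.AMBIGUOUS_VALUE Then
      MsgBox "CharWeight differs within the paragraph."
    End If
  End If
End If
```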

The following example shows how format properties can be edited in Apache OpenOffice. It searches through a text for paragraph portions which have been depicted as bold type using direct formatting. If it encounters a corresponding paragraph portion, it deletes the direct formatting using the setPropertyToDefault method and assigns a MyBold character template to the corresponding paragraph portion.

Dim Doc As Object
Dim Enum1 As Object
Dim Enum2 As Object
Dim TextElement As Object
Dim TextPortion As Object
 
Doc = ThisComponent
Enum1 = Doc.Text.createEnumeration
 
' loop over all paragraphs
While Enum1.hasMoreElements
  TextElement = Enum1.nextElement
 
  If TextElement.supportsService("com.sun.star.text.Paragraph") Then
    Enum2 = TextElement.createEnumeration
    ' loop over all paragraph portions
 
    While Enum2.hasMoreElements
      TextPortion = Enum2.nextElement
 
      If TextPortion.CharWeight = _
        com.sun.star.awt.FontWeight.BOLD AND _
        TextPortion.getPropertyState("CharWeight") = _
        com.sun.star.beans.PropertyState.DIRECT_VALUE Then
          TextPortion.setPropertyToDefault("CharWeight")
          TextPortion.CharStyleName = "MyBold" 
      End If
    Wend
  End If
Wend
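The example above presupposes that a character template named MyBold already exists in the document. If it does not, it could be created beforehand along the following lines. This is a sketch under the assumption that the template should simply apply bold type. It uses the document's StyleFamilies container, which is not otherwise covered in this section.

```basic
Dim Doc As Object
Dim CharStyles As Object
Dim NewStyle As Object

Doc = ThisComponent
CharStyles = Doc.StyleFamilies.getByName("CharacterStyles")

' create the template only if it does not exist yet
If Not CharStyles.hasByName("MyBold") Then
  NewStyle = Doc.createInstance("com.sun.star.style.CharacterStyle")
  CharStyles.insertByName("MyBold", NewStyle)
  ' set the properties after the template has been inserted
  NewStyle.CharWeight = com.sun.star.awt.FontWeight.BOLD
End If
```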


Content on this page is licensed under the Public Documentation License (PDL).