Writer/BiDi Layout


This page is in a DRAFT stage.



Bidirectional Text Formatting and Layout

Introduction

I'd like to discuss some topics related to bidirectional text formatting for languages such as Arabic and Hebrew. These languages are also referred to as CTL (complex text layout) languages, although bidirectionality (BiDi) is only one possible aspect of CTL. CTL is a term for languages whose writing systems need complex transformations to display the text stored in memory. Examples are the BiDi languages already mentioned, languages that use clustered characters, such as Thai, and languages with characters whose visual representation depends on their context (e.g., ligatures).

BiDi languages have a general writing order of right-to-left and top-to-bottom, with small portions of text (numbers and Latin text) written from left to right. To integrate support for bidirectional text formatting into the Writer, we need to know how to map the features and behavior of the 6.0 version of the Writer for western languages to the requirements of BiDi.

If I am wrong with some of my assumptions about layout or formatting: Let us know!

If I forgot something: Let us know!

If I make suggestions that are superfluous because nobody wants or uses them: Let us know!

It would be helpful if you could rate your feedback like this:

  1. Absolutely necessary! This refers to very basic and commonly known rules about BiDi typography/layout. BiDi without this feature or breaking this rule does not make sense.
  2. I want this and I know about lots of people who can hardly live without...
  3. I would like to have this because...

When showing examples in the following sections, I will mark right-to-left text by using CAPITAL LETTERS to distinguish it from left-to-right text used in the same context. I will use these abbreviations: LTR for left-to-right and RTL for right-to-left orientation.

RTL and LTR Areas

Just like the vertical direction for CJK documents, the general direction for text formatting is part of the page style and frame style. Once an RTL orientation has been chosen for a page, the default alignment is from right to left. The RTL or LTR orientation can differ between paragraphs. It is also possible to have different writing directions within the same line:

Input Sequence:

english text ARABIC TEXT more english

LTR Paragraph:

english text TXET CIBARA more english

RTL Paragraph:

more english TXET CIBARA english text

We build different text portions for the different scripts in the input string. RTL and LTR portions differ only in how the text they represent is displayed. We choose an automatic method that recognizes the direction of a portion by analyzing the script of the text it represents.
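The portion building described above can be sketched in a few lines. This is a minimal illustration, not the Writer implementation: it assumes the convention of this page (CAPITAL LETTERS stand for RTL text), treats everything that is not a letter as neutral, and uses a simplified rule for neutrals (they join their neighbors if both have the same direction, otherwise they follow the paragraph direction). All function names are hypothetical.

```python
def char_direction(ch):
    # Assumption: uppercase letters stand in for an RTL script, lowercase for LTR;
    # everything else (spaces, digits, punctuation) is treated as neutral.
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"


def resolve_directions(text, paragraph_dir):
    """Give every character a direction, resolving runs of neutrals."""
    dirs = [char_direction(ch) for ch in text]
    i = 0
    while i < len(dirs):
        if dirs[i] != "N":
            i += 1
            continue
        j = i
        while j < len(dirs) and dirs[j] == "N":
            j += 1
        before = dirs[i - 1] if i > 0 else paragraph_dir
        after = dirs[j] if j < len(dirs) else paragraph_dir
        dirs[i:j] = [before if before == after else paragraph_dir] * (j - i)
        i = j
    return dirs


def split_portions(text, paragraph_dir):
    """Split text into (direction, substring) portions in logical order."""
    dirs = resolve_directions(text, paragraph_dir)
    portions = []
    start = 0
    for i in range(1, len(text) + 1):
        if i == len(text) or dirs[i] != dirs[start]:
            portions.append((dirs[start], text[start:i]))
            start = i
    return portions


def visual_line(text, paragraph_dir):
    """Lay the portions out for display; RTL portions are shown reversed."""
    portions = split_portions(text, paragraph_dir)
    if paragraph_dir == "R":
        portions.reverse()
    return "".join(p[::-1] if d == "R" else p for d, p in portions)


print(visual_line("english text ARABIC TEXT more english", "L"))
print(visual_line("english text ARABIC TEXT more english", "R"))
```

The two calls reproduce the LTR and RTL paragraph examples shown above. The sketch only handles two nesting levels (the paragraph direction and one opposite direction).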

Header/Footer

Nothing has to be changed in the layout regarding the header and footer of a page. Everything is just the way it is in the western version of the Writer.


Header.png

Columns

Using a general LTR orientation in western layout implies using an LTR orientation for columns. The same holds for RTL: text formatting starts in the rightmost column:

Columns.png

Footnote Anchor/Footnote text

The footnote anchor is inserted at the current position. In a general RTL orientation, the frame for the footnote text has an RTL orientation, which can of course be changed. RTL in the footnote frame also means that the reference numbers are aligned on the right side:

Footnote.png

Numbering/Bullets

In an RTL paragraph, numbering and bullets are always positioned on the right side of the paragraph. Of course, the numbering characters may differ from country to country.

Numbering.png

Tables

Traveling through tables is done in right-to-left and top-to-bottom order:

Table bidi.png

Cursor Traveling

There are two possible kinds of cursor traveling:

  1. Logical: The cursor moves to the next position in the logical order of the characters. The → key means proceeding forward in logical order; the ← key lets you step back one position in logical order. This holds for both RTL and LTR orientation.
  2. Physical: The cursor moves to a neighbor of the current cursor position. → / ← lead to the next position to the right / left of the current position.

The same holds for text selection. Selecting text in logical order can lead to two visually disjoint regions of marked text, whereas selecting text in physical order and applying attributes to the selected area can make it necessary to apply these attributes to two disjoint sections in the representing string.

Which one is the more natural way? Which one do people actually use when working?

Cursor.png
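The difference between the two kinds of traveling, and the disjoint selections mentioned above, can be illustrated with a small sketch. It assumes an LTR paragraph without neutral characters, uses the capital-letter convention of this page for RTL text, and simplifies the cursor to "sitting on" a character index. The function names are hypothetical.

```python
def visual_to_logical(text):
    """For an LTR paragraph: list of logical indices in visual (left-to-right) order."""
    order = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j].isupper() == text[i].isupper():
            j += 1
        run = list(range(i, j))
        # Uppercase runs stand for RTL text and are displayed reversed.
        order.extend(reversed(run) if text[i].isupper() else run)
        i = j
    return order


def move_logical(text, index, step):
    """Logical traveling: step through the stored string."""
    return max(0, min(len(text) - 1, index + step))


def move_physical(order, index, step):
    """Physical traveling: step to the visual neighbor, return its logical index."""
    v = order.index(index)
    v = max(0, min(len(order) - 1, v + step))
    return order[v]


def contiguous_runs(indices):
    """Group integers into (first, last) runs of consecutive values."""
    runs = []
    for i in sorted(indices):
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


text = "abcDEFghi"                      # displayed as "abcFEDghi"
order = visual_to_logical(text)

# Pressing → on 'c': logical traveling reaches 'D', physical traveling reaches 'F'.
print(text[move_logical(text, 2, +1)], text[move_physical(order, 2, +1)])

# A logical selection of "bcD" covers two visually disjoint regions.
print(contiguous_runs(order.index(i) for i in (1, 2, 3)))

# A physical selection of visual positions 2..3 covers two disjoint string ranges.
print(contiguous_runs(order[v] for v in (2, 3)))
```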

Tab Stops

Our intention is to handle tab stops the way we handle them in the western version. Left tab stops are the usual choice in western text, whereas we expect the right tab stop to be the most important one for RTL languages. Difficulties arise if, for example, a left tab stop occurring in an LTR portion of a bidirectional line is handled differently from a left tab stop in an RTL portion. Therefore, tab stops in LTR portions have the same effect as those in RTL portions.

Tab.png

Frames

Using frames has similarities to tab stops. An LTR text portion can be broken apart by frames. Therefore, when western text portions lie to the right and left of a frame (or any drawing object), the reading direction is this:

Frames.png

Line Break

There are no special requirements for line breaks in bidirectional text formatting. An LTR portion of text within a general RTL orientation is broken into two portions if it does not fit completely into the line. The first part of the western text portion, which still fits into the line, has to be read first; the second part is positioned at the beginning of the next line:

Linebreak.png
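One way to obtain this behavior is to break lines on the logical string first and to arrange each line for display afterwards. The following is a word-level sketch under stated assumptions: an RTL paragraph, the capital-letter convention of this page for RTL words, and a line width measured in characters. The function names are hypothetical.

```python
from itertools import groupby


def wrap_logical(words, width):
    """Greedy line breaking on the logical word sequence."""
    lines, current = [], []
    for word in words:
        candidate = current + [word]
        if current and len(" ".join(candidate)) > width:
            lines.append(current)
            current = [word]
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def visual_rtl_line(words):
    """Arrange one line of an RTL paragraph for display (leftmost word first)."""
    runs = [(is_rtl, list(group)) for is_rtl, group in groupby(words, key=str.isupper)]
    pieces = []
    for is_rtl, group in reversed(runs):      # RTL paragraph: runs go right to left
        if is_rtl:
            pieces.extend(w[::-1] for w in reversed(group))
        else:
            pieces.extend(group)              # LTR run keeps its internal order
    return " ".join(pieces)


words = "ARABIC some english words MORE".split()
for line in wrap_logical(words, 20):
    print(visual_rtl_line(line))
```

With a width of 20 the western portion "some english words" is split: "some english" stays on the first line, and "words" ends up at the right edge, i.e. the beginning, of the second line.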

Ordering of Text Portions

The Bidirectional Algorithm, as introduced at http://www.unicode.org/unicode/reports/tr9/tr9-9.html, describes how a string containing portions of text with different directions has to be reordered so that it is displayed correctly. Unicode provides special characters to indicate an explicit directional embedding of text, e.g.,

  • LRE (0x202A) = left to right embedding
  • RLE (0x202B) = right to left embedding
  • PDF (0x202C) = pop directional format

Having this string in memory:

TEXT1 <LRE>text2 <RLE>TEXT3 <PDF>text4 <PDF>TEXT5

the result of the Bidirectional Algorithm is:

5TXET text2 3TXET text4 1TXET

Instead of evaluating these special Unicode characters and performing a reordering on the input string, we want to process the string in its logical order, changing the direction at the appropriate positions. Processing the string results in five different portions. Their sizes are given by the automatic detection of direction changes. This leads to the following visualization of the input string:

5TXET text4 3TXET text2 1TXET
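The two visualizations can be compared with a word-level sketch. It is a strong simplification of the Bidirectional Algorithm, not a conforming implementation: it works on whole words instead of characters, assumes an RTL paragraph (base level 1), uses the capital-letter convention of this page for RTL words, and writes the embedding characters as the literal markers used in the example. All names are hypothetical.

```python
MARKS = ("<LRE>", "<RLE>", "<PDF>")


def tokenize(s):
    for mark in MARKS:
        s = s.replace(mark, f" {mark} ")
    return s.split()


def word_levels(tokens, base_level, honor_embeddings):
    """Assign an embedding level to every word."""
    stack = [base_level]
    result = []
    for tok in tokens:
        if tok in MARKS:
            if not honor_embeddings:
                continue                      # portion approach: ignore the markers
            top = stack[-1]
            if tok == "<LRE>":                # next higher even level
                stack.append(top + 2 if top % 2 == 0 else top + 1)
            elif tok == "<RLE>":              # next higher odd level
                stack.append(top + 1 if top % 2 == 0 else top + 2)
            elif len(stack) > 1:              # <PDF>
                stack.pop()
            continue
        level = stack[-1]
        # Implicit step: a word whose direction does not match the parity
        # of its embedding level is raised by one level.
        if tok.isupper() != (level % 2 == 1):
            level += 1
        result.append((tok, level))
    return result


def reorder(items):
    """Reverse every run of words at a given level or higher, highest level first."""
    items = list(items)
    if not items:
        return items
    highest = max(level for _, level in items)
    odd = [level for _, level in items if level % 2 == 1]
    lowest_odd = min(odd) if odd else highest + 1
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < len(items):
            j = i
            while j < len(items) and items[j][1] >= level:
                j += 1
            if j > i:
                items[i:j] = items[i:j][::-1]
                i = j
            else:
                i += 1
    return items


def display(s, base_level=1, honor_embeddings=True):
    items = reorder(word_levels(tokenize(s), base_level, honor_embeddings))
    return " ".join(w[::-1] if level % 2 == 1 else w for w, level in items)


s = "TEXT1 <LRE>text2 <RLE>TEXT3 <PDF>text4 <PDF>TEXT5"
print(display(s, honor_embeddings=True))
print(display(s, honor_embeddings=False))
```

The first call evaluates the embedding markers, the second one ignores them and relies on automatic direction detection only, which corresponds to the two results shown above.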

Diacritics

Diacritics are marks placed above or below letters, mostly to indicate vowels and other details of pronunciation. They have to be stored in the same string as the other characters, but do not take an isolated position on the screen. Furthermore, a kind of vertical kerning for them would be desirable. There are single, double and combined diacritics. What is the preferred way to insert and edit diacritics? Should diacritics be selectable? Should they be considered during cursor traveling, i.e., do you have to make several steps to pass one character with its diacritics? Or are they ignored during cursor traveling and only regarded during deletion?

Ligatures

Some combinations of two or more characters can be represented by one special character. Whether such a special character exists depends on the font:

AB = C

Ligatures are defined inside the font, comparable to pairs for kerning. Pressing delete in front of the ligature has this result:

C => B

A backspace behind the ligature has this result:

C => A

Cursor traveling does not recognize the single characters inside the ligature.
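The behavior described above follows naturally if the string keeps the single characters and the ligature is only applied for display. A minimal sketch, assuming a hypothetical font table with the single ligature AB = C (the table and the function names are made up for illustration):

```python
LIGATURES = {"AB": "C"}   # hypothetical ligature table of the current font


def render(text):
    """Display form: known character pairs are replaced by their ligature."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def cursor_stops(text):
    """String offsets where the cursor may rest; none lies inside a ligature."""
    stops = [0]
    i = 0
    while i < len(text):
        i += 2 if text[i:i + 2] in LIGATURES else 1
        stops.append(i)
    return stops


def delete_forward(text, pos):
    """Delete key: remove the character after the cursor."""
    return text[:pos] + text[pos + 1:]


def backspace(text, pos):
    """Backspace key: remove the character before the cursor."""
    if pos == 0:
        return text
    return text[:pos - 1] + text[pos:]


print(render("AB"))                        # ligature
print(render(delete_forward("AB", 0)))     # delete in front of the ligature
print(render(backspace("AB", 2)))          # backspace behind the ligature
print(cursor_stops("xABy"))
```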

Numerals

In the Arab world, Hindi shapes are used to represent numerals, in contrast to the digits used in western text (1, 2, 3, ...). No matter which kind of digits is used, they are displayed from left to right. Hindi shapes are used for numerals typed after an Arabic character or at the beginning of an RTL paragraph:

Hindi digits.png

Arabic digits are used if they are typed after a Latin character or at the beginning of an LTR paragraph:

year 1997
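The context rule for digit shapes can be sketched as follows. This is an illustration under assumptions, not the Writer implementation: it takes "Hindi shapes" to mean the Arabic-Indic digits U+0660 to U+0669, uses Python's unicodedata module to classify the preceding letter, and lets characters that are neither Arabic nor left-to-right letters leave the current state unchanged. The function name is hypothetical.

```python
import unicodedata

ARABIC_INDIC_ZERO = 0x0660   # first of the Arabic-Indic ("Hindi") digits


def shape_digits(text, rtl_paragraph):
    """Replace 0-9 by Arabic-Indic digits where the context calls for them."""
    use_hindi = rtl_paragraph          # start of paragraph: follow its direction
    out = []
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "AL":               # Arabic letter
            use_hindi = True
        elif bidi == "L":              # left-to-right letter, e.g. Latin
            use_hindi = False
        if use_hindi and ch in "0123456789":
            ch = chr(ARABIC_INDIC_ZERO + int(ch))
        out.append(ch)
    return "".join(out)


print(shape_digits("year 1997", rtl_paragraph=False))
print(shape_digits("\u0633\u0646\u0629 1997", rtl_paragraph=False))
```

Only the digit shapes are changed; the order of the digits in the string stays the same, since both kinds of digits are displayed from left to right.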