Difference between revisions of "Writer/TOC"

From Apache OpenOffice Wiki
 
(3 intermediate revisions by 2 users not shown)
Line 1: Line 1:
=<center>Table of Contents Improvements<center>=
+
=<center>Table of Contents Improvements</center>=
  
 
<center>'''On Path to Vision'''</center>
 
<center>'''On Path to Vision'''</center>
Line 31: Line 31:
 
'''Motivation'''
 
'''Motivation'''
  
The Formal Apache OpenOffice.org do not preserve the exact TOC entries contents via interpreting the TOC entries contents caches stored inside MS Office Word 2003 DOC format files, but generating TOC entries contents depend on collected heading paragraphs contents after loading whole document main contents.  
+
The current Apache OpenOffice.org do not preserve the exact TOC entries contents via interpreting the TOC entries contents caches stored inside MS Office Word 2003 DOC format files, but generating TOC entries contents depend on collected heading paragraphs contents after loading whole document main contents.  
  
 
Such TOC loading strategy inside Apache OpenOffice.org leads 3 main issues show as below:
 
Such TOC loading strategy inside Apache OpenOffice.org leads 3 main issues show as below:
Line 37: Line 37:
 
# Bad fidelity on representing specified type of MS Word DOC. Considering a MS Word DOC in which contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC and save the document, then reopen the file inside MS Word, the TOC would be the exactly the same as before. But if we open it inside the Apache OpenOffice.org, the TOC will be totally empty;
 
# Bad fidelity on representing specified type of MS Word DOC. Considering a MS Word DOC in which contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC and save the document, then reopen the file inside MS Word, the TOC would be the exactly the same as before. But if we open it inside the Apache OpenOffice.org, the TOC will be totally empty;
 
# The manually created/removed TOC entries contents will be lost; Some users of Word would like to add or remove TOC entries manually after generating TOC inside MS Word. Such manual modifications happens on TOC contents will be representing perfectly inside MS word when reopen the DOC files. But, such manual modifications will be lost when loading the DOC files inside Apache OpenOffice.org;
 
# The manually created/removed TOC entries contents will be lost; Some users of Word would like to add or remove TOC entries manually after generating TOC inside MS Word. Such manual modifications happens on TOC contents will be representing perfectly inside MS word when reopen the DOC files. But, such manual modifications will be lost when loading the DOC files inside Apache OpenOffice.org;
# The paragraph/text/field attributes assigned in TOC block will be lost; In some specified TOC generating mode, the paragraph/text/field attributes assigned on a heading paragraph may finally affect the TOC corresponding content entry representing. But in formal Apache OpenOffice.org, such type of TOC inside MS Word Document, will be generated follow the standard TOC entries paragraph/text/field formatting;
+
# The paragraph/text/field attributes assigned in TOC block will be lost; In some specified TOC generating mode, the paragraph/text/field attributes assigned on a heading paragraph may finally affect the TOC corresponding content entry representing. But in current Apache OpenOffice.org, such type of TOC inside MS Word Document, will be generated follow the standard TOC entries paragraph/text/field formatting;
  
 
'''Detailed Specification'''
 
'''Detailed Specification'''
Line 43: Line 43:
 
'''The original TOC loading process introduction and the improvement of this feature'''
 
'''The original TOC loading process introduction and the improvement of this feature'''
  
In the formal Word DOC TOC loading process, there are generally steps of work:
+
In the current Word DOC TOC loading process, there are generally steps of work:
  
 
# Verifying the exact position of TOC block in the document;
 
# Verifying the exact position of TOC block in the document;
Line 54: Line 54:
 
In this MS Word DOC filter improvement focus on TOC contents cache, we will give following strategy changes:
 
In this MS Word DOC filter improvement focus on TOC contents cache, we will give following strategy changes:
  
* Heading paragraphs collecting step removal, indicate the step 5 above;
+
* <s>Heading paragraphs collecting step removal, indicate the step 5 above;</s>
 
* TOC generating/updating step removal, indicate the step 6 above;
 
* TOC generating/updating step removal, indicate the step 6 above;
 
* TOC contents cache parsing step addition, expand the step 3 above;
 
* TOC contents cache parsing step addition, expand the step 3 above;
Line 77: Line 77:
  
 
The TOC contents cache preserved;
 
The TOC contents cache preserved;
| In further specified cases, some modifications may happens to the main contents, but the TOC was not updated before saving. In the formal Apache OpenOffice.org, loaded TOC will always keep accordance exactly with the main contents/heading paragraphs. With this feature, we just preserve the TOC contents cache recorded in the DOC document anyway.
+
| In further specified cases, some modifications may happens to the main contents, but the TOC was not updated before saving. In the current Apache OpenOffice.org, loaded TOC will always keep accordance exactly with the main contents/heading paragraphs. With this feature, we just preserve the TOC contents cache recorded in the DOC document anyway.
  
 
|-
 
|-
Line 372: Line 372:
 
As we know, MS Word 2003 binary format record SPRMs with corresponding CPs in PLC streams for specified types of properties. All the fields also applying the same data structure, and all the field corresponding field SPRMs are recorded in PLC stream named PLCFLD.
 
As we know, MS Word 2003 binary format record SPRMs with corresponding CPs in PLC streams for specified types of properties. All the fields also applying the same data structure, and all the field corresponding field SPRMs are recorded in PLC stream named PLCFLD.
  
'''The AOO formal design of TOC loading'''
+
'''The AOO current design of TOC loading'''
  
The formal design of loading TOC in MS Word 2003 binary format in AOO, is not a real “LOADING” way, but actually a “GENERATING” way, for the formal design of AOO not trying to parse the cached representation contents of TOC field.  
+
The current design of loading TOC in MS Word 2003 binary format in AOO, is not a real “LOADING” way, but actually a “GENERATING” way, for the current design of AOO not trying to parse the cached representation contents of TOC field.  
  
After catching the TOC field start key word in main content stream, the formal loading design will remark the TOC block position in the document and just parse the TOC field expression and parameters for creating the TOC entry token patterns, and jump over all the cached TOC field representation contents at all. The formal loading process will collect all the outline paragraphs with heading paragraph styles or corresponding outline level settings, when performing load of rest part of document, and generate TOC entries depend on the TOC entry tokens patterns one by one. And this mechanism of TOC loading leads the several issues as we already known.
+
After catching the TOC field start key word in main content stream, the current loading design will remark the TOC block position in the document and just parse the TOC field expression and parameters for creating the TOC entry token patterns, and jump over all the cached TOC field representation contents at all. The current loading process will collect all the outline paragraphs with heading paragraph styles or corresponding outline level settings, when performing load of rest part of document, and generate TOC entries depend on the TOC entry tokens patterns one by one. And this mechanism of TOC loading leads the several issues as we already known.
  
 
'''Detailed Design of TOC loading Enhancement'''
 
'''Detailed Design of TOC loading Enhancement'''
Line 479: Line 479:
  
  
[[Category:Writer/Effort]]
+
[[Category:Writer/Effort/Completed]]

Latest revision as of 16:16, 8 January 2014

Table of Contents Improvements

On Path to Vision


Author: chengjh/zhengfan(@apache.org)

Overall Description

TOC (Table of Contents) is a significant feature in AOO Writer. Although it already gives end users powerful capabilities, the following areas still need improvement, especially fidelity with MS Word. I propose them as candidates for the next release.

Descriptions of Main Problems

Loading of MS Word TOC

Binary Format

  • The TOC data of an MS Word document is not parsed completely. Instead, the actual TOC data comes from a silent update performed once the MS Word document has been loaded. Fidelity therefore cannot be ensured, especially when document contents that affect the TOC were changed after the TOC was created in MS Word.
  • If, after the TOC has been created in MS Word, the paragraphs with Heading styles are deleted, or the Heading styles are removed from paragraphs that were collected into the TOC, the TOC disappears once the MS Word binary document is opened in Apache OpenOffice.org Writer.
  • If new paragraphs are given Heading styles after the TOC has been created in MS Word, new entries are added to the TOC once the MS Word binary document is opened in Apache OpenOffice.org Writer.
  • The tab between the chapter number and the TOC entry is lost when loading an MS Word document, so the gap between chapter number and TOC entry differs from the one shown in MS Word.

Status

Ongoing

Function Specification

Abstract

Provide a solution for preserving the TOC contents of DOC files by interpreting the TOC entry data stored inside the MS Office Word 2003 .DOC binary file, instead of the current implementation, which generates the TOC contents from heading information collected from the main contents of the DOC file.

Motivation

The current Apache OpenOffice.org does not preserve the exact TOC entry contents by interpreting the TOC entry caches stored inside MS Office Word 2003 DOC files. Instead, it generates the TOC entries from the heading paragraphs collected after the whole main content of the document has been loaded.

This TOC loading strategy in Apache OpenOffice.org leads to 3 main issues, shown below:

  1. Poor fidelity for certain kinds of MS Word DOC files. Consider an MS Word DOC that contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC, save the document, and reopen it in MS Word, the TOC is exactly the same as before. If we open it in Apache OpenOffice.org, however, the TOC is completely empty;
  2. Manually created or removed TOC entries are lost. Some Word users add or remove TOC entries manually after generating the TOC in MS Word. MS Word shows these manual modifications faithfully when the DOC file is reopened, but they are lost when the DOC file is loaded in Apache OpenOffice.org;
  3. Paragraph/text/field attributes assigned in the TOC block are lost. In some TOC generating modes, the paragraph/text/field attributes assigned to a heading paragraph affect how the corresponding TOC entry is rendered. In the current Apache OpenOffice.org, however, such a TOC from an MS Word document is generated with the standard paragraph/text/field formatting for TOC entries;

Detailed Specification

Introduction to the original TOC loading process and its improvement

The current Word DOC TOC loading process generally consists of the following steps:

  1. Verifying the exact position of the TOC block in the document;
  2. Parsing the TOC field expression and creating the internal TOC model with the TOC entry pattern and TOC collecting rules accordingly;
  3. Skipping the TOC field representation cache;
  4. Skipping the paragraph/text/field attributes that correspond to the TOC field representation cache range;
  5. Collecting the heading paragraphs according to said collecting rules while loading the main contents of the document;
  6. Generating the TOC entries according to said TOC entry pattern;

In this MS Word DOC filter improvement, which focuses on the TOC contents cache, we make the following strategy changes:

  • (Struck out in the latest revision) Removal of the heading paragraph collecting step, i.e. step 5 above;
  • Removal of the TOC generating/updating step, i.e. step 6 above;
  • Addition of a TOC contents cache parsing step, which expands step 3 above;
  • Parsing of the paragraph/text/field attributes that correspond to the TOC contents cache range, which expands step 4 above;
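
The practical effect of replacing entry generation with cache parsing can be sketched as follows. This is a minimal illustration under stated assumptions, not AOO code: `DocModel`, `loadTocByRegenerating` and `loadTocFromCache` are hypothetical names, and the real filter works on the WW8 streams rather than on string vectors.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of a DOC file after parsing.
struct DocModel {
    std::vector<std::string> headings;          // heading paragraphs in the body
    std::vector<std::string> cachedTocEntries;  // entries stored in the TOC field cache
};

// Previous strategy: ignore the cache and build entries from collected headings.
std::vector<std::string> loadTocByRegenerating(const DocModel& doc) {
    return doc.headings;
}

// Improved strategy: keep exactly what the TOC field cache contains.
std::vector<std::string> loadTocFromCache(const DocModel& doc) {
    return doc.cachedTocEntries;
}
```

For a document whose headings were deleted after the TOC was generated, the first function yields an empty TOC while the second still returns the cached entries, which is the first issue listed under Motivation.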

Behavioral differences introduced by this improvement

This section describes the differences through user scenarios.

Scenario 1
Description: In Apache OpenOffice.org with this improvement, open an MS Word DOC document in which:
  • Several heading paragraphs and a corresponding generated TOC existed;
  • All the main contents except the TOC have been deleted.
Result: The TOC contents cache is preserved.
Comment: In further cases, the main contents may have been modified without the TOC being updated before saving. In the current Apache OpenOffice.org, the loaded TOC always matches the main contents/heading paragraphs exactly. With this feature, the TOC contents cache recorded in the DOC document is preserved in any case.

Scenario 2
Description: In Apache OpenOffice.org with this improvement, open an MS Word DOC document that:
  • Has a generated TOC;
  • Has a TOC block that the user modified manually, for example by inserting new paragraphs and/or deleting paragraphs.
Result: The user's manual modifications to the TOC are preserved.
Comment: In some special cases of manually modified TOCs, the TOC formatting may not be as good as that of a generated TOC.

Scenario 3
Description: In Apache OpenOffice.org with this improvement, open an MS Word DOC document that:
  • Has several heading paragraphs with special text attributes, such as strikethrough, outline font and font color, applied to the whole or part of a paragraph;
  • Has a corresponding generated TOC.
Result: The text attributes are applied to the TOC entries accordingly.


Impact on Import/Export filters

Support of the new paragraph style's List Level attribute in import/export filter for the following file formats:

  • Microsoft Word binary format (WW8)

Design Description

A brief introduction to TOC in DOC files

The TOC record in the main content stream

In the MS Word 2003 binary format, a TOC is described as a nested field whose field expression is “TOC” followed by external parameters. Like every field type defined in the MS Word 2003 binary format, a TOC field starts with the key control word '0x13', followed by the field expression. If the field has representation content (the field code is toggled off in MS Word), an optional key control word '0x14' follows the field expression and separates it from the field representation content. Finally, the key control word '0x15' marks the end of the field.
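
The start/separator/end layout described above can be sketched as a small splitter. This is an illustrative sketch only: `FieldParts` and `splitField` are hypothetical names, the text is treated as a plain 8-bit string, and nested fields are passed through untouched rather than interpreted.

```cpp
#include <cassert>
#include <string>

// Key control words named in the text above.
constexpr char kFieldBegin = 0x13;
constexpr char kFieldSep   = 0x14;
constexpr char kFieldEnd   = 0x15;

// Hypothetical container for the two halves of one field.
struct FieldParts {
    std::string instruction;  // field expression plus external parameters
    std::string result;       // cached representation content (may be empty)
    bool hasResult = false;   // true once the 0x14 separator was seen
};

// Splits the outermost field in `text` into instruction and result.
// Nested fields are copied through unchanged, including their marks.
// Returns false if `text` does not start with 0x13 or is unbalanced.
bool splitField(const std::string& text, FieldParts& out) {
    out = FieldParts();
    if (text.empty() || text[0] != kFieldBegin) return false;
    int depth = 0;
    for (char c : text) {
        if (c == kFieldBegin) {
            if (++depth == 1) continue;     // outermost start mark is dropped
        } else if (c == kFieldEnd) {
            if (--depth == 0) return true;  // outermost end mark closes the field
        } else if (c == kFieldSep && depth == 1 && !out.hasResult) {
            out.hasResult = true;           // switch from instruction to result
            continue;
        }
        (out.hasResult ? out.result : out.instruction) += c;
    }
    return false;  // input ended before the closing 0x15
}
```

Tracking the nesting depth matters because a separator or end mark that belongs to an inner field must not terminate the outer one.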

As described above, the TOC in this format is defined as a nested field: each TOC entry inside the representation content of the TOC is composed of other field types, such as the HYPERLINK field and the page reference field. A TOC created in MS Word with default settings contains several TOC entries that refer to outline paragraphs in the current document. Each entry is actually a hyperlink field followed by the paragraph break key word '0x0D'. Furthermore, a PAGEREF field sits at the end of the HYPERLINK field representation content and represents the page number. See the following tables:

Composition of page reference field inside TOC

Start | Field expression | Separator | Field Representation | End
'0x13' | “PAGEREF” External parameter | '0x14' | Page Numbering | '0x15'

External parameters of the page reference field include:

\h Creates a hyperlink to the bookmarked paragraph.
\p Causes the field to display its position relative to the source bookmark. If the PAGEREF field is on the same page as the bookmark, it omits "on page #" and returns "above" or "below" only. If the PAGEREF field is not on the same page as the bookmark, the string "on page #" is used.

Composition of hyperlink field inside TOC

Start | Field expression | Separator | Field Representation | End
'0x13' | “HYPERLINK” External parameter | '0x14' | Common text, Page reference field | '0x15'

External parameters of the hyperlink field include:

\l field-argument text in this switch's field-argument specifies a location in the file, such as a bookmark, where this hyperlink will jump.
\m Appends coordinates to a hyperlink for a server-side image map.
\n Causes the destination site to be opened in a new window.
\o field-argument text in this switch's field-argument specifies the Screen-Tip text for the hyperlink.
\t field-argument text in this switch's field-argument specifies the target to which the link should be redirected. Use this switch to link from a frames page to a page that you want to appear outside of the frames page.

The permitted values for text are:

  • _top, whole page (the default)
  • _self, same frame
  • _blank, new window
  • _parent, parent frame


Composition of TOC field

Start | Field expression | Separator | Field Representation | End
'0x13' | “TOC” External parameter | '0x14' | TOC entries | '0x15'

Composition of TOC entries:

TOC Entry | TOC Entry | TOC Entry | TOC Entry | …

Composition of TOC entry:

Hyperlink Field | '0x0D'
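
To show how the nested composition above unfolds, the sketch below walks the representation content of a TOC field and returns the visible text of each entry. It is an assumption-laden illustration: `tocEntryTexts` is a hypothetical helper, the input is a plain 8-bit string, and only text inside the result part of every enclosing field is treated as visible.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the visible text of each TOC entry found in `tocResult`, the
// representation content of a TOC field. Instruction parts of nested fields
// (HYPERLINK, PAGEREF) are dropped; their result text is kept.
std::vector<std::string> tocEntryTexts(const std::string& tocResult) {
    const char kBegin = 0x13, kSep = 0x14, kEnd = 0x15, kParaBreak = 0x0D;
    std::vector<std::string> entries;
    std::string current;
    std::vector<char> inResult;  // one flag per open nested field
    int hidden = 0;              // open fields still in their instruction part
    for (char c : tocResult) {
        if (c == kBegin) {
            inResult.push_back(0);
            ++hidden;
        } else if (c == kSep) {
            if (!inResult.empty() && !inResult.back()) {
                inResult.back() = 1;
                --hidden;
            }
        } else if (c == kEnd) {
            if (!inResult.empty()) {
                if (!inResult.back()) --hidden;  // field had no result part
                inResult.pop_back();
            }
        } else if (c == kParaBreak && inResult.empty()) {
            entries.push_back(current);          // 0x0D closes one TOC entry
            current.clear();
        } else if (hidden == 0) {
            current += c;                        // visible result text
        }
    }
    if (!current.empty()) entries.push_back(current);
    return entries;
}
```

With a default TOC, each returned string is the hyperlink text followed by the page number taken from the nested PAGEREF result.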

External parameters of the TOC field include:

\a field-argument Includes captioned items, but omits caption labels and numbers. The identifier designated by text in this switch's field-argument corresponds to the caption label. Use \c to build a table of captions with labels and numbers.
\b field-argument Includes entries only from the portion of the document marked by the bookmark named by text in this switch's field-argument.
\c field-argument Includes figures, tables, charts, and other items that are numbered by a SEQ field. The sequence identifier designated by text in this switch's field-argument, which corresponds to the caption label, shall match the identifier in the corresponding SEQ field.
\d field-argument When used with \s, the text in this switch's field-argument defines the separator between sequence and page numbers. The default separator is a hyphen (-).
\f field-argument Includes only those TC fields whose identifier exactly matches the text in this switch's field-argument (which is typically a letter).
\h Makes the table of contents entries hyperlinks.
\l field-argument Includes TC fields that assign entries to one of the levels specified by text in this switch's field-argument as a range having the form startLevel-endLevel, where startLevel and endLevel are integers, and startLevel has a value equal-to or less-than endLevel. TC fields that assign entries to lower levels are skipped.
\n field-argument Without field-argument, omits page numbers from the table of contents. Page numbers are omitted from all levels unless a range of entry levels is specified by text in this switch's field-argument. A range is specified as for \l.
\o field-argument Uses paragraphs formatted with all or the specified range of built-in heading styles. Headings in a style range are specified by text in this switch's field-argument using the notation specified as for \l, where each integer corresponds to the style with a style ID of HeadingX (e.g. 1 corresponds to Heading1). If no heading range is specified, all heading levels used in the document are listed.
\p field-argument text in this switch's field-argument specifies a sequence of characters that separate an entry and its page number. The default is a tab with leader dots.
\s field-argument For entries numbered with a SEQ field, adds a prefix to the page number. The prefix depends on the type of entry. text in this switch's field-argument shall match the identifier in the SEQ field.
\t field-argument Uses paragraphs formatted with styles other than the built-in heading styles. text in this switch's field-argument specifies those styles as a set of comma-separated doublets, with each doublet being a comma-separated set of style name and table of content level. \t can be combined with \o.
\u Uses the applied paragraph outline level.
\w Preserves tab entries within table entries.
\x Preserves newline characters within table entries.
\z Hides tab leader and page numbers in Web layout view.
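
The switch syntax listed above can be read with a short tokenizer. The following is a sketch, not the AOO parser: `parseSwitches` and `parseLevelRange` are hypothetical helpers, every switch is assumed to be a backslash plus one letter, and escape sequences inside quoted arguments are not handled.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Collects switches of the form  \x ,  \x argument  or  \x "quoted argument"
// from a field expression. Switches without an argument map to "".
std::map<char, std::string> parseSwitches(const std::string& instr) {
    std::map<char, std::string> switches;
    const std::size_t n = instr.size();
    std::size_t i = 0;
    while (i < n) {
        if (instr[i] == '"') {               // skip a quoted non-switch argument
            ++i;
            while (i < n && instr[i] != '"') ++i;
            if (i < n) ++i;
            continue;
        }
        if (instr[i] != '\\' || i + 1 >= n) { ++i; continue; }
        const char name = instr[i + 1];
        i += 2;
        while (i < n && instr[i] == ' ') ++i;
        std::string arg;
        if (i < n && instr[i] == '"') {      // quoted field-argument
            ++i;
            while (i < n && instr[i] != '"') arg += instr[i++];
            if (i < n) ++i;                  // closing quote
        } else {                             // bare field-argument, if any
            while (i < n && instr[i] != ' ' && instr[i] != '\\') arg += instr[i++];
        }
        switches[name] = arg;
    }
    return switches;
}

// Parses a "startLevel-endLevel" range as used by \l, \n and \o.
bool parseLevelRange(const std::string& text, int& start, int& end) {
    return std::sscanf(text.c_str(), "%d-%d", &start, &end) == 2 && start <= end;
}
```

For the expression `TOC \o "1-3" \h \z \u`, the map holds four switches, and the `\o` argument parses to the range 1 to 3.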

The TOC record in the PLC stream

As we know, the MS Word 2003 binary format records SPRMs with their corresponding CPs in PLC streams for specific types of properties. Fields use the same data structure, and the field-related SPRMs of all fields are recorded in the PLC stream named PLCFLD.
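
As a rough illustration of how such a PLC pairs character positions with field data, the sketch below decodes a byte buffer under an assumed layout: n+1 little-endian 32-bit CPs followed by n two-byte field descriptors whose first byte carries the field character in its low five bits. That layout is an assumption taken from the general description of PLC structures and should be checked against the format specification; `FieldMark` and `readPlcfld` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoded entry: where a field character sits and which one it is.
struct FieldMark {
    std::uint32_t cp;     // character position in the text stream
    std::uint8_t  ch;     // 0x13, 0x14 or 0x15
    std::uint8_t  flags;  // second descriptor byte, left uninterpreted here
};

// Decodes a PLCFLD-like buffer. Assumed layout (not taken from the source):
// (n + 1) CPs of 4 bytes each, then n descriptors of 2 bytes each.
std::vector<FieldMark> readPlcfld(const std::vector<std::uint8_t>& bytes) {
    std::vector<FieldMark> marks;
    const std::size_t cpSize = 4, fldSize = 2;
    if (bytes.size() < cpSize) return marks;
    const std::size_t n = (bytes.size() - cpSize) / (cpSize + fldSize);
    const std::size_t fldBase = (n + 1) * cpSize;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t c = i * cpSize;
        const std::size_t f = fldBase + i * fldSize;
        FieldMark m;
        m.cp = static_cast<std::uint32_t>(bytes[c])
             | (static_cast<std::uint32_t>(bytes[c + 1]) << 8)
             | (static_cast<std::uint32_t>(bytes[c + 2]) << 16)
             | (static_cast<std::uint32_t>(bytes[c + 3]) << 24);
        m.ch = static_cast<std::uint8_t>(bytes[f] & 0x1F);
        m.flags = bytes[f + 1];
        marks.push_back(m);
    }
    return marks;
}
```

The extra trailing CP has no descriptor of its own, which is why the element count is derived from the buffer size minus one CP.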

The current AOO design of TOC loading

The current design for loading a TOC from the MS Word 2003 binary format in AOO is not really a “LOADING” approach but a “GENERATING” one, because it does not try to parse the cached representation content of the TOC field.

After catching the TOC field start key word in the main content stream, the current loading design marks the TOC block position in the document, parses only the TOC field expression and parameters to create the TOC entry token patterns, and skips all of the cached TOC field representation content. While loading the rest of the document, the current process collects all outline paragraphs that have heading paragraph styles or corresponding outline level settings, and generates the TOC entries one by one according to the TOC entry token patterns. This loading mechanism leads to the issues already described.

Detailed Design of TOC loading Enhancement

TOC Cached representation content loading

For a TOC field recorded in the main content stream, no longer skip the cached representation content.

  1. In the TOC loading function SwWW8ImplReader::Read_F_TOX(), return the field status FLD_TEXT, which indicates that the rest of the field (the representation content) should be parsed, instead of FLD_OK, which indicates that it should be ignored;
  2. In the page reference field loading function SwWW8ImplReader::Read_F_PgRef(), likewise return FLD_TEXT instead of FLD_OK if the page reference field being loaded is inside a TOC field;
  3. Mark the TOC field and the hyperlink field as nestable field types in the external function AcceptableNestedField();
  4. When parsing a page reference field inside the cached representation content of a TOC field, convert the page reference field representation content to a common text string, because of a specific ODF interoperability issue;
  5. If there is no hyperlink field inside, convert the page reference field to a hyperlink;
  6. In the final step of the field loading process, SwWW8ImplReader::End_Field(), add corresponding branches for handling the TOC field, the hyperlink field and the page reference field.
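
The status decisions in steps 1 to 3 can be summarised in a small sketch. The enums and functions below are hypothetical stand-ins, not the real SwWW8ImplReader members or their signatures, and any other field types that the real AcceptableNestedField() accepts are not modelled.

```cpp
#include <cassert>

// Stand-ins for the FLD_OK and FLD_TEXT statuses named above.
enum class FieldStatus { Ok, Text };

// Field kinds relevant to this design; the real reader knows many more.
enum class FieldKind { Toc, Hyperlink, PageRef, Other };

// Step 1: a TOC field now always asks for its cached content to be parsed.
FieldStatus tocFieldStatus() {
    return FieldStatus::Text;
}

// Step 2: a page reference field asks for parsing only inside a TOC field.
FieldStatus pageRefFieldStatus(bool insideToc) {
    return insideToc ? FieldStatus::Text : FieldStatus::Ok;
}

// Step 3: TOC and hyperlink fields are treated as nestable field types.
bool isNestableAfterChange(FieldKind kind) {
    return kind == FieldKind::Toc || kind == FieldKind::Hyperlink;
}
```

Keeping the page reference decision dependent on the enclosing TOC leaves the handling of page references elsewhere in the document unchanged.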

Generating TOC entries

While loading the rest of the document, no longer collect headings and generate TOC entries:

  1. In the TOC loading function SwWW8ImplReader::Read_F_TOX(), disable the TOC updating flag after parsing the TOC expression;

Positioning of the TOC block and the rest of the document

At the start and end steps of the TOC field, move the CURRENT insertion position of the document:

  1. In the TOC loading function SwWW8ImplReader::Read_F_TOX(), move the CURRENT insertion position into the TOC section;
  2. In the final step of the field loading process, SwWW8ImplReader::End_Field(), move the CURRENT insertion position of the document back to the position that follows the TOC section;
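
The interplay of the update flag and the insertion position can be sketched with a toy loader. Everything here is hypothetical (`Loader`, its members and its methods); the real implementation moves a document cursor inside Writer's model rather than switching between two vectors.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy loader: paragraphs go either into the TOC section or into the body,
// depending on where the current insertion position is.
struct Loader {
    std::vector<std::string> body;  // paragraphs outside the TOC section
    std::vector<std::string> toc;   // paragraphs inside the TOC section
    bool insideTocSection = false;  // models the current insertion position
    bool updateTocOnLoad = true;    // models the TOC updating flag

    // Start of the TOC field: stop regeneration, move into the TOC section.
    void beginTocField() {
        updateTocOnLoad = false;
        insideTocSection = true;
    }

    // End of the TOC field: move back to the position after the section.
    void endTocField() {
        insideTocSection = false;
    }

    void appendParagraph(const std::string& text) {
        (insideTocSection ? toc : body).push_back(text);
    }
};
```

Cached entries read between `beginTocField()` and `endTocField()` land in the TOC section, and everything read afterwards continues in the body.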

OOXML Format

Same as the binary file format.

Status

Not Started

Customized Formats of TOC Entry

Binary Format

The customized character attributes are lost when loading an MS Word TOC created with "Use hyperlinks instead of page numbers" unchecked. For this kind of TOC, MS Word can collect the customized character attributes of the target paragraphs into the TOC.

Status

Not Started

OOXML Format

Same as the binary file format.

Status

Not Started

Export TOC to MS Word

Binary Format

  • Saving back to MS Word binary format

The width of the tab between the chapter numbering and the TOC entry changes.

  • Saving ODT to MS Word binary format

The jump hyperlink information is lost when exporting an ODT TOC to an MS Word binary TOC.

Status

Not Started

OOXML Format

Status

Not Started

TOC Jumping with Page Numbers Only

Jump information is lost when loading an MS Word TOC created with "Use hyperlinks instead of page numbers" unchecked. For this kind of TOC, end users in MS Word can only jump by Ctrl+clicking the page number of the TOC entry.

Status

Not Started


Accessibility

The current TOC dialog does not meet the accessibility requirements.

Status

Not Started

Usability

The current TOC dialog is difficult for end users to understand and use. Most end users can only create a default TOC and find it confusing to customize the attributes and styles.

Status

Not Started; needs a UX designer.

Comments
