Difference between revisions of "Writer/TOC"
(34 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | = | + | =<center>Table of Contents Improvements</center>= |
+ | |||
+ | <center>'''On Path to Vision'''</center> | ||
+ | |||
+ | |||
+ | <center>Author: chengjh/zhengfan(@apache.org)</center> | ||
==Overall Description== | ==Overall Description== | ||
− | TOC(Table of Contents) is a significant feature in | + | TOC (Table of Contents) is a significant feature in AOO Writer. Although, it has provided powerful capabilities to benefit end users for productivity, the following areas, especially the fidelity with MS Word, still need improvements. I propose them and put them as the candidates of the next release. |
==Descriptions of Main Problems== | ==Descriptions of Main Problems== | ||
− | |||
− | + | == Loading of MS Word TOC == | |
− | * The TOC data of a MS Word document is not parsed completely. And the actual TOC data is from silently updating once a MS Word Document loaded. Thus,the fidelity can not be ensured especially when the document contents that impact TOC have been changed after creating TOC in MS Word. | + | |
− | * After TOC has been created in MS Word,and then the paragraphs applied with Heading styles are deleted or applied Heading styles un-applied to the paragraphs that have been collected into TOC. Once such MS Word binary document launched into Apache OpenOffice.org Writer, the TOC will disappear. | + | |
− | * After TOC has been created in MS Word,and then new paragraphs are applied with Heading styles | + | ===Binary Format=== |
− | * The tab between chapter number and TOC entry lost when loading a MS Word document,which leads to different gap between chapter number and TOC entry. That looks different from MS Word. | + | * The TOC data of a MS Word document is not parsed completely. And the actual TOC data is from silently updating once a MS Word Document loaded. Thus, the fidelity can not be ensured especially when the document contents that impact TOC have been changed after creating TOC in MS Word. |
− | + | * After TOC has been created in MS Word, and then the paragraphs applied with Heading styles are deleted or applied Heading styles un-applied to the paragraphs that have been collected into TOC. Once such MS Word binary document launched into Apache OpenOffice.org Writer, the TOC will disappear. | |
+ | * After TOC has been created in MS Word, and then new paragraphs are applied with Heading styles. Once such MS Word binary document launched into Apache OpenOffice.org Writer,new entries will be added to TOC. | ||
+ | * The tab between chapter number and TOC entry lost when loading a MS Word document, which leads to different gap between chapter number and TOC entry. That looks different from MS Word. | ||
+ | ====Status==== | ||
Ongoing | Ongoing | ||
− | + | ====Function Specification==== | |
'''Abstract''' | '''Abstract''' | ||
Line 24: | Line 31: | ||
'''Motivation''' | '''Motivation''' | ||
− | The | + | The current Apache OpenOffice.org do not preserve the exact TOC entries contents via interpreting the TOC entries contents caches stored inside MS Office Word 2003 DOC format files, but generating TOC entries contents depend on collected heading paragraphs contents after loading whole document main contents. |
Such TOC loading strategy inside Apache OpenOffice.org leads 3 main issues show as below: | Such TOC loading strategy inside Apache OpenOffice.org leads 3 main issues show as below: | ||
Line 30: | Line 37: | ||
# Bad fidelity on representing specified type of MS Word DOC. Considering a MS Word DOC in which contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC and save the document, then reopen the file inside MS Word, the TOC would be the exactly the same as before. But if we open it inside the Apache OpenOffice.org, the TOC will be totally empty; | # Bad fidelity on representing specified type of MS Word DOC. Considering a MS Word DOC in which contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC and save the document, then reopen the file inside MS Word, the TOC would be the exactly the same as before. But if we open it inside the Apache OpenOffice.org, the TOC will be totally empty; | ||
# The manually created/removed TOC entries contents will be lost; Some users of Word would like to add or remove TOC entries manually after generating TOC inside MS Word. Such manual modifications happens on TOC contents will be representing perfectly inside MS word when reopen the DOC files. But, such manual modifications will be lost when loading the DOC files inside Apache OpenOffice.org; | # The manually created/removed TOC entries contents will be lost; Some users of Word would like to add or remove TOC entries manually after generating TOC inside MS Word. Such manual modifications happens on TOC contents will be representing perfectly inside MS word when reopen the DOC files. But, such manual modifications will be lost when loading the DOC files inside Apache OpenOffice.org; | ||
− | # The paragraph/text/field attributes assigned in TOC block will be lost; In some specified TOC generating mode, the paragraph/text/field attributes assigned on a heading paragraph may finally affect the TOC corresponding content entry representing. But in | + | # The paragraph/text/field attributes assigned in TOC block will be lost; In some specified TOC generating mode, the paragraph/text/field attributes assigned on a heading paragraph may finally affect the TOC corresponding content entry representing. But in current Apache OpenOffice.org, such type of TOC inside MS Word Document, will be generated follow the standard TOC entries paragraph/text/field formatting; |
+ | |||
+ | '''Detailed Specification''' | ||
− | + | '''The original TOC loading process introduction and the improvement of this feature''' | |
− | + | ||
− | In the | + | In the current Word DOC TOC loading process, there are generally steps of work: |
# Verifying the exact position of TOC block in the document; | # Verifying the exact position of TOC block in the document; | ||
Line 45: | Line 54: | ||
In this MS Word DOC filter improvement focus on TOC contents cache, we will give following strategy changes: | In this MS Word DOC filter improvement focus on TOC contents cache, we will give following strategy changes: | ||
− | * Heading paragraphs collecting step removal, indicate the step 5 above; | + | * <s>Heading paragraphs collecting step removal, indicate the step 5 above;</s> |
* TOC generating/updating step removal, indicate the step 6 above; | * TOC generating/updating step removal, indicate the step 6 above; | ||
* TOC contents cache parsing step addition, expand the step 3 above; | * TOC contents cache parsing step addition, expand the step 3 above; | ||
* TOC contents cache range corresponding paragraph/text/field attributes parsing, expand the step 4 above; | * TOC contents cache range corresponding paragraph/text/field attributes parsing, expand the step 4 above; | ||
− | + | '''The behavioral difference leads by this improvement''' | |
+ | |||
This section is described by a user scenarios table. | This section is described by a user scenarios table. | ||
− | |||
− | |||
− | |||
{| class="prettytable" | {| class="prettytable" | ||
! <center>#</center> | ! <center>#</center> | ||
Line 70: | Line 77: | ||
The TOC contents cache preserved; | The TOC contents cache preserved; | ||
− | | In further specified cases, some modifications may happens to the main contents, but the TOC was not updated before saving. In the | + | | In further specified cases, some modifications may happens to the main contents, but the TOC was not updated before saving. In the current Apache OpenOffice.org, loaded TOC will always keep accordance exactly with the main contents/heading paragraphs. With this feature, we just preserve the TOC contents cache recorded in the DOC document anyway. |
|- | |- | ||
Line 95: | Line 102: | ||
|} | |} | ||
− | + | ||
+ | '''Impact on Import/Export filters''' | ||
+ | |||
Support of the new paragraph style's List Level attribute in import/export filter for the following file formats: | Support of the new paragraph style's List Level attribute in import/export filter for the following file formats: | ||
* Microsoft Word binary format (WW8) | * Microsoft Word binary format (WW8) | ||
− | ''' | + | ====Design Description==== |
+ | |||
+ | '''The brief introduction of TOC in DOC files''' | ||
+ | |||
+ | '''The TOC record in main content stream''' | ||
+ | |||
+ | In the MS Word 2003 binary format, a TOC is stored as a nested field whose field expression is “TOC” followed by external parameters. Like every field type in this format, a TOC field starts with the control character '0x13', followed by the field expression. If the field has representation content (the content shown in MS Word when field codes are toggled off), the optional control character '0x14' separates the field expression from the representation content. Finally, the control character '0x15' terminates the field. | ||
+ | |||
+ | As described above, the TOC in this format is a nested field: each TOC entry inside the TOC's representation content is composed of other field types, such as the HYPERLINK field and the page reference (PAGEREF) field. A TOC created in MS Word with default settings contains several TOC entries that refer to the outline paragraphs in the current document. Each entry is a HYPERLINK field followed by the paragraph break character '0x0D'. In addition, a PAGEREF field sits at the end of the HYPERLINK field's representation content and holds the page number. The following tables show the layout: | ||
+ | |||
+ | '''composition of page reference field inside TOC''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | <center>'''Start'''</center> | ||
+ | | colspan="2" | <center>'''Field expression'''</center> | ||
+ | | <center>'''Separator '''</center> | ||
+ | | <center>'''Field Representation'''</center> | ||
+ | | <center>'''End'''</center> | ||
+ | |||
+ | |- | ||
+ | | '0x13' | ||
+ | | “PAGEREF” | ||
+ | | External parameter | ||
+ | | '0x14' | ||
+ | | Page Numbering | ||
+ | | '0x15' | ||
+ | |||
+ | |} | ||
+ | '''External Parameter of page reference field would include:''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | \h | ||
+ | | Creates a hyperlink to the bookmarked paragraph. | ||
+ | |||
+ | |- | ||
+ | | \p | ||
+ | | Causes the field to display its position relative to the source | ||
+ | |||
+ | bookmark. If the PAGEREF field is on the same page as the | ||
+ | |||
+ | bookmark, it omits "on page #" and returns "above" or "below" | ||
+ | |||
+ | only. If the PAGEREF field is not on the same page as the | ||
+ | |||
+ | bookmark, the string "on page #" is used. | ||
+ | |||
+ | |} | ||
+ | '''Composition of hyperlink field inside TOC''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | <center>'''Start'''</center> | ||
+ | | colspan="2" | <center>'''Field expression'''</center> | ||
+ | | <center>'''Separator '''</center> | ||
+ | | colspan="2" | <center>'''Field Representation'''</center> | ||
+ | | <center>'''End'''</center> | ||
+ | |||
+ | |- | ||
+ | | '0x13' | ||
+ | | “HYPERLINK” | ||
+ | | External parameter | ||
+ | | '0x14' | ||
+ | | Common text | ||
+ | | Page reference field | ||
+ | | '0x15' | ||
+ | |||
+ | |} | ||
+ | '''External Parameter of hyperlink field would include:''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | \l ''field-argument'' | ||
+ | | ''text ''in this switch's ''field-argument ''specifies a location in the file, such as a bookmark, where this hyperlink will jump. | ||
+ | |||
+ | |- | ||
+ | | \m | ||
+ | | Appends coordinates to a hyperlink for a server-side image map. | ||
+ | |||
+ | |- | ||
+ | | \n | ||
+ | | Causes the destination site to be opened in a new window. | ||
+ | |||
+ | |- | ||
+ | | \o ''field-argument'' | ||
+ | | ''text ''in this switch's ''field-argument ''specifies the Screen-Tip text for the hyperlink. | ||
+ | |||
+ | |- | ||
+ | | \t ''field-argument'' | ||
+ | | ''text ''in this switch's ''field-argument ''specifies the target to which the link should be redirected. Use this switch to link from a frames page to a page that you want to appear outside of the frames page. | ||
+ | |||
+ | The permitted values for ''text ''are: | ||
+ | |||
+ | * _top, whole page (the default) | ||
+ | * _self, same frame | ||
+ | * _blank, new window | ||
+ | * _parent, parent frame | ||
+ | |||
+ | |||
+ | |||
+ | |} | ||
+ | '''Composition of TOC field''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | <center>'''Start'''</center> | ||
+ | | colspan="2" | <center>'''Field expression'''</center> | ||
+ | | <center>'''Separator '''</center> | ||
+ | | <center>'''Field Representation'''</center> | ||
+ | | <center>'''End'''</center> | ||
+ | |||
+ | |- | ||
+ | | '0x13' | ||
+ | | “TOC” | ||
+ | | External parameter | ||
+ | | '0x14' | ||
+ | | TOC entries | ||
+ | | '0x15' | ||
+ | |||
+ | |} | ||
+ | '''Composition of TOC entries:''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | TOC Entry | ||
+ | | TOC Entry | ||
+ | | TOC Entry | ||
+ | | TOC Entry | ||
+ | | …... | ||
+ | |||
+ | |} | ||
+ | '''Composition of TOC entry:''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | Hyperlink Field | ||
+ | | '0x0D' | ||
+ | |||
+ | |} | ||
+ | '''External Parameter of TOC field would include:''' | ||
+ | |||
+ | {| class="prettytable" | ||
+ | | \a'' field-argument'' | ||
+ | | Includes captioned items, but omits caption labels and numbers. The identifier designated by ''text ''in this switch's ''field-argument ''corresponds to the caption label. | ||
+ | |||
+ | Use \c to build a table of captions with labels and numbers. | ||
+ | |||
+ | |- | ||
+ | | \b ''field-argument '' | ||
+ | | Includes entries only from the portion of the document marked by the bookmark named by ''text ''in this switch's ''field-argument''. | ||
+ | |||
+ | |- | ||
+ | | \c ''field-argument'' | ||
+ | | Includes figures, tables, charts, and other items that are numbered | ||
+ | |||
+ | by a SEQ field. The sequence identifier designated by | ||
+ | |||
+ | ''text ''in this switch's ''field-argument'', which corresponds to the | ||
+ | |||
+ | caption label, shall match the identifier in the corresponding SEQ | ||
+ | |||
+ | field. | ||
+ | |||
+ | |- | ||
+ | | \d'' field-argument'' | ||
+ | | When used with \s, the ''text ''in this switch's ''field-argument ''defines | ||
+ | |||
+ | the separator between sequence and page numbers. The default | ||
+ | |||
+ | separator is a hyphen (-). | ||
+ | |||
+ | |- | ||
+ | | \f'' field-argument '' | ||
+ | | Includes only those TC fields whose identifier exactly matches the | ||
+ | |||
+ | ''text ''in this switch's ''field-argument ''(which is typically a letter). | ||
+ | |||
+ | |- | ||
+ | | \h | ||
+ | | Makes the table of contents entries hyperlinks. | ||
+ | |||
+ | |- | ||
+ | | \l ''field-argument '' | ||
+ | | Includes TC fields that assign entries to one of the levels specified | ||
+ | |||
+ | by ''text ''in this switch's ''field-argument ''as a range having the form | ||
+ | |||
+ | ''startLevel''-''endLevel'', where ''startLevel ''and ''endLevel ''are integers, | ||
+ | |||
+ | and ''startLevel ''has a value equal-to or less-than ''endLevel''. TC fields | ||
+ | |||
+ | that assign entries to lower levels are skipped. | ||
+ | |||
+ | |- | ||
+ | | \n ''field-argument'' | ||
+ | | Without ''field-argument'', omits page numbers from the table of | ||
+ | |||
+ | contents. Page numbers are omitted from all levels unless a range | ||
+ | |||
+ | of entry levels is specified by ''text ''in this switch's ''field-argument''. A | ||
+ | |||
+ | range is specified as for \l. | ||
+ | |||
+ | |- | ||
+ | | \o ''field-argument'' | ||
+ | | Uses paragraphs formatted with all or the specified range of builtin | ||
+ | |||
+ | heading styles. Headings in a style range are specified by ''text ''in | ||
+ | |||
+ | this switch's ''field-argument ''using the notation specified as for \l, | ||
+ | |||
+ | where each integer corresponds to the style with a style ID of | ||
+ | |||
+ | HeadingX (e.g. 1 corresponds to Heading1). If no heading range is | ||
+ | |||
+ | specified, all heading levels used in the document are listed. | ||
+ | |||
+ | |- | ||
+ | | \p ''field-argument '' | ||
+ | | ''text ''in this switch's ''field-argument ''specifies a sequence of | ||
+ | |||
+ | characters that separate an entry and its page number. The default | ||
+ | |||
+ | is a tab with leader dots. | ||
+ | |||
+ | |- | ||
+ | | \s ''field-argument'' | ||
+ | | For entries numbered with a SEQ field, adds a prefix to | ||
+ | |||
+ | the page number. The prefix depends on the type of entry. ''text ''in | ||
+ | this switch's ''field-argument ''shall match the identifier in the SEQ | ||
+ | field. | ||
− | ''' | + | |- |
+ | | \t ''field-argument'' | ||
+ | | Uses paragraphs formatted with styles other than the built-in | ||
+ | |||
+ | heading styles. ''text ''in this switch's ''field-argument ''specifies those | ||
+ | |||
+ | styles as a set of comma-separated doublets, with each doublet | ||
+ | |||
+ | being a comma-separated set of style name and table of content | ||
+ | |||
+ | level. \t can be combined with \o. | ||
+ | |||
+ | |- | ||
+ | | \u | ||
+ | | Uses the applied paragraph outline level. | ||
+ | |||
+ | |- | ||
+ | | \w | ||
+ | | Preserves tab entries within table entries. | ||
+ | |||
+ | |- | ||
+ | | \x | ||
+ | | Preserves newline characters within table entries. | ||
+ | |||
+ | |- | ||
+ | | \z | ||
+ | | Hides tab leader and page numbers in Web layout view. | ||
+ | |||
+ | |} | ||
+ | |||
+ | '''The TOC record in PLC stream''' | ||
+ | |||
+ | The MS Word 2003 binary format records SPRMs together with their corresponding CPs in PLC streams for specific types of properties. Fields use the same data structure: the SPRMs that belong to fields are recorded in the PLC stream named PLCFLD. | ||
+ | |||
+ | '''The AOO current design of TOC loading''' | ||
+ | |||
+ | The current AOO design for loading a TOC from the MS Word 2003 binary format does not really load the TOC but generates it, because it does not try to parse the cached representation content of the TOC field. | ||
+ | |||
+ | After encountering the TOC field start character in the main content stream, the current loading design marks the position of the TOC block in the document, parses only the TOC field expression and parameters to create the TOC entry token patterns, and skips all of the cached TOC field representation content. While loading the rest of the document, it collects all outline paragraphs that have heading paragraph styles or corresponding outline level settings, and then generates the TOC entries one by one according to the token patterns. This mechanism causes the issues described earlier. | ||
+ | |||
+ | '''Detailed Design of TOC loading Enhancement''' | ||
+ | |||
+ | '''TOC Cached representation content loading''' | ||
+ | |||
+ | For a TOC field recorded in the main content stream, do not skip the cached representation content any more. | ||
+ | |||
+ | # In the TOC loading function ''SwWW8ImplReader::Read_F_TOX()'', return the field status FLD_TEXT, which indicates that the rest of the field (the representation content) should be parsed, instead of FLD_OK, which indicates that it should be ignored; | ||
+ | # In the page reference field loading function ''SwWW8ImplReader::Read_F_PgRef()'', likewise return FLD_TEXT instead of FLD_OK if the page reference field being loaded is inside a TOC field; | ||
+ | # Mark the TOC field and the hyperlink field as nestable field types in the external function ''AcceptableNestedField()''<nowiki>;</nowiki> | ||
+ | # When parsing a page reference field inside the cached representation content of a TOC field, convert the page reference field's representation content to a plain text string, to work around an ODF interoperability issue; | ||
+ | # If there is no hyperlink field inside, convert the page reference field to a hyperlink; | ||
+ | # In the final step of the field loading process, ''SwWW8ImplReader::End_Field()'', add branches that handle the TOC field, the hyperlink field, and the page reference field. | ||
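The return-status rule in the first two steps above can be modelled in a few lines. This is only an illustrative Python sketch, not the C++ code of the WW8 filter: the names FLD_TEXT and FLD_OK come from the design text, while `FieldStatus` and `status_after_expression` are hypothetical, and the sketch assumes that a page reference field outside a TOC keeps returning FLD_OK.

```python
from enum import Enum


class FieldStatus(Enum):
    """Statuses named in the design; the values here are only descriptive labels."""
    FLD_OK = "ignore the representation content"
    FLD_TEXT = "parse the representation content"


def status_after_expression(field_type: str, inside_toc: bool) -> FieldStatus:
    """Hypothetical helper: status returned once the field expression is parsed."""
    if field_type == "TOC":
        # Step 1: always parse the cached TOC entries.
        return FieldStatus.FLD_TEXT
    if field_type == "PAGEREF":
        # Step 2: parse the cached page number only inside a TOC (assumption:
        # a PAGEREF elsewhere keeps the old FLD_OK behaviour).
        return FieldStatus.FLD_TEXT if inside_toc else FieldStatus.FLD_OK
    raise ValueError(f"field type not covered by this sketch: {field_type}")


print(status_after_expression("PAGEREF", inside_toc=True).name)
```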
+ | |||
+ | '''TOC entries generating''' | ||
+ | |||
+ | While loading the rest of the document, no longer collect and generate TOC entries: | ||
+ | |||
+ | # In the TOC loading function ''SwWW8ImplReader::Read_F_TOX()'', disable the TOC updating flag after parsing the TOC expression; | ||
+ | |||
+ | '''Positioning of the TOC block and the rest of the document''' | ||
+ | |||
+ | At the start and at the end of the TOC field, move the current insertion position of the document: | ||
+ | |||
+ | # In the TOC loading function ''SwWW8ImplReader::Read_F_TOX()'', move the current insertion position into the TOC section; | ||
+ | # In the final step of the field loading process, ''SwWW8ImplReader::End_Field()'', move the current insertion position of the document back to the position that follows the TOC section; | ||
+ | |||
+ | ===OOXML Format=== | ||
Same with binary file format. | Same with binary file format. | ||
+ | ====Status==== | ||
− | + | Not Started... | |
− | + | ==Customized Formats of TOC Entry== | |
+ | |||
+ | ===Binary Format=== | ||
The customized character attributes are lost when loading an MS Word TOC created with "Use hyperlinks instead of page numbers" unchecked. For this kind of TOC, MS Word collects the customized character attributes of the target paragraphs into the TOC. | The customized character attributes are lost when loading an MS Word TOC created with "Use hyperlinks instead of page numbers" unchecked. For this kind of TOC, MS Word collects the customized character attributes of the target paragraphs into the TOC. | ||
− | + | ====Status==== | |
+ | |||
+ | Not Started | ||
+ | |||
+ | ===OOXML Format=== | ||
Same with binary file format. | Same with binary file format. | ||
+ | ====Status==== | ||
− | + | Not Started | |
+ | |||
+ | ==Export TOC to MS Word== | ||
+ | |||
+ | ===Binary Format=== | ||
− | |||
* Saving MS Word Binary Format Back | * Saving MS Word Binary Format Back | ||
The width of tab between chapter numbering and TOC entry will be changed. | The width of tab between chapter numbering and TOC entry will be changed. | ||
Line 129: | Line 440: | ||
The jumping hyperlink info will be lost when exporting odt TOC to MS Word binary TOC. | The jumping hyperlink info will be lost when exporting odt TOC to MS Word binary TOC. | ||
− | + | ====Status==== | |
+ | Not Started | ||
− | + | ===OOXML Format=== | |
− | + | ====Status==== | |
+ | Not Started | ||
+ | ==TOC Jumping with Page Numbers Only== | ||
− | + | Jump information is lost when loading an MS Word TOC created with "Use hyperlinks instead of page numbers" unchecked. In MS Word, end users can only jump from this kind of TOC by Ctrl+clicking the page number of a TOC entry. | ||
− | + | ===Status=== | |
+ | Not Started | ||
− | + | ==Accessibility== | |
− | The current TOC dialog | + | The current TOC dialog can not meet the accessibility requirements. |
− | == | + | ===Status=== |
− | |||
− | |||
+ | Not Started | ||
+ | ==Usability== | ||
+ | The current TOC dialog is difficult for end users to understand and use. Most end users can only create a TOC with the default settings and find customizing the attributes and styles confusing. | ||
+ | ===Status=== | ||
+ | Not Started; needs a UX designer. | ||
==Comments== | ==Comments== | ||
Line 161: | Line 479: | ||
− | [[Category:Writer/Effort]] | + | [[Category:Writer/Effort/Completed]] |
Latest revision as of 16:16, 8 January 2014
Table of Contents Improvements
Overall Description
TOC (Table of Contents) is a significant feature in AOO Writer. Although it provides powerful capabilities that improve end-user productivity, the following areas, especially fidelity with MS Word, still need improvement. I propose them as candidates for the next release.
Descriptions of Main Problems
Loading of MS Word TOC
Binary Format
- The TOC data of an MS Word document is not parsed completely. The TOC that is actually displayed comes from a silent update performed once the MS Word document has been loaded. Fidelity therefore cannot be ensured, especially when document contents that affect the TOC were changed after the TOC was created in MS Word.
- Suppose that, after the TOC has been created in MS Word, the paragraphs with Heading styles are deleted, or the Heading styles are removed from paragraphs that were collected into the TOC. When such an MS Word binary document is opened in Apache OpenOffice.org Writer, the TOC disappears.
- Suppose that, after the TOC has been created in MS Word, Heading styles are applied to new paragraphs. When such an MS Word binary document is opened in Apache OpenOffice.org Writer, new entries are added to the TOC.
- The tab between the chapter number and the TOC entry is lost when loading an MS Word document, so the gap between chapter number and TOC entry differs from the one shown in MS Word.
Status
Ongoing
Function Specification
Abstract
Provide a solution that preserves the TOC contents of DOC files by interpreting the TOC entry data stored inside the MS Office Word 2003 .DOC binary format file, instead of the current implementation, which generates the TOC contents from heading information collected from the main contents of the DOC file.
Motivation
The current Apache OpenOffice.org does not preserve the exact contents of the TOC entries by interpreting the TOC entry content caches stored inside MS Office Word 2003 DOC files. Instead, it generates the TOC entries from the heading paragraphs collected after the main contents of the whole document have been loaded.
This TOC loading strategy leads to three main issues:
- Poor fidelity for certain MS Word DOC files. Consider an MS Word DOC that contains several heading paragraphs and a TOC. If we delete all the main contents except the TOC, save the document, and reopen the file in MS Word, the TOC is exactly the same as before. If we open it in Apache OpenOffice.org, the TOC is completely empty.
- Manually created or removed TOC entries are lost. Some Word users add or remove TOC entries manually after generating the TOC in MS Word. MS Word shows these manual modifications faithfully when the DOC file is reopened, but they are lost when the DOC file is loaded in Apache OpenOffice.org.
- The paragraph/text/field attributes assigned in the TOC block are lost. In some TOC generating modes, the paragraph/text/field attributes assigned to a heading paragraph affect how the corresponding TOC entry is displayed. In the current Apache OpenOffice.org, such a TOC from an MS Word document is generated with the standard paragraph/text/field formatting for TOC entries.
Detailed Specification
The original TOC loading process and the improvement made by this feature
The current Word DOC TOC loading process generally consists of six steps:
1. Verify the exact position of the TOC block in the document;
2. Parse the TOC field expression and create the internal TOC model, with the TOC entries pattern and the TOC collecting rules, accordingly;
3. Skip the TOC field representation cache;
4. Skip the paragraph/text/field attributes that correspond to the range of the TOC field representation cache;
5. Collect the heading paragraphs according to the collecting rules while loading the main contents of the document;
6. Generate the TOC entries according to the TOC entries pattern.
This improvement to the MS Word DOC filter focuses on the TOC contents cache and changes the strategy as follows:
- <s>Remove the heading paragraph collecting step (step 5 above);</s>
- Remove the TOC generating/updating step (step 6 above);
- Add parsing of the TOC contents cache (extends step 3 above);
- Add parsing of the paragraph/text/field attributes that correspond to the TOC contents cache range (extends step 4 above).
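The difference between the two strategies can be illustrated with a small model. This is a minimal Python sketch under stated assumptions, not AOO code: `WordDoc`, `load_toc_regenerate` and `load_toc_from_cache` are hypothetical names, and a DOC file is reduced to the cached TOC entries plus the heading paragraphs found in the main contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordDoc:
    """Hypothetical stand-in for a loaded DOC file (not an AOO class)."""
    cached_toc_entries: List[str] = field(default_factory=list)  # TOC field representation cache
    heading_paragraphs: List[str] = field(default_factory=list)  # headings in the main contents


def load_toc_regenerate(doc: WordDoc) -> List[str]:
    """Current strategy: skip the cache, collect the headings, generate the entries."""
    return list(doc.heading_paragraphs)


def load_toc_from_cache(doc: WordDoc) -> List[str]:
    """Improved strategy: parse the cached entries and do not regenerate them."""
    return list(doc.cached_toc_entries)


# A document whose main contents were deleted after the TOC was created.
doc = WordDoc(cached_toc_entries=["1 Intro\t1", "2 Design\t3"], heading_paragraphs=[])
print(load_toc_regenerate(doc))   # the regenerated TOC has no entries
print(load_toc_from_cache(doc))   # the cached entries survive
```

The example document corresponds to the first issue in the motivation: with regeneration the TOC comes out empty, while reading the cache keeps what MS Word displayed.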
Behavioral differences introduced by this improvement
A table of user scenarios describes the differences.
In Apache OpenOffice.org with this improvement, open MS Word DOC document that:
Result: The TOC contents cache is preserved. |
In some cases the main contents are modified but the TOC is not updated before saving. In the current Apache OpenOffice.org, the loaded TOC always matches the main contents and heading paragraphs exactly. With this feature, the TOC contents cache recorded in the DOC document is preserved regardless. | |
In Apache OpenOffice.org with this improvement, open MS Word DOC document that:
Result: The user's manual modifications to the TOC are preserved. |
In some manually modified TOCs, the formatting may not be as good as that of a generated TOC. | |
In Apache OpenOffice.org with this improvement, open MS Word DOC document that:
|
Impact on Import/Export filters
Support of the new paragraph style's List Level attribute in import/export filter for the following file formats:
- Microsoft Word binary format (WW8)
Design Description
Brief introduction to the TOC in DOC files
The TOC record in the main content stream
In the MS Word 2003 binary format, a TOC is stored as a nested field whose field expression is “TOC” followed by external parameters. Like every field type in this format, a TOC field starts with the control character '0x13', followed by the field expression. If the field has representation content (the content shown in MS Word when field codes are toggled off), the optional control character '0x14' separates the field expression from the representation content. Finally, the control character '0x15' terminates the field.
As described above, the TOC in this format is a nested field: each TOC entry inside the TOC's representation content is composed of other field types, such as the HYPERLINK field and the page reference (PAGEREF) field. A TOC created in MS Word with default settings contains several TOC entries that refer to the outline paragraphs in the current document. Each entry is a HYPERLINK field followed by the paragraph break character '0x0D'. In addition, a PAGEREF field sits at the end of the HYPERLINK field's representation content and holds the page number. The following tables show the layout:
Composition of the page reference field inside the TOC
Start | Field expression | Field expression | Separator | Field Representation | End |
'0x13' | “PAGEREF” | External parameter | '0x14' | Page Numbering | '0x15' |
External parameters of the page reference field include:
\h | Creates a hyperlink to the bookmarked paragraph. |
\p | Causes the field to display its position relative to the source bookmark. If the PAGEREF field is on the same page as the bookmark, it omits "on page #" and returns "above" or "below" only. If the PAGEREF field is not on the same page as the bookmark, the string "on page #" is used. |
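Composition of the hyperlink field inside the TOC
Start | Field expression | Field expression | Separator | Field Representation | Field Representation | End |
'0x13' | “HYPERLINK” | External parameter | '0x14' | Common text | Page reference field | '0x15' |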
External parameters of the hyperlink field include:
\l field-argument | text in this switch's field-argument specifies a location in the file, such as a bookmark, where this hyperlink will jump. |
\m | Appends coordinates to a hyperlink for a server-side image map. |
\n | Causes the destination site to be opened in a new window. |
\o field-argument | text in this switch's field-argument specifies the Screen-Tip text for the hyperlink. |
\t field-argument | text in this switch's field-argument specifies the target to which the link should be redirected. Use this switch to link from a frames page to a page that you want to appear outside of the frames page.
The permitted values for text are:
- _top, whole page (the default)
- _self, same frame
- _blank, new window
- _parent, parent frame
|
Composition of TOC field
'0x13' | “TOC” | External parameter | '0x14' | TOC entries | '0x15' |
Composition of TOC entries:
TOC Entry | TOC Entry | TOC Entry | TOC Entry | …... |
Composition of TOC entry:
Hyperlink Field | '0x0D' |
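The nesting described by the composition tables above can be sketched in a few lines of Python. This is only an illustration, not the AOO import code: it assumes the main content stream has already been decoded into a string, the names Field and parse_field are hypothetical, and the sample bookmark _Toc1 and the entry text are made up.

```python
from dataclasses import dataclass, field
from typing import List, Union

FIELD_BEGIN = "\x13"      # starts a field
FIELD_SEPARATOR = "\x14"  # optional: splits the expression from the cached result
FIELD_END = "\x15"        # terminates a field


@dataclass
class Field:
    """One parsed field: its expression and its (possibly nested) cached result."""
    expression: str = ""
    result: List[Union[str, "Field"]] = field(default_factory=list)


def parse_field(text, pos=0):
    """Parse the field that starts at text[pos]; return (Field, next position)."""
    if text[pos] != FIELD_BEGIN:
        raise ValueError("expected field begin mark 0x13")
    node = Field()
    pos += 1
    in_result = False
    while pos < len(text):
        ch = text[pos]
        if ch == FIELD_BEGIN:
            # A nested field: parse it recursively and resume after its end mark.
            child, pos = parse_field(text, pos)
            if in_result:
                node.result.append(child)
            # A field nested inside the expression is dropped in this sketch.
            continue
        if ch == FIELD_SEPARATOR:
            in_result = True
        elif ch == FIELD_END:
            return node, pos + 1
        elif in_result:
            # Merge consecutive characters into one text run.
            if node.result and isinstance(node.result[-1], str):
                node.result[-1] += ch
            else:
                node.result.append(ch)
        else:
            node.expression += ch
        pos += 1
    raise ValueError("field is missing its end mark 0x15")


# Hypothetical sample: a TOC with a single entry (HYPERLINK field + 0x0D),
# where the HYPERLINK result ends with a PAGEREF field.
pageref = "\x13 PAGEREF _Toc1 \\h \x141\x15"
entry = "\x13 HYPERLINK \\l \"_Toc1\" \x14Introduction\t" + pageref + "\x15\r"
toc = "\x13 TOC \\o \"1-3\" \\h \x14" + entry + "\x15"
root, end = parse_field(toc)
```

For the sample, root.result holds the HYPERLINK field followed by the paragraph break, and the HYPERLINK field's result holds the entry text followed by the PAGEREF field, which mirrors the composition tables.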
External parameters of the TOC field include:
\a field-argument | Includes captioned items, but omits caption labels and numbers. The identifier designated by text in this switch's field-argument corresponds to the caption label. Use \c to build a table of captions with labels and numbers. |
\b field-argument | Includes entries only from the portion of the document marked by the bookmark named by text in this switch's field-argument. |
\c field-argument | Includes figures, tables, charts, and other items that are numbered by a SEQ field. The sequence identifier designated by text in this switch's field-argument, which corresponds to the caption label, shall match the identifier in the corresponding SEQ field. |
\d field-argument | When used with \s, the text in this switch's field-argument defines the separator between sequence and page numbers. The default separator is a hyphen (-). |
\f field-argument | Includes only those TC fields whose identifier exactly matches the text in this switch's field-argument (which is typically a letter). |
\h | Makes the table of contents entries hyperlinks. |
\l field-argument | Includes TC fields that assign entries to one of the levels specified by text in this switch's field-argument as a range having the form startLevel-endLevel, where startLevel and endLevel are integers, and startLevel has a value equal-to or less-than endLevel. TC fields that assign entries to lower levels are skipped. |
\n field-argument | Without field-argument, omits page numbers from the table of contents. Page numbers are omitted from all levels unless a range of entry levels is specified by text in this switch's field-argument. A range is specified as for \l. |
\o field-argument | Uses paragraphs formatted with all or the specified range of builtin heading styles. Headings in a style range are specified by text in this switch's field-argument using the notation specified as for \l, where each integer corresponds to the style with a style ID of HeadingX (e.g. 1 corresponds to Heading1). If no heading range is specified, all heading levels used in the document are listed. |
\p field-argument | text in this switch's field-argument specifies a sequence of characters that separate an entry and its page number. The default is a tab with leader dots. |
\s field-argument | For entries numbered with a SEQ field, adds a prefix to the page number. The prefix depends on the type of entry. text in this switch's field-argument shall match the identifier in the SEQ field. |
\t field-argument | Uses paragraphs formatted with styles other than the built-in heading styles. text in this switch's field-argument specifies those styles as a set of comma-separated doublets, with each doublet being a comma-separated set of style name and table of content level. \t can be combined with \o. |
\u | Uses the applied paragraph outline level. |
\w | Preserves tab entries within table entries. |
\x | Preserves newline characters within table entries. |
\z | Hides tab leader and page numbers in Web layout view. |
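To make the switch table above concrete, the following Python sketch splits a TOC field expression into its switches. It is an illustration under stated assumptions, not the parser used by AOO or MS Word: the function names parse_toc_switches and parse_level_range are hypothetical, switch letters are assumed to be lower case, and the sample expression is made up.

```python
import re

# Switches that take a field-argument according to the table above.
# The argument of the n switch is optional, so it is handled like the others.
SWITCHES_WITH_ARGUMENT = set("abcdflnopst")

# A token is a switch (backslash + letter), a quoted argument, or a bare word.
TOKEN = re.compile(r'\\(\w)|"([^"]*)"|(\S+)')


def parse_toc_switches(expression):
    """Return a dict mapping each switch letter to its argument, or True for a flag."""
    tokens = [match.groups() for match in TOKEN.finditer(expression)]
    if not tokens or tokens[0][2] != "TOC":
        raise ValueError("not a TOC field expression")
    switches = {}
    i = 1
    while i < len(tokens):
        switch = tokens[i][0]
        i += 1
        if switch is None:
            continue  # an argument without a preceding switch is skipped here
        value = True
        # Consume the next token as the argument if it is not itself a switch.
        if switch in SWITCHES_WITH_ARGUMENT and i < len(tokens) and tokens[i][0] is None:
            _, quoted, bare = tokens[i]
            value = quoted if quoted is not None else bare
            i += 1
        switches[switch] = value
    return switches


def parse_level_range(argument):
    """Turn a range such as 1-3 into a (startLevel, endLevel) tuple."""
    start, end = (int(part) for part in argument.split("-"))
    if start > end:
        raise ValueError("startLevel must not exceed endLevel")
    return start, end


switches = parse_toc_switches('TOC \\o "1-3" \\h \\z \\u')
levels = parse_level_range(switches["o"])
```

For the sample expression, switches maps o to the string 1-3 and h, z and u to True, and levels is the tuple (1, 3).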
The TOC record in PLC stream
As we know, the MS Word 2003 binary format records SPRMs together with their corresponding CPs in PLC streams for specific types of properties. Fields use the same data structure: the records for all fields are stored in the PLC stream named PLCFLD.
The current AOO design of TOC loading
The current AOO design for loading a TOC in the MS Word 2003 binary format is not really “loading” but “generating”, because it does not try to parse the cached representation content of the TOC field.
After encountering the TOC field start control character in the main content stream, the current loader marks the position of the TOC block in the document, parses only the TOC field expression and its parameters in order to create the TOC entry token patterns, and skips the cached representation content of the TOC field entirely. While loading the rest of the document, it collects all outline paragraphs that have heading paragraph styles or corresponding outline level settings, and generates the TOC entries one by one from the token patterns. This loading mechanism leads to the known issues described earlier.
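The fidelity problem of the “generating” approach can be illustrated with a small Python model. Everything in it is hypothetical (the function generate_toc, the token names and the sample paragraphs); it only shows why entries rebuilt from the document body can differ from the entries MS Word cached.

```python
def generate_toc(paragraphs, pattern, max_level=3):
    """Rebuild TOC entries from outline paragraphs, as the current loader does."""
    entries = []
    for text, level, page in paragraphs:
        if level is None or level > max_level:
            continue  # not an outline paragraph within the requested levels
        values = {"text": text, "tab": "\t", "page": str(page)}
        # Assemble one entry by following the token pattern.
        entries.append("".join(values[token] for token in pattern))
    return entries


# Entries cached in the file when the TOC was last updated in MS Word.
cached_entries = ["Introduction\t1", "Design\t2"]

# Document body at load time: (text, outline level or None, page number).
# A new heading was added after the TOC was created.
paragraphs = [
    ("Introduction", 1, 1),
    ("Some body text", None, 1),
    ("Design", 1, 2),
    ("Appendix", 1, 5),
]

generated_entries = generate_toc(paragraphs, ["text", "tab", "page"])
```

Here generated_entries contains a third entry for "Appendix" that cached_entries lacks, so the TOC shown after loading no longer matches what MS Word displays for the same file.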
Detailed Design of TOC loading Enhancement
TOC Cached representation content loading
For a TOC field recorded in the main content stream, no longer skip the cached representation content.
- In the TOC loading function SwWW8ImplReader::Read_F_TOX(), return the field status FLD_TEXT, which indicates that the rest of the field (the representation content) should be parsed, instead of FLD_OK, which indicates that it should be ignored;
- In the page reference field loading function SwWW8ImplReader::Read_F_PgRef(), likewise return FLD_TEXT instead of FLD_OK if the page reference field being loaded is inside a TOC field;
- Mark the TOC field and the hyperlink field as nestable field types in the external function AcceptableNestedField();
- When parsing a page reference field inside the cached representation content of a TOC field, convert the representation content of the page reference field to a plain text string, because of a specific interoperability issue with ODF;
- If there is no hyperlink field inside, convert the page reference field to a hyperlink;
- In the final step of the field loading process, SwWW8ImplReader::End_Field(), add corresponding branches to handle the TOC field, hyperlink field and page reference field.
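The decisions listed above can be modelled in Python as follows. This is a sketch of the proposed behaviour, not the SwWW8ImplReader code: FLD_OK and FLD_TEXT are taken from the text, while FieldStatus, LoadedEntry and the function names are hypothetical. Using the page reference bookmark as the link target when no hyperlink field is present is one interpretation of the proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FieldStatus(Enum):
    FLD_OK = "ignore the cached representation content"
    FLD_TEXT = "parse the cached representation content"


def read_toc_field():
    """Proposed status for a TOC field: keep parsing its cached content."""
    return FieldStatus.FLD_TEXT


def read_pageref_field(inside_toc):
    """Proposed status for a page reference field, depending on its context."""
    return FieldStatus.FLD_TEXT if inside_toc else FieldStatus.FLD_OK


@dataclass
class LoadedEntry:
    text: str
    page_text: str               # page number kept as plain text
    link_target: Optional[str]   # bookmark the entry jumps to


def load_cached_entry(text, pageref_bookmark, pageref_result, hyperlink_bookmark=None):
    """Build one TOC entry from cached content instead of regenerating it."""
    # Without a hyperlink field, the page reference supplies the link target.
    if hyperlink_bookmark is not None:
        target = hyperlink_bookmark
    else:
        target = pageref_bookmark
    return LoadedEntry(text=text, page_text=pageref_result, link_target=target)


entry = load_cached_entry("Introduction", "_Toc1", "1")
```

In the sample call the entry has no hyperlink field, so its link target falls back to the made-up page reference bookmark _Toc1.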
TOC entries generating
While loading the rest of the document, no longer collect and generate TOC entries:
- In the TOC loading function SwWW8ImplReader::Read_F_TOX(), disable the TOC updating flag after parsing the TOC expression;
Positioning of the TOC block and the rest of the document
At the start and end steps of the TOC field, move the CURRENT insertion position of the document:
- In the TOC loading function SwWW8ImplReader::Read_F_TOX(), move the CURRENT insertion position into the TOC section;
- In the final step of the field loading process, SwWW8ImplReader::End_Field(), move the CURRENT insertion position of the document back to the position just after the TOC section;
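The two position moves above amount to pushing and popping an insertion target. A minimal Python model, assuming a hypothetical DocumentLoader class and made-up sample content, could look like this:

```python
class DocumentLoader:
    """Toy model of a loader with a CURRENT insertion position."""

    def __init__(self):
        self.body = []
        # Stack of insertion targets; the last one is the CURRENT position.
        self._targets = [self.body]

    def insert(self, item):
        """Insert content at the CURRENT insertion position."""
        self._targets[-1].append(item)

    def begin_toc(self):
        """TOC field start: create the TOC section and move into it."""
        toc_section = []
        self.insert(("toc", toc_section))
        self._targets.append(toc_section)

    def end_toc(self):
        """TOC field end: move back to the position after the TOC section."""
        if len(self._targets) == 1:
            raise RuntimeError("end_toc() called without begin_toc()")
        self._targets.pop()


loader = DocumentLoader()
loader.insert("Title")
loader.begin_toc()
loader.insert("Introduction\t1")   # cached entries land inside the TOC section
loader.insert("Design\t2")
loader.end_toc()
loader.insert("Introduction")      # later content lands after the TOC section
```

After these calls, loader.body holds the title, the TOC section with its two cached entries, and then the first body paragraph, in that order.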
OOXML Format
Same as the binary file format.
Status
Not Started
Customized Formats of TOC Entry
Binary Format
The customized character attributes are lost when loading an MS Word TOC that was created with "Use hyperlinks instead of page numbers" unchecked. For this kind of TOC, MS Word can collect the customized character attributes of the target paragraphs into the TOC.
Status
Not Started
OOXML Format
Same as the binary file format.
Status
Not Started
Export TOC to MS Word
Binary Format
- Saving back to MS Word binary format
The width of the tab between the chapter numbering and the TOC entry changes.
- Saving ODT to MS Word binary format
The jump hyperlink information is lost when exporting an ODT TOC to an MS Word binary TOC.
Status
Not Started
OOXML Format
Status
Not Started
TOC Jumping with Page Numbers Only
Jump information is lost when loading an MS Word TOC that was created with "Use hyperlinks instead of page numbers" unchecked. With this kind of TOC in MS Word, end users can jump only by pressing Ctrl and clicking the page number of a TOC entry.
Status
Not Started
Accessibility
The current TOC dialog does not meet the accessibility requirements.
Status
Not Started
Usability
The current TOC dialog is difficult for end users to understand and use. Most end users can only create a TOC with the default settings and find it confusing to customize the attributes and styles.
Status
Not Started; a UX designer is needed.