WordprocessingML

From Apache OpenOffice Wiki
Revision as of 12:45, 20 July 2012 by Bjcheny (Talk | contribs)


WordprocessingML is the XML format used by Microsoft Word 2007/2010 and is part of the Office Open XML specification. It defines the structure of all word-processing data. Shapes and text boxes used in MS Word 2007 are described in VML.

Finding sample files

Search the web (for example with Google) for "download docx".

Code structure

We can divide the import filter into three main parts: the XML parser (XML token handler), the XSL templates, and DomainMapper (content handler).

The main code can be found in the following paths:

\writerfilter\source\dmapper is the main part, which processes the WordprocessingML content and handles the Word data.

\writerfilter\source\ooxml is the XML parser, which reads the XML tokens from the file.

WordprocessingML XML token parser

OOXMLFastContextHandler is the main class for XML parsing. It inherits from xml::sax::XFastContextHandler, which is used to parse all XML-format files.

All handlers that parse WordprocessingML tokens inherit from OOXMLFastContextHandler. They are located in \writerfilter\source\ooxml.

   e.g. OOXMLFastContextHandler_wordprocessingml_CT_Picture
   This handler parses the tokens defined for drawing pictures in Word.

The class diagram is shown below:

ContextHandler.jpg
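
The handler hierarchy described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: ContextHandler, TextHandler, PictureHandler, Event and replay are hypothetical names invented for this sketch, not the writerfilter classes or the xml::sax::XFastContextHandler interface. The idea it shows is that each handler creates the handler for its child element, so specialised subclasses are picked by token while the parser keeps a simple stack.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

using Log = std::vector<std::string>;

// Hypothetical, simplified base class (stands in for OOXMLFastContextHandler).
class ContextHandler {
public:
    explicit ContextHandler(Log& log) : mLog(log) {}
    virtual ~ContextHandler() = default;
    // Called when a child element starts; returns the handler for that child.
    virtual std::unique_ptr<ContextHandler> createChild(const std::string& token);
    // Called for character data inside the current element; ignored by default.
    virtual void characters(const std::string& text) { (void)text; }
protected:
    Log& mLog;
};

// Hypothetical handler for text runs (the w:t element).
class TextHandler : public ContextHandler {
public:
    using ContextHandler::ContextHandler;
    void characters(const std::string& text) override { mLog.push_back("text:" + text); }
};

// Hypothetical handler for pictures (the w:pict element, type CT_Picture).
class PictureHandler : public ContextHandler {
public:
    explicit PictureHandler(Log& log) : ContextHandler(log) { mLog.push_back("picture"); }
};

// Chooses a specialised handler by token; unknown tokens get the generic one.
std::unique_ptr<ContextHandler> ContextHandler::createChild(const std::string& token) {
    if (token == "w:t") return std::make_unique<TextHandler>(mLog);
    if (token == "w:pict") return std::make_unique<PictureHandler>(mLog);
    return std::make_unique<ContextHandler>(mLog);
}

// A pre-tokenised parser event; a real SAX parser would produce these from the file.
struct Event { enum Kind { Start, Text, End } kind; std::string data; };

// Replays the events against a stack of handlers and returns what they recorded.
Log replay(const std::vector<Event>& events) {
    Log log;
    std::vector<std::unique_ptr<ContextHandler>> stack;
    stack.push_back(std::make_unique<ContextHandler>(log));  // document root
    for (const Event& e : events) {
        switch (e.kind) {
        case Event::Start: stack.push_back(stack.back()->createChild(e.data)); break;
        case Event::Text:  stack.back()->characters(e.data); break;
        case Event::End:   assert(stack.size() > 1); stack.pop_back(); break;
        }
    }
    return log;
}
```

Replaying the events for a paragraph that contains a text run and a picture leaves one text entry and one picture entry in the log. Supporting another element in this sketch means adding one subclass and one branch in createChild.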

XSL template for XML context handler

\writerfilter\source\ooxml contains many .xsl files. These XSL (Extensible Stylesheet Language) files define the templates from which the XML context handlers are generated. After you build writerfilter, the .hxx and .cxx files corresponding to these XSL definitions are generated under the build path \misc. To add a new context handler for WordprocessingML, we only need to add the token definitions to the data model; the corresponding handler is then generated automatically from the templates.

Three kinds of files are defined:

  • XML model – model.xml

Defines all the OOXML tokens and their relationships. New definitions can be added here to support more XML tokens.

  • Class file templates

Define the .hxx and .cxx file templates that are used to generate the handlers for a specific area of WordprocessingML.

   e.g. factorytools.xsl

  • ContextHandler templates

Define the OOXML token context handler templates.

   e.g. resourcestools.xsl
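
To make the "model in, handler source out" idea concrete, here is a minimal sketch written in C++ rather than XSLT. It is an analogy only: the real generation is done by the .xsl templates during the build, and TypeDefinition, handlerClassName and generateDeclaration are hypothetical names invented here. The child tokens used in the usage note are illustrative as well. Only the class-name pattern is taken from the generated handler name mentioned earlier on this page.

```cpp
#include <string>
#include <vector>

// Hypothetical model entry: a namespace, a complex type and its child tokens.
struct TypeDefinition {
    std::string ns;                        // e.g. "wordprocessingml"
    std::string name;                      // e.g. "CT_Picture"
    std::vector<std::string> childTokens;  // illustrative child token names
};

// Builds the handler class name, following the pattern
// OOXMLFastContextHandler_<namespace>_<type>.
std::string handlerClassName(const TypeDefinition& def) {
    return "OOXMLFastContextHandler_" + def.ns + "_" + def.name;
}

// Emits a header-style class declaration, playing the role of a template.
std::string generateDeclaration(const TypeDefinition& def) {
    std::string out = "class " + handlerClassName(def) +
                      " : public OOXMLFastContextHandler\n{\n";
    for (const std::string& token : def.childTokens)
        out += "    // handles child token: " + token + "\n";
    out += "};\n";
    return out;
}
```

For an entry with namespace "wordprocessingml" and type "CT_Picture", handlerClassName returns OOXMLFastContextHandler_wordprocessingml_CT_Picture. The point of this design is that a new token only requires a new model entry, while the template logic stays untouched.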

DomainMapper

This part deals with the file content. After the context handlers have parsed the XML tokens from the file, DomainMapper acts as the main stream handler: it reads the content from the XML tokens and decides how to arrange that content and how to insert it into the core model. The class diagram is shown below:

DomainMapper.jpg

  • The source for this part is in the following files:
   \writerfilter\inc\domainmapper.hxx
   \writerfilter\source\DomainMapper.cxx
   \writerfilter\source\DomainMapper_Impl.cxx
   \writerfilter\source\DomainMapper_Impl.hxx
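
The role of a content handler can be sketched as follows. This is a toy model, not the DomainMapper interface: SimpleMapper, Run, Paragraph and all method names are hypothetical, and the "core model" is reduced to a vector of paragraphs. It shows the general pattern of receiving properties and text as separate events and deciding how to combine them before inserting them into the target model.

```cpp
#include <string>
#include <vector>

// Hypothetical target model: paragraphs made of formatted text runs.
struct Run { std::string text; bool bold; };
using Paragraph = std::vector<Run>;

// Hypothetical, simplified content handler in the spirit of DomainMapper.
class SimpleMapper {
public:
    // Opens a new, empty paragraph in the model.
    void startParagraph() { mParagraphs.emplace_back(); }
    // Records a run property that arrives before the text it applies to.
    void setBold(bool bold) { mBold = bold; }
    // Inserts text into the current paragraph with the pending properties.
    void text(const std::string& chars) {
        if (mParagraphs.empty()) startParagraph();  // tolerate text outside a paragraph
        mParagraphs.back().push_back(Run{chars, mBold});
    }
    // Resets run properties so they do not leak into the next run.
    void endRun() { mBold = false; }
    const std::vector<Paragraph>& paragraphs() const { return mParagraphs; }
private:
    std::vector<Paragraph> mParagraphs;
    bool mBold = false;
};
```

Calling startParagraph(), setBold(true), text("Hello"), endRun() and text(" world") on one mapper yields a single paragraph with a bold run followed by a plain run. Keeping pending properties in the mapper, rather than in the parser, is one way to keep token parsing and content arrangement separate, as the three-part split above suggests.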

Known issues

There are several weak areas in OOXML file import:

  • Unsupported objects
   Shape/Textbox
   OLE
   Control
   Chart

  • Supported with limitations
   TOC
   Field
   Section
   Footnote & Endnote
