Difference between revisions of "Office Open XML"

From Apache OpenOffice Wiki
(Added FileFormat section.)
(Added Import Design section)
Line 10: Line 10:
 
** ECMA-376          (http://www.ecma-international.org/publications/standards/Ecma-376.htm)
** (ECMA) TC-45      (http://www.ecma-international.org/news/TC45_current_work/TC45_available_docs.htm)
* Microsoft Extensions ([Overview http://msdn.microsoft.com/en-us/library/gg548604%28v=office.12%29.aspx]):
+
* Microsoft Extensions ([http://msdn.microsoft.com/en-us/library/gg548604%28v=office.12%29.aspx Overview]):
** for PowerPoint (http://msdn.microsoft.com/en-us/library/dd926741%28v=office.12%29.aspx)
  

Revision as of 07:48, 21 May 2014

Office Open XML (OOXML) is an XML-based file format that has been published as ISO/IEC 29500 and ECMA-376. It has been the default file format of Microsoft Office since Office 2007.

A new import filter is currently (as of May 2014) in development. Its design and implementation are described on this page. The legacy importer and exporter are described here.

File Format

The file format is described by several documents:

There are three main markup languages (MLs) for the three main applications:

  • WordprocessingML
  • SpreadsheetML
  • PresentationML

Markup languages that are shared by all applications are:

  • DrawingML
  • VML (for legacy files)

Like ODF files, OOXML files are ZIP containers with one entry per XML stream. OOXML calls these entries parts.
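
Because a package is an ordinary ZIP container, its parts can be listed with any ZIP library. The following is a minimal sketch using Python's standard zipfile module. The helper names list_parts and build_sample_package are made up for this illustration, and the sample package holds only two placeholder entries, so it is not a valid OOXML document.

```python
import io
import zipfile


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of all parts (ZIP entries) in an OOXML package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def build_sample_package() -> bytes:
    """Build a tiny in-memory ZIP that only mimics the layout of a package.

    The XML bodies are placeholders, not content required by the standard.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


for part_name in list_parts(build_sample_package()):
    print(part_name)
```

To inspect a real file, read its bytes from disk and pass them to list_parts.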


Import Design

The design of the new OOXML import adds a layer of abstraction on top of a push parser. The events for start tags, end tags, and text are handled by the new import framework according to rules derived directly from the specifications. The specifications introduce the concepts of complex types and simple types. Put very simply, a complex type describes parent-child relationships between elements, while a simple type describes the type of an attribute value. The framework's task is to translate elements to complex types and to preprocess attribute values according to their simple types. This has several advantages over a classical bare-bones push parser:

  • There are a few cases where the same element is mapped to different complex types. The framework now resolves this ambiguity automatically.
  • The callbacks are more readable because they are tied directly to a complex type, which can be looked up in the spec.
  • Much of the low-level processing is now done automatically and therefore
    • does not obfuscate the import code
    • is less error-prone
    • requires the developer to write less code
    • is potentially faster
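
The idea of such a layer can be sketched in a few lines of Python on top of the standard xml.sax push parser. This is only an illustration under stated assumptions, not the framework's actual code: the element names, the complex type names (CT_Document, CT_Paragraph, CT_Row, CT_Table), the attribute converters, and the rule tables are all hypothetical. In the real import, the rules are derived from the specifications.

```python
import xml.sax

# Hypothetical rules: (parent complex type, element name) -> complex type.
# The element "item" maps to two different complex types depending on its parent.
ELEMENT_RULES = {
    (None, "doc"): "CT_Document",
    ("CT_Document", "item"): "CT_Paragraph",
    ("CT_Document", "table"): "CT_Table",
    ("CT_Table", "item"): "CT_Row",
}

# Hypothetical simple types: (complex type, attribute name) -> converter.
ATTRIBUTE_RULES = {
    ("CT_Paragraph", "level"): int,
    ("CT_Row", "hidden"): lambda text: text in ("1", "true"),
}


class TypedImportHandler(xml.sax.ContentHandler):
    """Translates raw start/end tag events into typed callback invocations."""

    def __init__(self, callbacks):
        super().__init__()
        self.callbacks = callbacks  # complex type name -> callable(values)
        self.type_stack = []        # complex types of the currently open elements

    def startElement(self, name, attrs):
        parent = self.type_stack[-1] if self.type_stack else None
        # Disambiguate by parent type; an unknown element raises KeyError here.
        complex_type = ELEMENT_RULES[(parent, name)]
        self.type_stack.append(complex_type)
        # Preprocess attribute values; attributes without a rule stay strings.
        values = {
            key: ATTRIBUTE_RULES.get((complex_type, key), str)(text)
            for key, text in attrs.items()
        }
        callback = self.callbacks.get(complex_type)
        if callback is not None:
            callback(values)

    def endElement(self, name):
        self.type_stack.pop()


def run_import(xml_text, callbacks):
    """Parse xml_text and dispatch typed events to the given callbacks."""
    xml.sax.parseString(xml_text.encode("utf-8"), TypedImportHandler(callbacks))


seen = []
run_import(
    '<doc><item level="2"/><table><item hidden="1"/></table></doc>',
    {
        "CT_Paragraph": lambda values: seen.append(("paragraph", values)),
        "CT_Row": lambda values: seen.append(("row", values)),
    },
)
print(seen)
```

The callbacks never see the element name "item" or the raw attribute strings. They receive values that are already typed, and each callback is registered under the complex type it implements.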

The OOXML parser and the importer callbacks are connected via a domain-specific language (DSL). Together with the automatic preprocessing of the specifications, this makes it possible to develop automatic analysis and processing programs. These allow us to:

  • analyze how much of the specification is handled by import callbacks and thus
  • determine which complex types and attributes still need more work
  • compile documentation contained in the document code
  • track progress of the development
  • improve the development process by providing means to, e.g., search for the implementation of a certain element or complex type
  • add logging and debugging functionality on demand
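
As an illustration of the first two points, a coverage analysis could compare the complex types found in the preprocessed specification with those that have callbacks. The sketch below is written in Python and does not show the actual DSL or tooling. The function name coverage_report, the input format (a mapping from complex type name to a set of attribute names), and the sample data are assumptions made for this example.

```python
def coverage_report(spec, handled):
    """Compare specified complex types and attributes with the handled ones.

    spec and handled both map a complex type name to a set of attribute names.
    Returns the fraction of specified complex types that have a callback and a
    mapping from complex type name to the sorted attributes still missing.
    """
    missing = {}
    for complex_type, attributes in spec.items():
        if complex_type not in handled:
            # No callback at all: every attribute of this type is unhandled.
            missing[complex_type] = sorted(attributes)
        else:
            gap = attributes - handled[complex_type]
            if gap:
                missing[complex_type] = sorted(gap)
    handled_count = sum(1 for complex_type in spec if complex_type in handled)
    ratio = handled_count / len(spec) if spec else 1.0
    return ratio, missing


# Hypothetical sample data, not taken from the real specification.
spec = {
    "CT_Paragraph": {"level", "style"},
    "CT_Row": {"hidden"},
    "CT_Table": set(),
}
handled = {
    "CT_Paragraph": {"level"},
    "CT_Table": set(),
}

ratio, missing = coverage_report(spec, handled)
print(f"{ratio:.0%} of complex types have callbacks")
for complex_type, attributes in missing.items():
    print(complex_type, "is missing", attributes)
```

Run regularly, a report like this would also give a simple measure for tracking progress.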