OOXML/Markup Compatibility and Extensibility

From Apache OpenOffice Wiki
Revision as of 12:43, 15 January 2015 by Adailton

Although the OOXML spec defines a specific set of allowed elements, Microsoft sometimes extends this set with proprietary elements that are specific to newer versions of Office. For example, if you insert a shape into a document in Word 2013, it is stored using a "word processing shape" element structure, which is not part of the OOXML spec. For compatibility with older versions of Word, however, the file also includes a second version of the shape that uses an element structure defined in the spec, albeit in the legacy VML drawing format.

Part 3 of the OOXML spec defines "Markup compatibility and extensibility", which is the means by which a document can include multiple versions of the same piece of content in different formats. The XML is structured in such a way that applications which know how to work with elements in particular namespaces can use the more information-rich version of the content, while those which support only those elements defined in the standard can use an alternate version of the content. The XML structure is conceptually like a series of if-then statements with an else clause at the end.
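
To make the if-then-else analogy concrete, here is a minimal Python sketch of the selection logic on its own, with no XML involved. The function name `select_content`, the tuple layout, and the placeholder content strings are assumptions made for illustration; they are not taken from the spec.

```python
from typing import List, Optional, Set, Tuple

WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical model: each choice is (namespaces it requires, its content).
Choice = Tuple[Set[str], str]


def select_content(choices: List[Choice], fallback: Optional[str],
                   understood: Set[str]) -> Optional[str]:
    """Return the first choice whose requirements are all understood, else the fallback."""
    for required, content in choices:
        if required <= understood:   # the "if" / "else if" tests, in document order
            return content
    return fallback                  # the "else" clause (None if there is no fallback)


choices = [({WPS}, "shape as wps markup")]
old_reader = select_content(choices, "shape as VML", understood={W})
new_reader = select_content(choices, "shape as VML", understood={W, WPS})
```

Here `old_reader` ends up with the fallback content, while `new_reader` gets the richer version because it understands every namespace the choice requires.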

As an example, let's consider a shape object in a .docx file created in Word 2013. At the top of the file, we have the following namespace definitions:

<w:document
  xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
  xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

Then, in the body of the document, the shape is stored as follows:

<mc:AlternateContent>
  <mc:Choice Requires="wps">
    <w:drawing> ... </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict> ... </w:pict>
  </mc:Fallback>
</mc:AlternateContent>

The three markup compatibility elements used here are as follows:

  • AlternateContent - A container for a sequence of multiple representations of a given piece of content. The program reading the file should only process one of these, and the one chosen should be based on which conditions match.
  • Choice - Essentially an if statement. Its Requires attribute lists the namespaces that the reader must support in order to successfully process all of the descendant elements. Some of these namespaces may be defined in the spec and some may be proprietary; the reader must support every one listed in order to fully understand the content. Note also that the namespaces are referenced by prefix, not URI, with multiple prefixes separated by whitespace.
  • Fallback - This is what the reader should look at if none of the preceding Choice elements match. This section is only supposed to contain elements that are part of the standard.
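
Because Requires refers to prefixes rather than URIs, a reader has to map each prefix back to its namespace URI before it can test a Choice. The following Python sketch shows one way to do that with the standard library's `xml.etree.ElementTree`, which discards prefix declarations unless they are captured through `start-ns` events during parsing. The helper names `parse_with_prefixes` and `choice_is_supported` are hypothetical, and the sketch assumes that no prefix is rebound to a different URI deeper in the document.

```python
import io
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS_NS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def parse_with_prefixes(xml_text):
    """Parse XML text; return (root element, dict of prefix -> namespace URI)."""
    prefixes = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload          # one xmlns:prefix="uri" declaration
            prefixes[prefix] = uri
        elif root is None:
            root = payload                 # the first "start" event is the root
    return root, prefixes


def choice_is_supported(choice, prefixes, supported_uris):
    """True if every prefix in Requires resolves to a namespace URI we support."""
    required = choice.get("Requires", "").split()
    return bool(required) and all(prefixes.get(p) in supported_uris for p in required)


# Sample document modelled on the example above (element contents omitted).
DOC = f"""<w:document xmlns:wps="{WPS_NS}" xmlns:mc="{MC_NS}" xmlns:w="{W_NS}">
  <w:body>
    <mc:AlternateContent>
      <mc:Choice Requires="wps"><w:drawing/></mc:Choice>
      <mc:Fallback><w:pict/></mc:Fallback>
    </mc:AlternateContent>
  </w:body>
</w:document>"""

root, prefixes = parse_with_prefixes(DOC)
choice = root.find(f".//{{{MC_NS}}}Choice")
supported_by_old_reader = choice_is_supported(choice, prefixes, {W_NS})
supported_by_new_reader = choice_is_supported(choice, prefixes, {W_NS, WPS_NS})
```

A prefix that was never declared resolves to `None` and so never matches, which makes the Choice unsupported rather than raising an error; a stricter reader might prefer to reject such a file.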

Since AlternateContent elements can occur anywhere in a file, a simple way to deal with them is to strip out all unsupported subtrees during or immediately after parsing. Assuming you know the set of namespaces whose elements you support, you can go through the file and, for each AlternateContent element, keep only the contents of the first Choice whose requirements you meet (or of the Fallback if none match), discarding the other branches along with the AlternateContent, Choice, and Fallback wrapper elements themselves. In the above example, assuming a reader that doesn't support the wordprocessingShape namespace, this would result in just a <w:pict> element.
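
The stripping pass described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a complete implementation: `select_branch` and `flatten_alternate_content` are hypothetical names, the prefix-to-URI map is written by hand here (in practice it would be collected while parsing), and the root element itself is assumed not to be an AlternateContent.

```python
import xml.etree.ElementTree as ET

MC = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"
WPS_NS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def select_branch(alt, prefixes, supported_uris):
    """Return the first matching Choice, else the Fallback, else None."""
    for branch in alt:
        if branch.tag == MC + "Choice":
            required = branch.get("Requires", "").split()
            if required and all(prefixes.get(p) in supported_uris for p in required):
                return branch
        elif branch.tag == MC + "Fallback":
            return branch
    return None


def flatten_alternate_content(parent, prefixes, supported_uris):
    """Replace every AlternateContent below parent with its selected branch's children."""
    pending = list(parent)
    kept = []
    while pending:
        child = pending.pop(0)
        if child.tag == MC + "AlternateContent":
            branch = select_branch(child, prefixes, supported_uris)
            if branch is not None:
                # Re-examine the promoted children: they may themselves
                # be AlternateContent elements.
                pending[0:0] = list(branch)
        else:
            flatten_alternate_content(child, prefixes, supported_uris)
            kept.append(child)
    parent[:] = kept


DOC = f"""<w:document xmlns:wps="{WPS_NS}"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:w="{W_NS}">
  <w:body>
    <mc:AlternateContent>
      <mc:Choice Requires="wps"><w:drawing/></mc:Choice>
      <mc:Fallback><w:pict/></mc:Fallback>
    </mc:AlternateContent>
  </w:body>
</w:document>"""

root = ET.fromstring(DOC)
# A reader that supports only the standard WordprocessingML namespace.
flatten_alternate_content(root, {"wps": WPS_NS}, supported_uris={W_NS})
body_tags = [child.tag for child in root[0]]
```

After this pass the rest of the reader never sees the markup compatibility elements at all, which is the main attraction of this approach: the branching is handled once, up front, and not at every place in the code where an AlternateContent might appear.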
