Difference between revisions of "OOXML/Markup Compatibility and Extensibility"

From Apache OpenOffice Wiki
(Initial version)
 
m (Cleanups)

Revision as of 13:58, 20 July 2014

Although the OOXML spec defines a specific set of allowed elements, Microsoft sometimes extends this set with proprietary elements that are specific to newer versions of Office. For example, if you insert a shape into a document in Word 2013, it will be defined in terms of a "word processing shape" element structure, which is not part of the OOXML spec. For compatibility with older versions of Word, however, Word also writes a second version of the shape that uses an element structure that is defined in the spec, albeit in the legacy VML drawing format.

Part 3 of the OOXML spec defines "Markup compatibility and extensibility", which is the means by which a document can include multiple versions of the same piece of content in different formats. The XML is structured in such a way that applications which know how to work with elements in particular namespaces can use the more information-rich version of the content, while those which support only those elements defined in the standard can use an alternate version of the content. The XML structure is conceptually like a series of if-then statements with an else clause at the end.

As an example, let's consider a shape object in a .docx file created in Word 2013. At the top of the file, we have the following namespace definitions:

<w:document
  xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
  xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

Then, in the body of the document, the shape is stored as follows:

<mc:AlternateContent>
  <mc:Choice Requires="wps">
    <w:drawing> ... </w:drawing>
  </mc:Choice>
  <mc:Fallback>
    <w:pict> ... </w:pict>
  </mc:Fallback>
</mc:AlternateContent>

The three markup compatibility elements used here are as follows:

  • AlternateContent - A container for a sequence of alternative representations of a given piece of content. The program reading the file should process only one of these: the first Choice whose conditions match or, failing that, the Fallback.
  • Choice - Essentially an if statement. This carries a Requires attribute, which lists (separated by whitespace) the namespaces that the reader must support in order to successfully process all of the descendant elements. Note that some of these may be defined in the spec and some may be proprietary; the attribute just means that in order to fully understand everything, all of the specified namespaces must be supported. Note also that these namespaces are referenced by prefix, not URI, so the reader has to resolve each prefix against the namespace declarations in scope.
  • Fallback - This is what the reader should look at if none of the preceding Choice elements match. This section is only supposed to contain elements that are part of the standard.
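
Because Requires names prefixes rather than URIs, a reader first has to turn those prefixes back into namespace URIs before it can compare them with the namespaces it supports. The following is a minimal sketch of that step using Python's standard xml.etree.ElementTree module. The function names collect_prefixes and choice_matches and the supported_uris parameter are made up for illustration, and the sketch assumes each prefix is declared only once in the file (as in the example above, where every declaration sits on the root element).

```python
import io
import xml.etree.ElementTree as ET


def collect_prefixes(xml_text):
    """Map each declared namespace prefix to its URI.

    Simplifying assumption: a prefix is never redeclared with a
    different URI further down the tree.
    """
    prefixes = {}
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield a (prefix, uri) pair for every xmlns declaration.
    for _event, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        prefixes[prefix] = uri
    return prefixes


def choice_matches(requires, prefixes, supported_uris):
    """Return True if every prefix listed in Requires maps to a supported URI."""
    return all(prefixes.get(p) in supported_uris for p in requires.split())
```

A reader that honours prefix scoping properly would need to track the declarations in effect at each Choice element rather than a single file-wide map.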

Since AlternateContent elements can potentially occur anywhere in a file, a simple way to deal with them is to strip out all unsupported subtrees during or immediately after parsing. Assuming you know the set of namespaces whose elements you support, you can go through the file and, for each AlternateContent element, select the first matching Choice (or the Fallback if none matches), then replace the whole AlternateContent element with the children of the selected branch. In the above example, assuming a reader that doesn't support the wordprocessingShape namespace, this would result in just a <w:pict> element.
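
The stripping pass described above could look roughly like the sketch below, again using Python's xml.etree.ElementTree. It is not taken from any existing implementation: flatten_alternate_content and its helpers are hypothetical names, prefixes are resolved with a single file-wide map (an assumption that holds for the example but not for files that redeclare prefixes), and whitespace or text directly inside AlternateContent is discarded.

```python
import io
import xml.etree.ElementTree as ET

MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
ALTERNATE = "{%s}AlternateContent" % MC
CHOICE = "{%s}Choice" % MC
FALLBACK = "{%s}Fallback" % MC


def _select(alt, prefixes, supported):
    """Pick the first matching Choice, else the Fallback, else None."""
    for branch in alt:
        if branch.tag == CHOICE:
            required = branch.get("Requires", "").split()
            if required and all(prefixes.get(p) in supported for p in required):
                return branch
        elif branch.tag == FALLBACK:
            return branch
    return None


def _expand(elem, prefixes, supported):
    """Return the list of elements that should stand in for elem."""
    if elem.tag == ALTERNATE:
        branch = _select(elem, prefixes, supported)
        kept = list(branch) if branch is not None else []
        result = []
        for child in kept:
            # The chosen branch may itself contain AlternateContent.
            result.extend(_expand(child, prefixes, supported))
        return result
    # Ordinary element: rebuild its child list from the expanded children.
    elem[:] = [r for c in list(elem) for r in _expand(c, prefixes, supported)]
    return [elem]


def flatten_alternate_content(xml_text, supported_uris):
    """Parse xml_text and replace every AlternateContent by one branch's content.

    Assumes the root element is not itself an AlternateContent element.
    """
    data = xml_text.encode("utf-8")
    # ElementTree drops prefixes while parsing, so collect them separately.
    prefixes = {p: u for _, (p, u) in
                ET.iterparse(io.BytesIO(data), events=["start-ns"])}
    root = ET.fromstring(data)
    _expand(root, prefixes, supported_uris)
    return root
```

Calling flatten_alternate_content on the example document with only the wordprocessingml namespace in supported_uris leaves the w:pict element in place of the AlternateContent element; adding the wordprocessingShape namespace to the set keeps the w:drawing element instead.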
