CalcParser

From Apache OpenOffice Wiki

This code's mission in life is to parse the XML of an OpenOffice spreadsheet. It still needs more work, but it runs on Python 2.3.4, which is the version included in the OpenOffice.org installation.

The code uses SAX (Simple API for XML), a parser API originally written for Java and later ported to other languages such as Python.

The first problem I faced was that the SAX library changed between Python versions: the research I did before writing this code was for version 2.4 and up, which was not compatible with 2.3.

Also, some of this code was written for the Excel XML file schema, and I needed to switch it to the OpenDocument spreadsheet XML.

The incompatibility was primarily with the handlers: the DefaultHandler I originally used was deprecated in this library, and handlers are provided by a separate sub-module called handler. The class to inherit from also changed from DefaultHandler to ContentHandler. Please check the comments in the code.

#!/usr/bin/env python
import sys
from xml.sax import saxutils #originally in python 2.4
from xml.sax import parse
from xml.sax import handler  # Python 2.3 uses the handler for contentHandler

The next and more interesting step was to plug in the proper tags. A SAX content handler is built around three methods by default: startElement, endElement and characters. In this script we create three lists for the different levels of the data structure within the content: chars, cells and rows.

    def startElement(self, name, atts):
        if name=="table:table-cell":    #we pick the table-cell tag
            self.chars=[]
        elif name=="table:table-row":   # and the rows
            self.cells=[]

characters is the method that receives the text content of the tags. startElement is called by the parser at each opening tag; here it reacts to table:table-cell and table:table-row, starting a fresh list for each.
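To see the order in which the parser fires these methods, the sketch below records every call for a tiny fragment. It is only an illustration under stated assumptions: the TraceHandler class and the SAMPLE fragment are hypothetical, and the block is written for a current Python (the b'' literals are not available in 2.3).

```python
from xml.sax import parseString
from xml.sax import handler

# Hypothetical fragment for illustration only; a real spreadsheet
# contains far more markup around its rows and cells.
SAMPLE = (
    b'<table:table xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" '
    b'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    b'<table:table-row>'
    b'<table:table-cell><text:p>A1</text:p></table:table-cell>'
    b'</table:table-row>'
    b'</table:table>'
)

class TraceHandler(handler.ContentHandler):
    """Records each call in the order the parser makes it."""
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))

tracer = TraceHandler()
parseString(SAMPLE, tracer)
for event in tracer.events:
    print(event)
```

Each cell produces a start event, one or more chars events, and an end event, which is why the script resets chars when a cell opens and joins it when the cell closes.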

#!/usr/bin/env python
import sys
from xml.sax import saxutils
from xml.sax import parse
from xml.sax import handler  # Python 2.3 uses the handler
 
# Replace DefaultHandler with ContentHandler 
# from the handler modules
class CalcHandler(handler.ContentHandler):
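    def __init__(self):
        handler.ContentHandler.__init__(self)  # initialise the base class first
        self.chars=[]   # text pieces of the current cell
        self.cells=[]   # cells of the current row
        self.rows=[]    # all completed rows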
 
    def characters(self, content):
        self.chars.append(content)
 
    def startElement(self, name, atts):
        if name=="table:table-cell":
            self.chars=[]
        elif name=="table:table-row":
            self.cells=[]
 
    def endElement(self, name):
        if name=="table:table-cell":
            self.cells.append(''.join(self.chars))
        elif name=="table:table-row":
            self.rows.append(self.cells)
 
calcHandler=CalcHandler()
parse(sys.argv[1], calcHandler)
print calcHandler.rows
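The script expects the path of an XML file as its first argument. An OpenDocument spreadsheet (.ods) is a zip archive whose cell data lives in a member called content.xml, so one possible way to feed the handler is to read that member and hand it to parseString. The following is a minimal sketch, assuming a current Python rather than 2.3.4; the helper name rows_from_ods and the SAMPLE data are hypothetical, and the handler is repeated only so the block runs on its own.

```python
import io
import zipfile
from xml.sax import parseString
from xml.sax import handler

class CalcHandler(handler.ContentHandler):
    """Same logic as the script above: collect cell text row by row."""
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.chars = []
        self.cells = []
        self.rows = []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, attrs):
        if name == "table:table-cell":
            self.chars = []
        elif name == "table:table-row":
            self.cells = []

    def endElement(self, name):
        if name == "table:table-cell":
            self.cells.append("".join(self.chars))
        elif name == "table:table-row":
            self.rows.append(self.cells)

def rows_from_ods(source):
    """Hypothetical helper: source is a path or file object of an .ods file."""
    archive = zipfile.ZipFile(source)
    try:
        content = archive.read("content.xml")
    finally:
        archive.close()
    calc_handler = CalcHandler()
    parseString(content, calc_handler)
    return calc_handler.rows

# Hypothetical stand-in data, packed into an in-memory archive so the
# sketch runs without a real spreadsheet on disk.
SAMPLE = (
    b'<table:table xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" '
    b'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    b'<table:table-row>'
    b'<table:table-cell><text:p>A1</text:p></table:table-cell>'
    b'<table:table-cell><text:p>B1</text:p></table:table-cell>'
    b'</table:table-row>'
    b'<table:table-row>'
    b'<table:table-cell><text:p>A2</text:p></table:table-cell>'
    b'<table:table-cell><text:p>B2</text:p></table:table-cell>'
    b'</table:table-row>'
    b'</table:table>'
)
buffer = io.BytesIO()
archive = zipfile.ZipFile(buffer, "w")
archive.writestr("content.xml", SAMPLE)
archive.close()
buffer.seek(0)

print(rows_from_ods(buffer))
```

With a real file, the call would be rows_from_ods with the path of the .ods file in place of the in-memory buffer.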