Difference between revisions of "CalcParser"

From Apache OpenOffice Wiki
[[Category:Python]]

Revision as of 04:29, 9 June 2007


This code's mission in life is to parse the XML of an OpenOffice.org spreadsheet. The code still needs more work. However, it works with Python 2.3.4, which is the version included in the OpenOffice.org installation.

The code uses SAX (Simple API for XML), a parser API originally written for Java but ported to other languages such as Python.
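To make the event-driven idea concrete, here is a minimal sketch of a SAX handler. It assumes a current Python 3 interpreter rather than the 2.3.4 discussed on this page, and the class name NameCollector and the sample XML are made up for illustration only.

```python
import xml.sax
from xml.sax import handler


class NameCollector(handler.ContentHandler):
    """Hypothetical handler that records the name of every opening tag."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        # Called by the parser each time it meets an opening tag.
        self.names.append(name)


# Made-up sample document, not real spreadsheet XML.
SAMPLE = b"<sheet><row><cell>1</cell><cell>2</cell></row></sheet>"

collector = NameCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.names)
```

The parser never builds a tree; it only calls back into the handler as it reads, which is why the handler has to keep its own state.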

The first thing I faced here was the change in the SAX library between versions: the original research, done before I created this code, was for version 2.4 and up, which wasn't compatible with 2.3.

Also, some of this code was written for the Excel XML file schema, and I needed to switch it to the OpenDocument spreadsheet XML.

The incompatibility problem was primarily with the handlers: the DefaultHandler originally used was deprecated in this library, and handlers are now provided by a separate sub-module called handler. The class name also changed from DefaultHandler to ContentHandler. Please check the comments in the code.

[python]

#!/bin/env python

import sys
from xml.sax import saxutils  # originally in Python 2.4
from xml.sax import parse
from xml.sax import handler   # Python 2.3 uses the handler for ContentHandler

The next, and more interesting, step was to insert the proper tags. By default SAX uses three callback methods: startElement, endElement and characters. In this script we created three lists for the different levels of the data structure within the content: chars, cells and rows.

[python]

   def startElement(self, name, atts):
       if name=="table:table-cell":    #we pick the table-cell tag
           self.chars=[]
       elif name=="table:table-row":   # and the rows
           self.cells=[]

characters is the callback that receives the text content of the tags. startElement is what the parser triggers on every opening tag; here it reacts to table:table-cell and table:table-row, starting a fresh chars list for each cell and a fresh cells list for each row.
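The order in which the callbacks fire can be seen with a small tracing handler. This is only a sketch for a current Python 3 interpreter; TraceHandler and the one-cell sample are hypothetical. A SAX parser is allowed to deliver the text of one element in several chunks, which is why the text is collected in a list and joined afterwards.

```python
import xml.sax
from xml.sax import handler


class TraceHandler(handler.ContentHandler):
    """Hypothetical handler that logs every callback in arrival order."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # May be called more than once for a single run of text.
        self.events.append(("chars", content))


# Made-up one-row, one-cell fragment using the OpenDocument table prefix.
SAMPLE = (
    b'<table:table-row'
    b' xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0">'
    b'<table:table-cell>Tom &amp; Jerry</table:table-cell>'
    b'</table:table-row>'
)

tracer = TraceHandler()
xml.sax.parseString(SAMPLE, tracer)
for event in tracer.events:
    print(event)

# Joining the pieces gives the full cell text however it was chunked.
cell_text = "".join(text for kind, text in tracer.events if kind == "chars")
print(cell_text)
```

How many "chars" events appear depends on the parser, so the script should rely only on the joined result.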

[python]

#!/bin/env python

import sys
from xml.sax import saxutils
from xml.sax import parse
from xml.sax import handler  # Python 2.3 uses the handler

# Replace DefaultHandler with ContentHandler
# from the handler modules

class CalcHandler(handler.ContentHandler):

   def __init__(self):
       self.chars=[]
       self.cells=[]
       self.rows=[]
       
   def characters(self, content):
       self.chars.append(content)
   def startElement(self, name, atts):
       if name=="table:table-cell":
           self.chars=[]
       elif name=="table:table-row":
           self.cells=[]
   
   def endElement(self, name):
       if name=="table:table-cell":
           # join the collected text pieces into one cell value
           self.cells.append("".join(self.chars))
       elif name=="table:table-row":
           self.rows.append(self.cells)

calcHandler=CalcHandler()
parse(sys.argv[1], calcHandler)
print calcHandler.rows
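The script above expects the path of an already extracted XML file. An OpenDocument spreadsheet (.ods) is a zip archive whose cell data lives in content.xml, so the same handler can be pointed at the archive directly. The sketch below assumes a current Python 3 interpreter; read_rows is a hypothetical helper, and the in-memory archive is a made-up stand-in, not a complete .ods file.

```python
import io
import zipfile
import xml.sax
from xml.sax import handler


class CalcHandler(handler.ContentHandler):
    """Same logic as the listing above, restated so the sketch runs alone."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.chars, self.cells, self.rows = [], [], []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, atts):
        if name == "table:table-cell":
            self.chars = []
        elif name == "table:table-row":
            self.cells = []

    def endElement(self, name):
        if name == "table:table-cell":
            self.cells.append("".join(self.chars))
        elif name == "table:table-row":
            self.rows.append(self.cells)


def read_rows(ods_file):
    """Hypothetical helper: parse content.xml inside an .ods archive."""
    calc_handler = CalcHandler()
    with zipfile.ZipFile(ods_file) as archive:
        xml.sax.parseString(archive.read("content.xml"), calc_handler)
    return calc_handler.rows


# Made-up sample data standing in for a real content.xml.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<table:table xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <table:table-row>
    <table:table-cell><text:p>name</text:p></table:table-cell>
    <table:table-cell><text:p>qty</text:p></table:table-cell>
  </table:table-row>
  <table:table-row>
    <table:table-cell><text:p>apple</text:p></table:table-cell>
    <table:table-cell><text:p>3</text:p></table:table-cell>
  </table:table-row>
</table:table>"""

# Build the stand-in archive in memory instead of touching the disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", SAMPLE)
buffer.seek(0)

rows = read_rows(buffer)
print(rows)
```

With a real spreadsheet, read_rows would be given the path to the .ods file instead of the in-memory buffer.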
